Author: hshreedharan
Date: Wed May 21 22:34:24 2014
New Revision: 1596704
URL: http://svn.apache.org/r1596704
Log:
Flume 1.5.0 release
Modified:
flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
flume/site/trunk/content/sphinx/FlumeUserGuide.rst
flume/site/trunk/content/sphinx/download.rst
flume/site/trunk/content/sphinx/index.rst
Modified: flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst
URL:
http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeDeveloperGuide.rst Wed May 21 22:34:24 2014
@@ -15,7 +15,7 @@
======================================
-Flume 1.4.0 Developer Guide
+Flume 1.5.0 Developer Guide
======================================
Introduction
@@ -166,7 +166,7 @@ RPC clients - Avro and Thrift
As of Flume 1.4.0, Avro is the default RPC protocol. The
``NettyAvroRpcClient`` and ``ThriftRpcClient`` implement the ``RpcClient``
interface. The client needs to create this object with the host and port of
-the target Flume agent, and canthen use the ``RpcClient`` to send data into
+the target Flume agent, and can then use the ``RpcClient`` to send data into
the agent. The following example shows how to use the Flume Client SDK API
within a user's data-generating application:
Modified: flume/site/trunk/content/sphinx/FlumeUserGuide.rst
URL:
http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/FlumeUserGuide.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/FlumeUserGuide.rst (original)
+++ flume/site/trunk/content/sphinx/FlumeUserGuide.rst Wed May 21 22:34:24 2014
@@ -15,7 +15,7 @@
======================================
-Flume 1.4.0 User Guide
+Flume 1.5.0 User Guide
======================================
Introduction
@@ -128,7 +128,7 @@ Setting up an agent
-------------------
Flume agent configuration is stored in a local configuration file. This is a
-text file which has a format follows the Java properties file format.
+text file that follows the Java properties file format.
Configurations for one or more agents can be specified in the same
configuration file. The configuration file includes properties of each source,
sink and channel in an agent and how they are wired together to form data
@@ -705,6 +705,8 @@ ssl false Set th
keystore -- This is the path to a Java keystore file.
Required for SSL.
keystore-password -- The password for the Java keystore. Required
for SSL.
keystore-type JKS The type of the Java keystore. This can be
"JKS" or "PKCS12".
+ipFilter false Set this to true to enable ipFiltering for
netty
+ipFilter.rules -- Define N netty ipFilter pattern rules with
this config.
================== ===========
===================================================
Example for agent named a1:
@@ -718,6 +720,21 @@ Example for agent named a1:
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
+Example of ipFilter.rules
+
+ipFilter.rules defines N netty ipFilters separated by commas. Each pattern rule
must be in this format.
+
+<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>
+or
+allow/deny:ip/name:pattern
+
+example: ipFilter.rules=allow:ip:127.*,allow:name:localhost,deny:ip:*
+
+Note that the first rule to match will apply, as the examples below show for a
client on the localhost.
+
+This will allow the client on localhost but deny clients from any other ip:
"allow:name:localhost,deny:ip:*"
+This will deny the client on localhost but allow clients from any other ip:
"deny:name:localhost,allow:ip:*"
+
Thrift Source
~~~~~~~~~~~~~
@@ -929,13 +946,29 @@ Property Name Default De
**spoolDir** -- The directory from which to read files
from.
fileSuffix .COMPLETED Suffix to append to completely ingested
files
deletePolicy never When to delete completed files:
``never`` or ``immediate``
-fileHeader false Whether to add a header storing the
filename
-fileHeaderKey file Header key to use when appending
filename to header
+fileHeader false Whether to add a header storing the
absolute path filename.
+fileHeaderKey file Header key to use when appending
absolute path filename to event header.
+basenameHeader false Whether to add a header storing the
basename of the file.
+basenameHeaderKey basename Header Key to use when appending
basename of file to event header.
ignorePattern ^$ Regular expression specifying which
files to ignore (skip)
trackerDir .flumespool Directory to store metadata related to
processing of files.
If this path is not an absolute path,
then it is interpreted as relative to the spoolDir.
+consumeOrder oldest In which order files in the spooling
directory will be consumed: ``oldest``,
+ ``youngest`` or ``random``. In case of
``oldest`` and ``youngest``, the last modified
+ time of the files will be used to
compare the files. In case of a tie, the file
+ with smallest lexicographical order will
be consumed first. In case of ``random`` any
+ file will be picked randomly. When using
``oldest`` and ``youngest`` the whole
+ directory will be scanned to pick the
oldest/youngest file, which might be slow if there
+ are a large number of files, while using
``random`` may cause old files to be consumed
+ very late if new files keep coming in
the spooling directory.
+maxBackoff 4000 The maximum time (in millis) to wait
between consecutive attempts to write to the channel(s) if the channel is full.
The source will start at a low backoff and increase it exponentially each time
the channel throws a ChannelException, up to the value specified by this
parameter.
batchSize 100 Granularity at which to batch transfer
to the channel
inputCharset UTF-8 Character set used by deserializers that
treat the input file as text.
+decodeErrorPolicy ``FAIL`` What to do when we see a non-decodable
character in the input file.
+ ``FAIL``: Throw an exception and fail to
parse the file.
+ ``REPLACE``: Replace the unparseable
character with the "replacement character" char,
+ typically Unicode U+FFFD.
+ ``IGNORE``: Drop the unparseable
character sequence.
deserializer ``LINE`` Specify the deserializer used to parse
the file into events.
Defaults to parsing each line as an
event. The class specified must implement
``EventDeserializer.Builder``.
@@ -960,6 +993,47 @@ Example for an agent named agent-1:
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true
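+
+A minimal sketch extending the example above with the options listed in the
+property table (``basenameHeader``, ``consumeOrder``, ``maxBackoff`` and
+``decodeErrorPolicy``). The values shown are illustrative assumptions, not
+recommended settings:
+
+.. code-block:: properties
+
+  # Store the file's basename in an event header in addition to the full path
+  agent-1.sources.src-1.basenameHeader = true
+  agent-1.sources.src-1.basenameHeaderKey = basename
+  # Consume files by last modified time, oldest first (the documented default)
+  agent-1.sources.src-1.consumeOrder = oldest
+  # Upper bound, in milliseconds, for the backoff when the channel is full
+  agent-1.sources.src-1.maxBackoff = 4000
+  # Substitute the replacement character for non-decodable input
+  agent-1.sources.src-1.decodeErrorPolicy = REPLACE
+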
+Twitter 1% firehose Source (experimental)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+ This source is highly experimental and may change between minor versions of
Flume.
+ Use at your own risk.
+
+Experimental source that connects via Streaming API to the 1% sample twitter
+firehose, continuously downloads tweets, converts them to Avro format and
+sends Avro events to a downstream Flume sink. Requires the consumer and
+access tokens and secrets of a Twitter developer account.
+Required properties are in **bold**.
+
+====================== ===========
===================================================
+Property Name Default Description
+====================== ===========
===================================================
+**channels** --
+**type** -- The component type name, needs to be
``org.apache.flume.source.twitter.TwitterSource``
+**consumerKey** -- OAuth consumer key
+**consumerSecret** -- OAuth consumer secret
+**accessToken** -- OAuth access token
+**accessTokenSecret** -- OAuth token secret
+maxBatchSize 1000 Maximum number of twitter messages to put
in a single batch
+maxBatchDurationMillis 1000 Maximum number of milliseconds to wait
before closing a batch
+====================== ===========
===================================================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+ a1.sources = r1
+ a1.channels = c1
+ a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
+ a1.sources.r1.channels = c1
+ a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
+ a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
+ a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
+ a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
+ a1.sources.r1.maxBatchSize = 10
+ a1.sources.r1.maxBatchDurationMillis = 200
+
Event Deserializers
'''''''''''''''''''
@@ -1107,6 +1181,8 @@ Property Name Default Descriptio
**host** -- Host name or IP address to bind to
**port** -- Port # to bind to
eventSize 2500 Maximum size of a single event line, in bytes
+keepFields false Setting this to true will preserve the Priority,
+ Timestamp and Hostname in the body of the event.
selector.type replicating or multiplexing
selector.* replicating Depends on the selector.type value
interceptors -- Space-separated list of interceptors
@@ -1143,6 +1219,8 @@ Property Name Default
**host** -- Host name or IP address to bind to.
**ports** -- Space-separated list (one or more) of
ports to bind to.
eventSize 2500 Maximum size of a single event line,
in bytes.
+keepFields false Setting this to true will preserve the
+ Priority, Timestamp and Hostname in
the body of the event.
portHeader -- If specified, the port number will be
stored in the header of each event using the header name specified here. This
allows for interceptors and channel selectors to customize routing logic based
on the incoming port.
charset.default UTF-8 Default character set used while
parsing syslog events into strings.
charset.port.<port> -- Character set is configurable on a
per-port basis.
@@ -1177,6 +1255,8 @@ Property Name Default Description
**type** -- The component type name, needs to be ``syslogudp``
**host** -- Host name or IP address to bind to
**port** -- Port # to bind to
+keepFields false Setting this to true will preserve the Priority,
+ Timestamp and Hostname in the body of the event.
selector.type replicating or multiplexing
selector.* replicating Depends on the selector.type value
interceptors -- Space-separated list of interceptors
@@ -1223,6 +1303,9 @@ selector.type replicating
selector.* Depends on the
selector.type value
interceptors -- Space-separated
list of interceptors
interceptors.*
+enableSSL false Set the property
to true to enable SSL
+keystore Location of the
keystore including keystore file name
+keystorePassword Keystore password
==================================================================================================================================
For example, a http source for agent named a1:
@@ -1397,7 +1480,7 @@ Scribe Source
Scribe is another type of ingest system. To adopt existing Scribe ingest
system,
Flume should use ScribeSource based on Thrift with compatible transfering
protocol.
-The deployment of Scribe please following guide from Facebook.
+For deployment of Scribe please follow the guide from Facebook.
Required properties are in **bold**.
============== =========== ==============================================
@@ -1514,6 +1597,13 @@ hdfs.roundValue 1 Ro
hdfs.roundUnit second The unit of the round down value -
``second``, ``minute`` or ``hour``.
hdfs.timeZone Local Time Name of the timezone that should be used
for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp false Use the local time (instead of the
timestamp from the event header) while replacing the escape sequences.
+hdfs.closeTries 0 Number of times the sink must try to
close a file. If set to 1, this sink will not re-try a failed close
+ (due to, for example, NameNode or
DataNode failure), and may leave the file in an open state with a .tmp
extension.
+ If set to 0, the sink will try to close
the file until the file is eventually closed
+ (there is no limit on the number of
times it would try).
+hdfs.retryInterval 180 Time in seconds between consecutive
attempts to close a file. Each close call costs multiple RPC round-trips to the
Namenode,
+ so setting this too low can cause a lot
of load on the name node. If set to 0 or less, the sink will not
+ attempt to close the file if the first
attempt fails, and may leave the file open or with a ".tmp" extension.
serializer ``TEXT`` Other possible options include
``avro_event`` or the
fully-qualified class name of an
implementation of the
``EventSerializer.Builder`` interface.
@@ -1569,25 +1659,26 @@ hostname / port pair. The events are tak
batches of the configured batch size.
Required properties are in **bold**.
-========================== =======
==============================================
+==========================
=====================================================
===========================================================================================
Property Name Default Description
-========================== =======
==============================================
+==========================
=====================================================
===========================================================================================
**channel** --
-**type** -- The component type name, needs to be
``avro``.
-**hostname** -- The hostname or IP address to bind to.
-**port** -- The port # to listen on.
-batch-size 100 number of event to batch together for
send.
-connect-timeout 20000 Amount of time (ms) to allow for the
first (handshake) request.
-request-timeout 20000 Amount of time (ms) to allow for
requests after the first.
-reset-connection-interval none Amount of time (s) before the connection
to the next hop is reset. This will force the Avro Sink to reconnect to the
next hop. This will allow the sink to connect to hosts behind a hardware
load-balancer when news hosts are added without having to restart the agent.
-compression-type none This can be "none" or "deflate". The
compression-type must match the compression-type of matching AvroSource
-compression-level 6 The level of compression to compress
event. 0 = no compression and 1-9 is compression. The higher the number the
more compression
-ssl false Set to true to enable SSL for this
AvroSink. When configuring SSL, you can optionally set a "truststore",
"truststore-password", "truststore-type", and specify whether to
"trust-all-certs".
-trust-all-certs false If this is set to true, SSL server
certificates for remote servers (Avro Sources) will not be checked. This should
NOT be used in production because it makes it easier for an attacker to execute
a man-in-the-middle attack and "listen in" on the encrypted connection.
-truststore -- The path to a custom Java truststore
file. Flume uses the certificate authority information in this file to
determine whether the remote Avro Source's SSL authentication credentials
should be trusted. If not specified, the default Java JSSE certificate
authority files (typically "jssecacerts" or "cacerts" in the Oracle JRE) will
be used.
-truststore-password -- The password for the specified
truststore.
-truststore-type JKS The type of the Java truststore. This
can be "JKS" or other supported Java truststore type.
-========================== =======
==============================================
+**type** --
The component type name, needs to be ``avro``.
+**hostname** --
The hostname or IP address to bind to.
+**port** --
The port # to listen on.
+batch-size 100
Number of events to batch together for send.
+connect-timeout 20000
Amount of time (ms) to allow for the first (handshake) request.
+request-timeout 20000
Amount of time (ms) to allow for requests after the first.
+reset-connection-interval none
Amount of time (s) before the connection to the next hop is reset. This
will force the Avro Sink to reconnect to the next hop. This will allow the sink
to connect to hosts behind a hardware load-balancer when new hosts are added
without having to restart the agent.
+compression-type none
This can be "none" or "deflate". The compression-type must match the
compression-type of matching AvroSource
+compression-level 6
The level of compression to compress event. 0 = no compression and 1-9 is
compression. The higher the number the more compression
+ssl false
Set to true to enable SSL for this AvroSink. When configuring SSL, you can
optionally set a "truststore", "truststore-password", "truststore-type", and
specify whether to "trust-all-certs".
+trust-all-certs false
If this is set to true, SSL server certificates for remote servers (Avro
Sources) will not be checked. This should NOT be used in production because it
makes it easier for an attacker to execute a man-in-the-middle attack and
"listen in" on the encrypted connection.
+truststore --
The path to a custom Java truststore file. Flume uses the certificate
authority information in this file to determine whether the remote Avro
Source's SSL authentication credentials should be trusted. If not specified,
the default Java JSSE certificate authority files (typically "jssecacerts" or
"cacerts" in the Oracle JRE) will be used.
+truststore-password --
The password for the specified truststore.
+truststore-type JKS
The type of the Java truststore. This can be "JKS" or other supported Java
truststore type.
+maxIoWorkers 2 * the number of available processors in the
machine The maximum number of I/O worker threads. This is configured on the
NettyAvroRpcClient NioClientSocketChannelFactory.
+==========================
=====================================================
===========================================================================================
Example for agent named a1:
@@ -1760,7 +1851,11 @@ Property Name Default
**type** --
The component type name, needs to be ``hbase``
**table** --
The name of the table in Hbase to write to.
**columnFamily** --
The column family in Hbase to write to.
+zookeeperQuorum --
The quorum spec. This is the value for the property ``hbase.zookeeper.quorum``
in hbase-site.xml
+znodeParent /hbase
The base path for the znode for the -ROOT- region. Value of
``zookeeper.znode.parent`` in hbase-site.xml
batchSize 100
Number of events to be written per txn.
+coalesceIncrements false
Should the sink coalesce multiple increments to a cell per batch. This might
give
+
better performance if there are multiple increments to a limited number of
cells.
serializer org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
Default increment column = "iCol", payload column = "pCol".
serializer.* --
Properties to be passed to the serializer.
kerberosPrincipal --
Kerberos user principal for accessing secure HBase
@@ -1783,30 +1878,32 @@ AsyncHBaseSink
''''''''''''''
This sink writes data to HBase using an asynchronous model. A class
implementing
-AsyncHbaseEventSerializer
-which is specified by the configuration is used to convert the events into
+AsyncHbaseEventSerializer which is specified by the configuration is used to
convert the events into
HBase puts and/or increments. These puts and increments are then written
-to HBase. This sink provides the same consistency guarantees as HBase,
+to HBase. This sink uses the `Asynchbase API
<https://github.com/OpenTSDB/asynchbase>`_ to write to
+HBase. This sink provides the same consistency guarantees as HBase,
which is currently row-wise atomicity. In the event of Hbase failing to
write certain events, the sink will replay all events in that transaction.
The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink.
Required properties are in **bold**.
-================ ============================================================
====================================================================================
-Property Name Default
Description
-================ ============================================================
====================================================================================
-**channel** --
-**type** --
The component type name, needs to be ``asynchbase``
-**table** --
The name of the table in Hbase to write to.
-zookeeperQuorum --
The quorum spec. This is the value for the property ``hbase.zookeeper.quorum``
in hbase-site.xml
-znodeParent /hbase
The base path for the znode for the -ROOT- region. Value of
``zookeeper.znode.parent`` in hbase-site.xml
-**columnFamily** --
The column family in Hbase to write to.
-batchSize 100
Number of events to be written per txn.
-timeout 60000
The length of time (in milliseconds) the sink waits for acks from hbase for
-
all events in a transaction.
-serializer org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
-serializer.* --
Properties to be passed to the serializer.
-================ ============================================================
====================================================================================
+===================
============================================================
====================================================================================
+Property Name Default
Description
+===================
============================================================
====================================================================================
+**channel** --
+**type** --
The component type name, needs to be ``asynchbase``
+**table** --
The name of the table in Hbase to write to.
+zookeeperQuorum --
The quorum spec. This is the value for the property
``hbase.zookeeper.quorum`` in hbase-site.xml
+znodeParent /hbase
The base path for the znode for the -ROOT- region. Value of
``zookeeper.znode.parent`` in hbase-site.xml
+**columnFamily** --
The column family in Hbase to write to.
+batchSize 100
Number of events to be written per txn.
+coalesceIncrements false
Should the sink coalesce multiple increments to a cell per batch. This
might give
+
better performance if there are multiple increments to a limited number of
cells.
+timeout 60000
The length of time (in milliseconds) the sink waits for acks from hbase for
+
all events in a transaction.
+serializer
org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
+serializer.* --
Properties to be passed to the serializer.
+===================
============================================================
====================================================================================
Note that this sink takes the Zookeeper Quorum and parent znode information in
the configuration. Zookeeper Quorum and parent node configuration may be
@@ -1835,7 +1932,7 @@ This sink extracts data from Flume event
This sink is well suited for use cases that stream raw data into HDFS (via the
HdfsSink) and simultaneously extract, transform and load the same data into
Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary
heterogeneous raw data from disparate data sources and turn it into a data
model that is useful to Search applications.
-The ETL functionality is customizable using a `morphline configuration file
<http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that
defines a chain of transformation commands that pipe event records from one
command to another.
+The ETL functionality is customizable using a `morphline configuration file
<http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html>`_ that
defines a chain of transformation commands that pipe event records from one
command to another.
Morphlines can be seen as an evolution of Unix pipelines where the data model
is generalized to work with streams of generic records, including arbitrary
binary payloads. A morphline command is a bit like a Flume Interceptor.
Morphlines can be embedded into Hadoop components such as Flume.
@@ -1915,7 +2012,10 @@ indexType logs
clusterName elasticsearch
Name of the ElasticSearch cluster to connect to
batchSize 100
Number of events to be written per txn.
ttl --
TTL in days, when set will cause the expired documents to be
deleted automatically,
-
if not set documents will never be automatically deleted
+
if not set documents will never be automatically deleted. TTL is
accepted both in the earlier form of
+
integer only e.g. a1.sinks.k1.ttl = 5 and also with a qualifier ms
(millisecond), s (second), m (minute),
+
h (hour), d (day) and w (week). Example a1.sinks.k1.ttl = 5d will
set TTL to 5 days. Follow
+
http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for
more information.
serializer
org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer The
ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use.
Implementations of
either class are accepted but
ElasticSearchIndexRequestBuilderFactory is preferred.
serializer.* --
Properties to be passed to the serializer.
@@ -1933,10 +2033,50 @@ Example for agent named a1:
a1.sinks.k1.indexType = bar_type
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
- a1.sinks.k1.ttl = 5
+ a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer =
org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
a1.sinks.k1.channel = c1
+Kite Dataset Sink (experimental)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+ This sink is experimental and may change between minor versions of Flume.
+ Use at your own risk.
+
+Experimental sink that writes events to a `Kite Dataset
<http://kitesdk.org/docs/current/kite-data/guide.html>`_.
+This sink will deserialize the body of each incoming event and store the
+resulting record in a Kite Dataset. It determines the target Dataset by opening a
+repository URI, ``kite.repo.uri``, and loading a Dataset by name,
+``kite.dataset.name``.
+
+The only supported serialization is avro, and the record schema must be passed
+in the event headers, using either ``flume.avro.schema.literal`` with the JSON
+schema representation or ``flume.avro.schema.url`` with a URL where the schema
+may be found (``hdfs:/...`` URIs are supported). This is compatible with the
+Log4jAppender flume client and the spooling directory source's Avro
+deserializer using ``deserializer.schemaType = LITERAL``.
+
+Note 1: The ``flume.avro.schema.hash`` header is **not supported**.
+Note 2: In some cases, file rolling may occur slightly after the roll interval
+has been exceeded. However, this delay will not exceed 5 seconds. In most
+cases, the delay is negligible.
+
+======================= =======
===========================================================
+Property Name Default Description
+======================= =======
===========================================================
+**channel** --
+**type** -- Must be
org.apache.flume.sink.kite.DatasetSink
+**kite.repo.uri** -- URI of the repository to open
+**kite.dataset.name** -- Name of the Dataset where records will be
written
+kite.batchSize 100 Number of records to process in each batch
+kite.rollInterval 30 Maximum wait time (seconds) before data
files are released
+auth.kerberosPrincipal -- Kerberos user principal for secure
authentication to HDFS
+auth.kerberosKeytab -- Kerberos keytab location (local FS) for the
principal
+auth.proxyUser -- The effective user for HDFS actions, if
different from
+ the kerberos principal
+======================= =======
===========================================================
+
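+A minimal sketch of a Kite Dataset sink for an agent named a1, using only the
+properties from the table above. The agent, sink and channel names, the
+repository URI and the dataset name are placeholder assumptions; substitute
+the values for your own repository:
+
+.. code-block:: properties
+
+  a1.sinks = k1
+  a1.channels = c1
+  a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
+  a1.sinks.k1.channel = c1
+  # Hypothetical repository URI and dataset name
+  a1.sinks.k1.kite.repo.uri = repo:hdfs://namenode.example.com:8020/data/flume
+  a1.sinks.k1.kite.dataset.name = events
+  # Documented defaults, shown explicitly
+  a1.sinks.k1.kite.batchSize = 100
+  a1.sinks.k1.kite.rollInterval = 30
+
+Incoming events still need to carry the record schema in the
+``flume.avro.schema.literal`` or ``flume.avro.schema.url`` header, as
+described above.
+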
Custom Sink
~~~~~~~~~~~
@@ -2059,15 +2199,13 @@ Property Name Default
checkpointDir
~/.flume/file-channel/checkpoint The directory where checkpoint file will be
stored
useDualCheckpoints false
Backup the checkpoint. If this is set to ``true``, ``backupCheckpointDir``
**must** be set
backupCheckpointDir --
The directory where the checkpoint is backed up to. This directory **must
not** be the same as the data directories or the checkpoint directory
-dataDirs ~/.flume/file-channel/data
The directory where log files will be stored
-transactionCapacity 1000
The maximum size of transaction supported by the channel
+dataDirs ~/.flume/file-channel/data
Comma separated list of directories for storing log files. Using multiple
directories on separate disks can improve file channel performance
+transactionCapacity 10000
The maximum size of transaction supported by the channel
checkpointInterval 30000
Amount of time (in millis) between checkpoints
maxFileSize 2146435071
Max size (in bytes) of a single log file
minimumRequiredSpace 524288000
Minimum Required free space (in bytes). To avoid data corruption, File
Channel stops accepting take/put requests when free space drops below this value
capacity 1000000
Maximum capacity of the channel
keep-alive 3
Amount of time (in sec) to wait for a put operation
-write-timeout 3
Amount of time (in sec) to wait for a write operation
-checkpoint-timeout 600
Expert: Amount of time (in sec) to wait for a checkpoint
use-log-replay-v1 false
Expert: Use old replay logic
use-fast-replay false
Expert: Replay without using queue
encryption.activeKey --
Key name used to encrypt new data
@@ -2155,6 +2293,80 @@ The same scenerio as above, however key-
a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile =
/path/to/key-0.password
+Spillable Memory Channel
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The events are stored in an in-memory queue and on disk. The in-memory queue
serves as the primary store and the disk as overflow.
+The disk store is managed using an embedded File channel. When the in-memory
queue is full, additional incoming events are stored in
+the file channel. This channel is ideal for flows that need high throughput of
memory channel during normal operation, but at the
+same time need the larger capacity of the file channel for better tolerance of
intermittent sink side outages or drop in drain rates.
+The throughput will reduce approximately to file channel speeds during such
abnormal situations. In case of an agent crash or restart,
+only the events stored on disk are recovered when the agent comes online.
**This channel is currently experimental and
+not recommended for use in production.**
+
+Required properties are in **bold**. Please refer to file channel for
additional required properties.
+
+============================ ================
=============================================================================================
+Property Name Default Description
+============================ ================
=============================================================================================
+**type** -- The component type name, needs
to be ``SPILLABLEMEMORY``
+memoryCapacity 10000 Maximum number of events
stored in memory queue. To disable use of in-memory queue, set this to zero.
+overflowCapacity 100000000 Maximum number of events
stored in overflow disk (i.e File channel). To disable use of overflow, set
this to zero.
+overflowTimeout 3 The number of seconds to wait
before enabling disk overflow when memory fills up.
+byteCapacityBufferPercentage 20 Defines the percent of buffer
between byteCapacity and the estimated total size
+ of all events in the channel,
to account for data in headers. See below.
+byteCapacity see description Maximum **bytes** of memory
allowed as a sum of all events in the memory queue.
+ The implementation only counts
the Event ``body``, which is the reason for
+ providing the
``byteCapacityBufferPercentage`` configuration parameter as well.
+ Defaults to a computed value
equal to 80% of the maximum memory available to
+ the JVM (i.e. 80% of the -Xmx
value passed on the command line).
+ Note that if you have multiple
memory channels on a single JVM, and they happen
+ to hold the same physical
events (i.e. if you are using a replicating channel
+ selector from a single source)
then those event sizes may be double-counted for
+ channel byteCapacity purposes.
+ Setting this value to ``0``
will cause this value to fall back to a hard
+ internal limit of about 200 GB.
+avgEventSize 500 Estimated average size of
events, in bytes, going into the channel
+<file channel properties> see file channel Any file channel property with
the exception of 'keep-alive' and 'capacity' can be used.
+ The keep-alive of file channel
is managed by Spillable Memory Channel. Use 'overflowCapacity'
+ to set the File channel's
capacity.
+============================ ================
=============================================================================================
+
+The in-memory queue is considered full if either the memoryCapacity or the
byteCapacity limit is reached.
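
The either-limit rule can be sketched in a few lines of Python. This is only an illustration of the stated condition, not Flume's implementation: the function name ``memory_queue_is_full`` and its arguments are hypothetical, and the sketch ignores ``byteCapacityBufferPercentage``. The limits are the ones used in the a1 example (memoryCapacity = 10000, byteCapacity = 800000).

.. code-block:: python

    def memory_queue_is_full(event_count, body_bytes, memory_capacity, byte_capacity):
        """Return True once either limit is reached (illustrative only)."""
        return event_count >= memory_capacity or body_bytes >= byte_capacity

    # Below both limits: the queue still accepts events.
    print(memory_queue_is_full(9999, 500000, 10000, 800000))   # False
    # Event-count limit reached, byte limit not reached.
    print(memory_queue_is_full(10000, 500000, 10000, 800000))  # True
    # Byte limit reached with only a few large events queued.
    print(memory_queue_is_full(1200, 800000, 10000, 800000))   # True
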
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+ a1.channels = c1
+ a1.channels.c1.type = SPILLABLEMEMORY
+ a1.channels.c1.memoryCapacity = 10000
+ a1.channels.c1.overflowCapacity = 1000000
+ a1.channels.c1.byteCapacity = 800000
+ a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
+ a1.channels.c1.dataDirs = /mnt/flume/data
+
+To disable the use of the in-memory queue and function like a file channel:
+
+.. code-block:: properties
+
+ a1.channels = c1
+ a1.channels.c1.type = SPILLABLEMEMORY
+ a1.channels.c1.memoryCapacity = 0
+ a1.channels.c1.overflowCapacity = 1000000
+ a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
+ a1.channels.c1.dataDirs = /mnt/flume/data
+
+
+To disable the use of overflow disk and function purely as an in-memory channel:
+
+.. code-block:: properties
+
+ a1.channels = c1
+ a1.channels.c1.type = SPILLABLEMEMORY
+ a1.channels.c1.memoryCapacity = 100000
+
+
Pseudo Transaction Channel
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2595,7 +2807,7 @@ prefix "" The prefix st
Morphline Interceptor
~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This interceptor filters the events through a `morphline configuration file
<http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that
defines a chain of transformation commands that pipe records from one command
to another.
+This interceptor filters the events through a `morphline configuration file
<http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html>`_ that
defines a chain of transformation commands that pipe records from one command
to another.
For example the morphline can ignore certain events or alter or insert certain
event headers via regular expression based pattern matching, or it can
auto-detect and set a MIME type via Apache Tika on events that are intercepted.
For example, this kind of packet sniffing can be used for content based dynamic
routing in a Flume topology.
MorphlineInterceptor can also help to implement dynamic routing to multiple
Apache Solr collections (e.g. for multi-tenancy).
@@ -2671,11 +2883,11 @@ If the Flume event body contained ``1:2:
.. code-block:: properties
- agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
- agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
- agent.sources.r1.interceptors.i1.serializers.s1.name = one
- agent.sources.r1.interceptors.i1.serializers.s2.name = two
- agent.sources.r1.interceptors.i1.serializers.s3.name = three
+ a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
+ a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
+ a1.sources.r1.interceptors.i1.serializers.s1.name = one
+ a1.sources.r1.interceptors.i1.serializers.s2.name = two
+ a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will
have been added ``one=>1, two=>2, three=>3``
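
The mapping from capture groups to headers can be tried outside Flume with a short Python sketch. It is not the interceptor's code: the helper ``extract_headers`` and the sample body ``1:2:3`` are assumptions chosen for illustration, and the sketch assumes the pattern may match anywhere in the body.

.. code-block:: python

    import re

    # Same pattern as in the configuration above. The properties file doubles
    # each backslash, so (\\d) there corresponds to (\d) here.
    PATTERN = re.compile(r"(\d):(\d):(\d)")
    HEADER_NAMES = ["one", "two", "three"]  # names given to serializers s1, s2, s3

    def extract_headers(body):
        """Map each capture group to its header name; empty dict if no match."""
        match = PATTERN.search(body)
        if match is None:
            return {}
        return dict(zip(HEADER_NAMES, match.groups()))

    print(extract_headers("1:2:3"))  # {'one': '1', 'two': '2', 'three': '3'}

A body that does not match the pattern yields no headers in this sketch.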
@@ -2686,11 +2898,11 @@ If the Flume event body contained ``2012
.. code-block:: properties
- agent.sources.r1.interceptors.i1.regex =
^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
- agent.sources.r1.interceptors.i1.serializers = s1
- agent.sources.r1.interceptors.i1.serializers.s1.type =
org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
- agent.sources.r1.interceptors.i1.serializers.s1.name = timestamp
- agent.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm
+ a1.sources.r1.interceptors.i1.regex =
^(?:\\n)?(\\d\\d\\d\\d-\\d\\d-\\d\\d\\s\\d\\d:\\d\\d)
+ a1.sources.r1.interceptors.i1.serializers = s1
+ a1.sources.r1.interceptors.i1.serializers.s1.type =
org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
+ a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
+ a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm
The extracted event will contain the same body but the following headers will
have been added ``timestamp=>1350611220000``
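
The header value is a count of milliseconds since the Unix epoch. As a rough check, the Python sketch below decodes it and shows the reverse conversion. It is a sketch under stated assumptions: ``to_epoch_millis`` is a hypothetical helper, the ``strptime`` format is a hand-written equivalent of the ``yyyy-MM-dd HH:mm`` pattern, and the input string is expressed in UTC for the sake of the example. How the serializer itself treats time zones is not shown here.

.. code-block:: python

    from datetime import datetime, timezone

    def to_epoch_millis(text, tz):
        """Parse 'YYYY-MM-DD HH:MM' as a time in tz and return epoch milliseconds."""
        parsed = datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=tz)
        return int(parsed.timestamp() * 1000)

    millis = 1350611220000  # header value from the example above
    print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat())
    # 2012-10-19T01:47:00+00:00

    print(to_epoch_millis("2012-10-19 01:47", timezone.utc))  # 1350611220000
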
@@ -2731,21 +2943,21 @@ Log4J Appender
Appends Log4j events to a flume agent's avro source. A client using this
appender must have the flume-ng-sdk in the classpath (e.g.,
-flume-ng-sdk-1.4.0.jar).
+flume-ng-sdk-1.5.0.jar).
Required properties are in **bold**.
-===================== =======
==============================================================
+===================== =======
==================================================================================
Property Name Default Description
-===================== =======
==============================================================
+===================== =======
==================================================================================
**Hostname** -- The hostname on which a remote Flume agent is
running with an
avro source.
**Port** -- The port at which the remote Flume agent's
avro source is
listening.
UnsafeMode false If true, the appender will not throw
exceptions on failure to
send the events.
-AvroReflectionEnabled false Use Avro Reflection to serialize Log4j events.
+AvroReflectionEnabled false Use Avro Reflection to serialize Log4j events.
(Do not use when users log strings)
AvroSchemaUrl -- A URL from which the Avro schema can be
retrieved.
-===================== =======
==============================================================
+===================== =======
==================================================================================
Sample log4j.properties file:
@@ -2795,7 +3007,7 @@ Load Balancing Log4J Appender
Appends Log4j events to a list of Flume agents' Avro sources. A client using
this
appender must have the flume-ng-sdk in the classpath (e.g.,
-flume-ng-sdk-1.4.0.jar). This appender supports a round-robin and random
+flume-ng-sdk-1.5.0.jar). This appender supports a round-robin and random
scheme for performing the load balancing. It also supports a configurable
backoff
timeout so that down agents are removed temporarily from the set of hosts
Required properties are in **bold**.
@@ -2883,9 +3095,9 @@ and can be specified in the flume-env.sh
Property Name Default Description
======================= =======
=====================================================================================
**type** -- The component type name, has to be
``ganglia``
-**hosts** -- Comma-separated list of ``hostname:port``
-pollInterval 60 Time, in seconds, between consecutive
reporting to ganglia server
-isGanglia3 false Ganglia server version is 3. By default,
Flume sends in ganglia 3.1 format
+**hosts** -- Comma-separated list of ``hostname:port`` of
Ganglia servers
+pollFrequency 60 Time, in seconds, between consecutive
reports to the Ganglia server
+isGanglia3 false Ganglia server version is 3. By default,
Flume sends in Ganglia 3.1 format
======================= =======
=====================================================================================
We can start Flume with Ganglia support as follows::
@@ -2936,7 +3148,7 @@ Property Name Default Descri
port 41414 The port to start the server on.
======================= =======
=====================================================================================
-We can start Flume with Ganglia support as follows::
+We can start Flume with JSON Reporting support as follows::
$ bin/flume-ng agent --conf-file example.conf --name a1
-Dflume.monitoring.type=http -Dflume.monitoring.port=34545
Modified: flume/site/trunk/content/sphinx/download.rst
URL:
http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/download.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/download.rst (original)
+++ flume/site/trunk/content/sphinx/download.rst Wed May 21 22:34:24 2014
@@ -12,8 +12,8 @@ originals on the main distribution serve
:header: "", "Mirrors", "Checksum", "Signature"
:widths: 25, 25, 25, 25
- "Apache Flume binary (tar.gz)", `apache-flume-1.4.0-bin.tar.gz
<http://www.apache.org/dyn/closer.cgi/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz>`_,
`apache-flume-1.4.0-bin.tar.gz.md5
<http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz.md5>`_,
`apache-flume-1.4.0-bin.tar.gz.asc
<http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz.asc>`_
- "Apache Flume source (tar.gz)", `apache-flume-1.4.0-src.tar.gz
<http://www.apache.org/dyn/closer.cgi/flume/1.4.0/apache-flume-1.4.0-src.tar.gz>`_,
`apache-flume-1.4.0-src.tar.gz.md5
<http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-src.tar.gz.md5>`_,
`apache-flume-1.4.0-src.tar.gz.asc
<http://www.us.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-src.tar.gz.asc>`_
+ "Apache Flume binary (tar.gz)", `apache-flume-1.5.0-bin.tar.gz
<http://www.apache.org/dyn/closer.cgi/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz>`_,
`apache-flume-1.5.0-bin.tar.gz.md5
<http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz.md5>`_,
`apache-flume-1.5.0-bin.tar.gz.asc
<http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz.asc>`_
+ "Apache Flume source (tar.gz)", `apache-flume-1.5.0-src.tar.gz
<http://www.apache.org/dyn/closer.cgi/flume/1.5.0/apache-flume-1.5.0-src.tar.gz>`_,
`apache-flume-1.5.0-src.tar.gz.md5
<http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-src.tar.gz.md5>`_,
`apache-flume-1.5.0-src.tar.gz.asc
<http://www.us.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-src.tar.gz.asc>`_
It is essential that you verify the integrity of the downloaded files using
the PGP or MD5 signatures. Please read
`Verifying Apache HTTP Server Releases
<http://httpd.apache.org/dev/verification.html>`_ for more information on
@@ -25,9 +25,9 @@ as well as the asc signature file for th
Then verify the signatures using::
% gpg --import KEYS
- % gpg --verify apache-flume-1.4.0-src.tar.gz.asc
+ % gpg --verify apache-flume-1.5.0-src.tar.gz.asc
-Apache Flume 1.4.0 is signed by Mike Percy 66F2054B
+Apache Flume 1.5.0 is signed by Hari Shreedharan 77FFC9AB
Alternatively, you can verify the MD5 or SHA1 signatures of the files. A
program called md5, md5sum, or shasum is included in many
Unix distributions for this purpose.
Modified: flume/site/trunk/content/sphinx/index.rst
URL:
http://svn.apache.org/viewvc/flume/site/trunk/content/sphinx/index.rst?rev=1596704&r1=1596703&r2=1596704&view=diff
==============================================================================
--- flume/site/trunk/content/sphinx/index.rst (original)
+++ flume/site/trunk/content/sphinx/index.rst Wed May 21 22:34:24 2014
@@ -33,6 +33,36 @@ application.
.. raw:: html
+ <h3>May 20, 2014 - Apache Flume 1.5.0 Released</h3>
+
+The Apache Flume team is pleased to announce the release of Flume 1.5.0.
+
+Flume is a distributed, reliable, and available service for efficiently
+collecting, aggregating, and moving large amounts of streaming event data.
+
+Version 1.5.0 is the fifth Flume release as an Apache top-level project.
+Flume 1.5.0 is stable, production-ready software, and is backwards-compatible
+with previous versions of the Flume 1.x codeline.
+
+Several months of active development went into this release: 123 patches were
committed since 1.4.0, representing many features, enhancements, and bug fixes.
While the full change log can be found on the 1.5.0 release page (link below),
here are a few new feature highlights:
+
+* New in-memory channel that can spill to disk
+* A new dataset sink that uses the Kite API to write data to HDFS and HBase
+* Support for Elastic Search HTTP API in Elastic Search Sink
+* Much faster replay in the File Channel
+
+The full change log and documentation are available on the
+`Flume 1.5.0 release page <releases/1.5.0.html>`__.
+
+This release can be downloaded from the Flume `Download <download.html>`__
page.
+
+Your contributions, feedback, help and support make Flume better!
+For more information on how to report problems or contribute,
+please visit our `Get Involved <getinvolved.html>`__ page.
+
+The Apache Flume Team
+
+
<h3>July 2, 2013 - Apache Flume 1.4.0 Released</h3>
The Apache Flume team is pleased to announce the release of Flume 1.4.0.