Modified: websites/staging/flume/trunk/content/FlumeUserGuide.html
==============================================================================
--- websites/staging/flume/trunk/content/FlumeUserGuide.html (original)
+++ websites/staging/flume/trunk/content/FlumeUserGuide.html Mon Jun 1
19:50:47 2015
@@ -7,7 +7,7 @@
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- <title>Flume 1.5.2 User Guide — Apache Flume</title>
+ <title>Flume 1.6.0 User Guide — Apache Flume</title>
<link rel="stylesheet" href="_static/flume.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
@@ -26,7 +26,7 @@
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="top" title="Apache Flume" href="index.html" />
<link rel="up" title="Documentation" href="documentation.html" />
- <link rel="next" title="Flume 1.5.2 Developer Guide"
href="FlumeDeveloperGuide.html" />
+ <link rel="next" title="Flume 1.6.0 Developer Guide"
href="FlumeDeveloperGuide.html" />
<link rel="prev" title="Documentation" href="documentation.html" />
</head>
<body>
@@ -59,8 +59,8 @@
<div class="bodywrapper">
<div class="body">
- <div class="section" id="flume-1-5-2-user-guide">
-<h1>Flume 1.5.2 User Guide<a class="headerlink" href="#flume-1-5-2-user-guide"
title="Permalink to this headline">¶</a></h1>
+ <div class="section" id="flume-1-6-0-user-guide">
+<h1>Flume 1.6.0 User Guide<a class="headerlink" href="#flume-1-6-0-user-guide"
title="Permalink to this headline">¶</a></h1>
<div class="section" id="introduction">
<h2>Introduction<a class="headerlink" href="#introduction" title="Permalink to
this headline">¶</a></h2>
<div class="section" id="overview">
@@ -248,6 +248,42 @@ OK</pre>
</div>
<p>Congratulations - you’ve successfully configured and deployed a Flume
agent! Subsequent sections cover agent configuration in much more detail.</p>
</div>
+<div class="section" id="zookeeper-based-configuration">
+<h4>Zookeeper based Configuration<a class="headerlink"
href="#zookeeper-based-configuration" title="Permalink to this
headline">¶</a></h4>
+<p>Flume supports Agent configurations via Zookeeper. <em>This is an
experimental feature.</em> The configuration file needs to be uploaded
+to Zookeeper, under a configurable prefix. The configuration file is
stored in the Zookeeper node data.
+Following is how the Zookeeper node tree would look for agents a1 and
a2</p>
+<div class="highlight-properties"><pre>- /flume
+ |- /a1 [Agent config file]
+ |- /a2 [Agent config file]</pre>
+</div>
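+<p>For reference, one possible way to get the configuration file into
+Zookeeper is via the Zookeeper CLI (the <tt class="docutils literal"><span
class="pre">zkCli.sh</span></tt> invocation and the flume.conf file name
+below are illustrative, not something Flume ships or mandates):</p>
+<blockquote>
+<div>$ zkCli.sh -server zkhost:2181 create /flume/a1 "$(cat
flume.conf)"</div></blockquote>
+<p>Note that the parent node (/flume here) must already exist before the
+agent node can be created under it.</p>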
+<p>Once the configuration file is uploaded, start the agent with the
following options</p>
+<blockquote>
+<div>$ bin/flume-ng agent --conf conf -z zkhost:2181,zkhost1:2181 -p
/flume --name a1 -Dflume.root.logger=INFO,console</div></blockquote>
+<table border="1" class="docutils">
+<colgroup>
+<col width="17%" />
+<col width="15%" />
+<col width="68%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Argument Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>z</strong></td>
+<td>–</td>
+<td>Zookeeper connection string. Comma separated list of hostname:port</td>
+</tr>
+<tr class="row-odd"><td><strong>p</strong></td>
+<td>/flume</td>
+<td>Base Path in Zookeeper to store Agent configurations</td>
+</tr>
+</tbody>
+</table>
+</div>
<div class="section" id="installing-third-party-plugins">
<h4>Installing third-party plugins<a class="headerlink"
href="#installing-third-party-plugins" title="Permalink to this
headline">¶</a></h4>
<p>Flume has a fully plugin-based architecture. While Flume ships with many
@@ -726,7 +762,7 @@ Required properties are in <strong>bold<
<td>false</td>
<td>Set this to true to enable ipFiltering for netty</td>
</tr>
-<tr class="row-even"><td>ipFilter.rules</td>
+<tr class="row-even"><td>ipFilterRules</td>
<td>–</td>
<td>Define N netty ipFilter pattern rules with this config.</td>
</tr>
@@ -741,12 +777,12 @@ Required properties are in <strong>bold<
<span class="na">a1.sources.r1.port</span> <span class="o">=</span> <span
class="s">4141</span>
</pre></div>
</div>
-<p>Example of ipFilter.rules</p>
-<p>ipFilter.rules defines N netty ipFilters separated by a comma a pattern
rule must be in this format.</p>
+<p>Example of ipFilterRules</p>
+<p>ipFilterRules defines N netty ipFilters, separated by a comma. A pattern
rule must be in this format.</p>
<p><’allow’ or ’deny’>:<’ip’ or
‘name’ for computer name>:<pattern>
or
allow/deny:ip/name:pattern</p>
-<p>example: ipFilter.rules=allow:ip:127.*,allow:name:localhost,deny:ip:*</p>
+<p>example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*</p>
<p>Note that the first rule to match will apply, as the example below shows
for a client on the localhost</p>
<p>This will allow the client on localhost and deny clients from any other ip:
“allow:name:localhost,deny:ip:*”
This will deny the client on localhost and allow clients from any other ip:
“deny:name:localhost,allow:ip:*”</p>
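+<p>Putting the rules together, an Avro source that accepts only local
+connections could be configured as follows (the agent and source names are
+illustrative):</p>
+<div class="highlight-properties"><pre>a1.sources.r1.type = avro
+a1.sources.r1.bind = 0.0.0.0
+a1.sources.r1.port = 4141
+a1.sources.r1.ipFilter = true
+a1.sources.r1.ipFilterRules = allow:name:localhost,deny:ip:*</pre>
+</div>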
@@ -756,12 +792,15 @@ This will deny the client on localhost b
<p>Listens on Thrift port and receives events from external Thrift client
streams.
When paired with the built-in ThriftSink on another (previous hop) Flume agent,
it can create tiered collection topologies.
+The Thrift source can be configured to start in secure mode by enabling
kerberos authentication.
+agent-principal and agent-keytab are the properties used by the
+Thrift source to authenticate to the kerberos KDC.
Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="23%" />
-<col width="14%" />
-<col width="64%" />
+<col width="5%" />
+<col width="3%" />
+<col width="91%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -806,6 +845,38 @@ Required properties are in <strong>bold<
<td> </td>
<td> </td>
</tr>
+<tr class="row-odd"><td>ssl</td>
+<td>false</td>
+<td>Set this to true to enable SSL encryption. You must also specify a
“keystore” and a “keystore-password”.</td>
+</tr>
+<tr class="row-even"><td>keystore</td>
+<td>–</td>
+<td>This is the path to a Java keystore file. Required for SSL.</td>
+</tr>
+<tr class="row-odd"><td>keystore-password</td>
+<td>–</td>
+<td>The password for the Java keystore. Required for SSL.</td>
+</tr>
+<tr class="row-even"><td>keystore-type</td>
+<td>JKS</td>
+<td>The type of the Java keystore. This can be “JKS” or
“PKCS12”.</td>
+</tr>
+<tr class="row-odd"><td>exclude-protocols</td>
+<td>SSLv3</td>
+<td>Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be
excluded in addition to the protocols specified.</td>
+</tr>
+<tr class="row-even"><td>kerberos</td>
+<td>false</td>
+<td>Set to true to enable kerberos authentication. In kerberos mode,
agent-principal and agent-keytab are required for successful authentication.
The Thrift source in secure mode will accept connections only from Thrift
clients that have kerberos enabled and are successfully authenticated to the
kerberos KDC.</td>
+</tr>
+<tr class="row-odd"><td>agent-principal</td>
+<td>–</td>
+<td>The kerberos principal used by the Thrift Source to authenticate to the
kerberos KDC.</td>
+</tr>
+<tr class="row-even"><td>agent-keytab</td>
+<td>–</td>
+<td>The keytab location used by the Thrift Source in combination with the
agent-principal to authenticate to the kerberos KDC.</td>
+</tr>
</tbody>
</table>
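+<p>As an illustration of the kerberos options above, a Thrift source running
+in secure mode might be configured as follows (the principal and keytab
+values are hypothetical):</p>
+<div class="highlight-properties"><pre>a1.sources.r1.type = thrift
+a1.sources.r1.channels = c1
+a1.sources.r1.bind = 0.0.0.0
+a1.sources.r1.port = 4464
+a1.sources.r1.kerberos = true
+a1.sources.r1.agent-principal = flume/_HOST@EXAMPLE.COM
+a1.sources.r1.agent-keytab = /etc/flume/conf/flume.keytab</pre>
+</div>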
<p>Example for agent named a1:</p>
@@ -873,19 +944,23 @@ latter produces a single event and exits
<td>20</td>
<td>The max number of lines to read and send to the channel at a time</td>
</tr>
-<tr class="row-even"><td>selector.type</td>
+<tr class="row-even"><td>batchTimeout</td>
+<td>3000</td>
+<td>Amount of time (in milliseconds) to wait, if the buffer size was not
reached, before data is pushed downstream</td>
+</tr>
+<tr class="row-odd"><td>selector.type</td>
<td>replicating</td>
<td>replicating or multiplexing</td>
</tr>
-<tr class="row-odd"><td>selector.*</td>
+<tr class="row-even"><td>selector.*</td>
<td> </td>
<td>Depends on the selector.type value</td>
</tr>
-<tr class="row-even"><td>interceptors</td>
+<tr class="row-odd"><td>interceptors</td>
<td>–</td>
<td>Space-separated list of interceptors</td>
</tr>
-<tr class="row-odd"><td>interceptors.*</td>
+<tr class="row-even"><td>interceptors.*</td>
<td> </td>
<td> </td>
</tr>
@@ -931,9 +1006,9 @@ allows the ‘command’ to use
loops, conditionals etc. In the absence of the ‘shell’ config, the
‘command’ will be
invoked directly. Common values for ‘shell’ : ‘/bin/sh
-c’, ‘/bin/ksh -c’,
‘cmd /c’, ‘powershell -Command’, etc.</p>
-<div class="highlight-properties"><div class="highlight"><pre><span
class="na">agent_foo.sources.tailsource-1.type</span> <span class="o">=</span>
<span class="s">exec</span>
-<span class="na">agent_foo.sources.tailsource-1.shell</span> <span
class="o">=</span> <span class="s">/bin/bash -c</span>
-<span class="na">agent_foo.sources.tailsource-1.command</span> <span
class="o">=</span> <span class="s">for i in /path/*.txt; do cat $i; done</span>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources.tailsource-1.type</span> <span class="o">=</span> <span
class="s">exec</span>
+<span class="na">a1.sources.tailsource-1.shell</span> <span class="o">=</span>
<span class="s">/bin/bash -c</span>
+<span class="na">a1.sources.tailsource-1.command</span> <span
class="o">=</span> <span class="s">for i in /path/*.txt; do cat $i; done</span>
</pre></div>
</div>
</div>
@@ -1199,86 +1274,13 @@ Defaults to parsing each line as an even
</tbody>
</table>
<p>Example for an agent named agent-1:</p>
-<div class="highlight-properties"><div class="highlight"><pre><span
class="na">agent-1.channels</span> <span class="o">=</span> <span
class="s">ch-1</span>
-<span class="na">agent-1.sources</span> <span class="o">=</span> <span
class="s">src-1</span>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.channels</span> <span class="o">=</span> <span
class="s">ch-1</span>
+<span class="na">a1.sources</span> <span class="o">=</span> <span
class="s">src-1</span>
-<span class="na">agent-1.sources.src-1.type</span> <span class="o">=</span>
<span class="s">spooldir</span>
-<span class="na">agent-1.sources.src-1.channels</span> <span
class="o">=</span> <span class="s">ch-1</span>
-<span class="na">agent-1.sources.src-1.spoolDir</span> <span
class="o">=</span> <span class="s">/var/log/apache/flumeSpool</span>
-<span class="na">agent-1.sources.src-1.fileHeader</span> <span
class="o">=</span> <span class="s">true</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="twitter-1-firehose-source-experimental">
-<h4>Twitter 1% firehose Source (experimental)<a class="headerlink"
href="#twitter-1-firehose-source-experimental" title="Permalink to this
headline">¶</a></h4>
-<div class="admonition warning">
-<p class="first admonition-title">Warning</p>
-<p class="last">This source is hightly experimental and may change between
minor versions of Flume.
-Use at your own risk.</p>
-</div>
-<p>Experimental source that connects via Streaming API to the 1% sample twitter
-firehose, continously downloads tweets, converts them to Avro format and
-sends Avro events to a downstream Flume sink. Requires the consumer and
-access tokens and secrets of a Twitter developer account.
-Required properties are in <strong>bold</strong>.</p>
-<table border="1" class="docutils">
-<colgroup>
-<col width="18%" />
-<col width="9%" />
-<col width="72%" />
-</colgroup>
-<thead valign="bottom">
-<tr class="row-odd"><th class="head">Property Name</th>
-<th class="head">Default</th>
-<th class="head">Description</th>
-</tr>
-</thead>
-<tbody valign="top">
-<tr class="row-even"><td><strong>channels</strong></td>
-<td>–</td>
-<td> </td>
-</tr>
-<tr class="row-odd"><td><strong>type</strong></td>
-<td>–</td>
-<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">org.apache.flume.source.twitter.TwitterSource</span></tt></td>
-</tr>
-<tr class="row-even"><td><strong>consumerKey</strong></td>
-<td>–</td>
-<td>OAuth consumer key</td>
-</tr>
-<tr class="row-odd"><td><strong>consumerSecret</strong></td>
-<td>–</td>
-<td>OAuth consumer secret</td>
-</tr>
-<tr class="row-even"><td><strong>accessToken</strong></td>
-<td>–</td>
-<td>OAuth access token</td>
-</tr>
-<tr class="row-odd"><td><strong>accessTokenSecret</strong></td>
-<td>–</td>
-<td>OAuth toekn secret</td>
-</tr>
-<tr class="row-even"><td>maxBatchSize</td>
-<td>1000</td>
-<td>Maximum number of twitter messages to put in a single batch</td>
-</tr>
-<tr class="row-odd"><td>maxBatchDurationMillis</td>
-<td>1000</td>
-<td>Maximum number of milliseconds to wait before closing a batch</td>
-</tr>
-</tbody>
-</table>
-<p>Example for agent named a1:</p>
-<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources</span> <span class="o">=</span> <span class="s">r1</span>
-<span class="na">a1.channels</span> <span class="o">=</span> <span
class="s">c1</span>
-<span class="na">a1.sources.r1.type</span> <span class="o">=</span> <span
class="s">org.apache.flume.source.twitter.TwitterSource</span>
-<span class="na">a1.sources.r1.channels</span> <span class="o">=</span> <span
class="s">c1</span>
-<span class="na">a1.sources.r1.consumerKey</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_CONSUMER_KEY</span>
-<span class="na">a1.sources.r1.consumerSecret</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_CONSUMER_SECRET</span>
-<span class="na">a1.sources.r1.accessToken</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_ACCESS_TOKEN</span>
-<span class="na">a1.sources.r1.accessTokenSecret</span> <span
class="o">=</span> <span class="s">YOUR_TWITTER_ACCESS_TOKEN_SECRET</span>
-<span class="na">a1.sources.r1.maxBatchSize</span> <span class="o">=</span>
<span class="s">10</span>
-<span class="na">a1.sources.r1.maxBatchDurationMillis</span> <span
class="o">=</span> <span class="s">200</span>
+<span class="na">a1.sources.src-1.type</span> <span class="o">=</span> <span
class="s">spooldir</span>
+<span class="na">a1.sources.src-1.channels</span> <span class="o">=</span>
<span class="s">ch-1</span>
+<span class="na">a1.sources.src-1.spoolDir</span> <span class="o">=</span>
<span class="s">/var/log/apache/flumeSpool</span>
+<span class="na">a1.sources.src-1.fileHeader</span> <span class="o">=</span>
<span class="s">true</span>
</pre></div>
</div>
<div class="section" id="event-deserializers">
@@ -1381,6 +1383,155 @@ inefficient compared to <tt class="docut
</div>
</div>
</div>
+<div class="section" id="twitter-1-firehose-source-experimental">
+<h4>Twitter 1% firehose Source (experimental)<a class="headerlink"
href="#twitter-1-firehose-source-experimental" title="Permalink to this
headline">¶</a></h4>
+<div class="admonition warning">
+<p class="first admonition-title">Warning</p>
+<p class="last">This source is highly experimental and may change between
minor versions of Flume.
+Use at your own risk.</p>
+</div>
+<p>Experimental source that connects via the Streaming API to the 1% sample
twitter
+firehose, continuously downloads tweets, converts them to Avro format and
+sends Avro events to a downstream Flume sink. Requires the consumer and
+access tokens and secrets of a Twitter developer account.
+Required properties are in <strong>bold</strong>.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="18%" />
+<col width="9%" />
+<col width="72%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>channels</strong></td>
+<td>–</td>
+<td> </td>
+</tr>
+<tr class="row-odd"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">org.apache.flume.source.twitter.TwitterSource</span></tt></td>
+</tr>
+<tr class="row-even"><td><strong>consumerKey</strong></td>
+<td>–</td>
+<td>OAuth consumer key</td>
+</tr>
+<tr class="row-odd"><td><strong>consumerSecret</strong></td>
+<td>–</td>
+<td>OAuth consumer secret</td>
+</tr>
+<tr class="row-even"><td><strong>accessToken</strong></td>
+<td>–</td>
+<td>OAuth access token</td>
+</tr>
+<tr class="row-odd"><td><strong>accessTokenSecret</strong></td>
+<td>–</td>
+<td>OAuth token secret</td>
+</tr>
+<tr class="row-even"><td>maxBatchSize</td>
+<td>1000</td>
+<td>Maximum number of twitter messages to put in a single batch</td>
+</tr>
+<tr class="row-odd"><td>maxBatchDurationMillis</td>
+<td>1000</td>
+<td>Maximum number of milliseconds to wait before closing a batch</td>
+</tr>
+</tbody>
+</table>
+<p>Example for agent named a1:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources</span> <span class="o">=</span> <span class="s">r1</span>
+<span class="na">a1.channels</span> <span class="o">=</span> <span
class="s">c1</span>
+<span class="na">a1.sources.r1.type</span> <span class="o">=</span> <span
class="s">org.apache.flume.source.twitter.TwitterSource</span>
+<span class="na">a1.sources.r1.channels</span> <span class="o">=</span> <span
class="s">c1</span>
+<span class="na">a1.sources.r1.consumerKey</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_CONSUMER_KEY</span>
+<span class="na">a1.sources.r1.consumerSecret</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_CONSUMER_SECRET</span>
+<span class="na">a1.sources.r1.accessToken</span> <span class="o">=</span>
<span class="s">YOUR_TWITTER_ACCESS_TOKEN</span>
+<span class="na">a1.sources.r1.accessTokenSecret</span> <span
class="o">=</span> <span class="s">YOUR_TWITTER_ACCESS_TOKEN_SECRET</span>
+<span class="na">a1.sources.r1.maxBatchSize</span> <span class="o">=</span>
<span class="s">10</span>
+<span class="na">a1.sources.r1.maxBatchDurationMillis</span> <span
class="o">=</span> <span class="s">200</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="kafka-source">
+<h4>Kafka Source<a class="headerlink" href="#kafka-source" title="Permalink to
this headline">¶</a></h4>
+<p>Kafka Source is an Apache Kafka consumer that reads messages from a Kafka
topic.
+If you have multiple Kafka sources running, you can configure them with the
same Consumer Group
+so each will read a unique set of partitions for the topic.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="21%" />
+<col width="8%" />
+<col width="71%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>channels</strong></td>
+<td>–</td>
+<td> </td>
+</tr>
+<tr class="row-odd"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">org.apache.flume.source.kafka.KafkaSource</span></tt></td>
+</tr>
+<tr class="row-even"><td><strong>zookeeperConnect</strong></td>
+<td>–</td>
+<td>URI of ZooKeeper used by Kafka cluster</td>
+</tr>
+<tr class="row-odd"><td><strong>groupId</strong></td>
+<td>flume</td>
+<td>Unique identifier of the consumer group. Setting the same id in multiple
sources or agents
+indicates that they are part of the same consumer group</td>
+</tr>
+<tr class="row-even"><td><strong>topic</strong></td>
+<td>–</td>
+<td>Kafka topic we’ll read messages from. At this time, this is a
single topic only.</td>
+</tr>
+<tr class="row-odd"><td>batchSize</td>
+<td>1000</td>
+<td>Maximum number of messages written to Channel in one batch</td>
+</tr>
+<tr class="row-even"><td>batchDurationMillis</td>
+<td>1000</td>
+<td>Maximum time (in ms) before a batch is written to the Channel.
+The batch is written whenever the first of size or time is reached.</td>
+</tr>
+<tr class="row-odd"><td>Other Kafka Consumer Properties</td>
+<td>–</td>
+<td>These properties are used to configure the Kafka Consumer. Any consumer
property supported
+by Kafka can be used. The only requirement is to prepend the property name
with the prefix <tt class="docutils literal"><span
class="pre">kafka.</span></tt>.
+For example: kafka.consumer.timeout.ms.
+Check the <cite>Kafka documentation
<https://kafka.apache.org/08/configuration.html#consumerconfigs></cite>
for details</td>
+</tr>
+</tbody>
+</table>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">The Kafka Source overrides two Kafka consumer parameters:
+auto.commit.enable is set to “false” by the source, and every
batch is committed. For improved performance
+this can be set to “true”; however, this can lead to loss of data.
+consumer.timeout.ms is set to 10ms, so when we check Kafka for new data we
wait at most 10ms for the data to arrive.
+Setting this to a higher value can reduce CPU utilization (we’ll poll
Kafka in less of a tight loop), but also means
+higher latency in writing batches to the channel (since we’ll wait
longer for data to arrive).</p>
+</div>
+<p>Example for agent named tier1:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">tier1.sources.source1.type</span> <span class="o">=</span> <span
class="s">org.apache.flume.source.kafka.KafkaSource</span>
+<span class="na">tier1.sources.source1.channels</span> <span
class="o">=</span> <span class="s">channel1</span>
+<span class="na">tier1.sources.source1.zookeeperConnect</span> <span
class="o">=</span> <span class="s">localhost:2181</span>
+<span class="na">tier1.sources.source1.topic</span> <span class="o">=</span>
<span class="s">test1</span>
+<span class="na">tier1.sources.source1.groupId</span> <span class="o">=</span>
<span class="s">flume</span>
+<span class="na">tier1.sources.source1.kafka.consumer.timeout.ms</span> <span
class="o">=</span> <span class="s">100</span>
+</pre></div>
+</div>
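+<p>For example, to trade the batch-level commit described in the note above
+for throughput, the consumer parameter could be overridden as follows
+(shown for the same hypothetical tier1 agent, and with the data-loss caveat
+noted above):</p>
+<div class="highlight-properties"><pre>tier1.sources.source1.kafka.auto.commit.enable = true</pre>
+</div>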
+</div>
<div class="section" id="netcat-source">
<h4>NetCat Source<a class="headerlink" href="#netcat-source" title="Permalink
to this headline">¶</a></h4>
<p>A netcat-like source that listens on a given port and turns each line of
text
@@ -1553,9 +1704,14 @@ of characters separated by a newline (&#
<td>Maximum size of a single event line, in bytes</td>
</tr>
<tr class="row-odd"><td>keepFields</td>
-<td>false</td>
-<td>Setting this to true will preserve the Priority,
-Timestamp and Hostname in the body of the event.</td>
+<td>none</td>
+<td>Setting this to ‘all’ will preserve the Priority,
+Timestamp and Hostname in the body of the event.
+A space-separated list of fields to include
+is allowed as well. Currently, the following
+fields can be included: priority, version,
+timestamp, hostname. The values ‘true’ and ‘false’
+have been deprecated in favor of ‘all’ and ‘none’.</td>
</tr>
<tr class="row-even"><td>selector.type</td>
<td> </td>
@@ -1628,9 +1784,14 @@ basis.</p>
<td>Maximum size of a single event line, in bytes.</td>
</tr>
<tr class="row-odd"><td>keepFields</td>
-<td>false</td>
-<td>Setting this to true will preserve the
-Priority, Timestamp and Hostname in the body of the event.</td>
+<td>none</td>
+<td>Setting this to ‘all’ will preserve the
+Priority, Timestamp and Hostname in the body of the event.
+A space-separated list of fields to include
+is allowed as well. Currently, the following
+fields can be included: priority, version,
+timestamp, hostname. The values ‘true’ and ‘false’
+have been deprecated in favor of ‘all’ and ‘none’.</td>
</tr>
<tr class="row-even"><td>portHeader</td>
<td>–</td>
@@ -1904,6 +2065,58 @@ for list of events can be created by:</p
</table>
</div>
</div>
+<div class="section" id="stress-source">
+<h4>Stress Source<a class="headerlink" href="#stress-source" title="Permalink
to this headline">¶</a></h4>
+<p>StressSource is an internal load-generating source implementation which is
very useful for
+stress tests. It allows the user to configure the size of the Event payload,
with empty headers.
+The user can configure the total number of events to be sent, as well as the
maximum number
+of successful events to be delivered.</p>
+<p>Required properties are in <strong>bold</strong>.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="18%" />
+<col width="10%" />
+<col width="72%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">org.apache.flume.source.StressSource</span></tt></td>
+</tr>
+<tr class="row-odd"><td>size</td>
+<td>500</td>
+<td>Payload size of each Event. Unit:<strong>byte</strong></td>
+</tr>
+<tr class="row-even"><td>maxTotalEvents</td>
+<td>-1</td>
+<td>Maximum number of Events to be sent</td>
+</tr>
+<tr class="row-odd"><td>maxSuccessfulEvents</td>
+<td>-1</td>
+<td>Maximum number of Events successfully sent</td>
+</tr>
+<tr class="row-even"><td>batchSize</td>
+<td>1</td>
+<td>Number of Events to be sent in one batch</td>
+</tr>
+</tbody>
+</table>
+<p>Example for agent named <strong>a1</strong>:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources</span> <span class="o">=</span> <span
class="s">stresssource-1</span>
+<span class="na">a1.channels</span> <span class="o">=</span> <span
class="s">memoryChannel-1</span>
+<span class="na">a1.sources.stresssource-1.type</span> <span
class="o">=</span> <span class="s">org.apache.flume.source.StressSource</span>
+<span class="na">a1.sources.stresssource-1.size</span> <span
class="o">=</span> <span class="s">10240</span>
+<span class="na">a1.sources.stresssource-1.maxTotalEvents</span> <span
class="o">=</span> <span class="s">1000000</span>
+<span class="na">a1.sources.stresssource-1.channels</span> <span
class="o">=</span> <span class="s">memoryChannel-1</span>
+</pre></div>
+</div>
+</div>
<div class="section" id="legacy-sources">
<h4>Legacy Sources<a class="headerlink" href="#legacy-sources"
title="Permalink to this headline">¶</a></h4>
<p>The legacy sources allow a Flume 1.x agent to receive events from Flume
0.9.4
@@ -2103,9 +2316,9 @@ For deployment of Scribe please follow t
Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="13%" />
+<col width="17%" />
<col width="10%" />
-<col width="77%" />
+<col width="73%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -2122,15 +2335,19 @@ Required properties are in <strong>bold<
<td>1499</td>
<td>Port that Scribe should connect to</td>
</tr>
-<tr class="row-even"><td>workerThreads</td>
+<tr class="row-even"><td>maxReadBufferBytes</td>
+<td>16384000</td>
+<td>Thrift Default FrameBuffer Size</td>
+</tr>
+<tr class="row-odd"><td>workerThreads</td>
<td>5</td>
<td>Number of handling threads in Thrift</td>
</tr>
-<tr class="row-odd"><td>selector.type</td>
+<tr class="row-even"><td>selector.type</td>
<td> </td>
<td> </td>
</tr>
-<tr class="row-even"><td>selector.*</td>
+<tr class="row-odd"><td>selector.*</td>
<td> </td>
<td> </td>
</tr>
@@ -2198,24 +2415,30 @@ required.</p>
<tr class="row-odd"><td>%d</td>
<td>day of month (01)</td>
</tr>
-<tr class="row-even"><td>%D</td>
+<tr class="row-even"><td>%e</td>
+<td>day of month without padding (1)</td>
+</tr>
+<tr class="row-odd"><td>%D</td>
<td>date; same as %m/%d/%y</td>
</tr>
-<tr class="row-odd"><td>%H</td>
+<tr class="row-even"><td>%H</td>
<td>hour (00..23)</td>
</tr>
-<tr class="row-even"><td>%I</td>
+<tr class="row-odd"><td>%I</td>
<td>hour (01..12)</td>
</tr>
-<tr class="row-odd"><td>%j</td>
+<tr class="row-even"><td>%j</td>
<td>day of year (001..366)</td>
</tr>
-<tr class="row-even"><td>%k</td>
+<tr class="row-odd"><td>%k</td>
<td>hour ( 0..23)</td>
</tr>
-<tr class="row-odd"><td>%m</td>
+<tr class="row-even"><td>%m</td>
<td>month (01..12)</td>
</tr>
+<tr class="row-odd"><td>%n</td>
+<td>month without padding (1..12)</td>
+</tr>
<tr class="row-even"><td>%M</td>
<td>minute (00..59)</td>
</tr>
@@ -2251,9 +2474,9 @@ this automatically is to use the Timesta
</div>
<table border="1" class="docutils">
<colgroup>
-<col width="14%" />
-<col width="8%" />
-<col width="79%" />
+<col width="12%" />
+<col width="6%" />
+<col width="82%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Name</th>
@@ -2356,69 +2579,329 @@ This number should be increased if many
<td>–</td>
<td>Kerberos keytab for accessing secure HDFS</td>
</tr>
-<tr class="row-even"><td>hdfs.proxyUser</td>
-<td> </td>
-<td> </td>
+<tr class="row-even"><td>hdfs.proxyUser</td>
+<td> </td>
+<td> </td>
+</tr>
+<tr class="row-odd"><td>hdfs.round</td>
+<td>false</td>
+<td>Should the timestamp be rounded down (if true, affects all time based
escape sequences except %t)</td>
+</tr>
+<tr class="row-even"><td>hdfs.roundValue</td>
+<td>1</td>
+<td>Rounded down to the highest multiple of this (in the unit configured using
<tt class="docutils literal"><span class="pre">hdfs.roundUnit</span></tt>),
less than current time.</td>
+</tr>
+<tr class="row-odd"><td>hdfs.roundUnit</td>
+<td>second</td>
+<td>The unit of the round down value - <tt class="docutils literal"><span
class="pre">second</span></tt>, <tt class="docutils literal"><span
class="pre">minute</span></tt> or <tt class="docutils literal"><span
class="pre">hour</span></tt>.</td>
+</tr>
+<tr class="row-even"><td>hdfs.timeZone</td>
+<td>Local Time</td>
+<td>Name of the timezone that should be used for resolving the directory path,
e.g. America/Los_Angeles.</td>
+</tr>
+<tr class="row-odd"><td>hdfs.useLocalTimeStamp</td>
+<td>false</td>
+<td>Use the local time (instead of the timestamp from the event header) while
replacing the escape sequences.</td>
+</tr>
+<tr class="row-even"><td>hdfs.closeTries</td>
+<td>0</td>
+<td>Number of times the sink must try renaming a file, after initiating a
close attempt. If set to 1, this sink will not re-try a failed rename
+(due to, for example, NameNode or DataNode failure), and may leave the file in
an open state with a .tmp extension.
+If set to 0, the sink will try to rename the file until the file is eventually
renamed (there is no limit on the number of times it would try).
+The file may still remain open if the close call fails but the data will be
intact and in this case, the file will be closed only after a Flume
restart.</td>
+</tr>
+<tr class="row-odd"><td>hdfs.retryInterval</td>
+<td>180</td>
+<td>Time in seconds between consecutive attempts to close a file. Each close
call costs multiple RPC round-trips to the Namenode,
+so setting this too low can cause a lot of load on the name node. If set to 0
or less, the sink will not
+attempt to close the file if the first attempt fails, and may leave the file
open or with a ”.tmp” extension.</td>
+</tr>
+<tr class="row-even"><td>serializer</td>
+<td><tt class="docutils literal"><span class="pre">TEXT</span></tt></td>
+<td>Other possible options include <tt class="docutils literal"><span
class="pre">avro_event</span></tt> or the
+fully-qualified class name of an implementation of the
+<tt class="docutils literal"><span
class="pre">EventSerializer.Builder</span></tt> interface.</td>
+</tr>
+<tr class="row-odd"><td>serializer.*</td>
+<td> </td>
+<td> </td>
+</tr>
+</tbody>
+</table>
+<p>Example for agent named a1:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.channels</span> <span class="o">=</span> <span class="s">c1</span>
+<span class="na">a1.sinks</span> <span class="o">=</span> <span
class="s">k1</span>
+<span class="na">a1.sinks.k1.type</span> <span class="o">=</span> <span
class="s">hdfs</span>
+<span class="na">a1.sinks.k1.channel</span> <span class="o">=</span> <span
class="s">c1</span>
+<span class="na">a1.sinks.k1.hdfs.path</span> <span class="o">=</span> <span
class="s">/flume/events/%y-%m-%d/%H%M/%S</span>
+<span class="na">a1.sinks.k1.hdfs.filePrefix</span> <span class="o">=</span>
<span class="s">events-</span>
+<span class="na">a1.sinks.k1.hdfs.round</span> <span class="o">=</span> <span
class="s">true</span>
+<span class="na">a1.sinks.k1.hdfs.roundValue</span> <span class="o">=</span>
<span class="s">10</span>
+<span class="na">a1.sinks.k1.hdfs.roundUnit</span> <span class="o">=</span>
<span class="s">minute</span>
+</pre></div>
+</div>
+<p>The above configuration will round down the timestamp to the last 10th
minute. For example, an event with
+timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become <tt
class="docutils literal"><span
class="pre">/flume/events/2012-06-12/1150/00</span></tt>.</p>
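+<p>For the close-retry behaviour described in the table above, a hedged sketch
+(the values shown are illustrative, not recommendations):</p>
+<div class="highlight-properties"><pre># retry a failed close/rename up to 5 times, waiting 120 seconds between attempts
a1.sinks.k1.hdfs.closeTries = 5
a1.sinks.k1.hdfs.retryInterval = 120</pre>
+</div>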
+</div>
+<div class="section" id="hive-sink">
+<h4>Hive Sink<a class="headerlink" href="#hive-sink" title="Permalink to this
headline">¶</a></h4>
+<p>This sink streams events containing delimited text or JSON data directly
into a Hive table or partition.
+Events are written using Hive transactions. As soon as a set of events is
committed to Hive, it becomes
+immediately visible to Hive queries. Partitions to which Flume will stream
can either be pre-created
+or, optionally, Flume can create them if they are missing. Fields from
incoming event data are mapped to
+corresponding columns in the Hive table. <strong>This sink is provided as a
preview feature and is not recommended
+for use in production.</strong></p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="15%" />
+<col width="8%" />
+<col width="77%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>channel</strong></td>
+<td>–</td>
+<td> </td>
+</tr>
+<tr class="row-odd"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">hive</span></tt></td>
+</tr>
+<tr class="row-even"><td><strong>hive.metastore</strong></td>
+<td>–</td>
+<td>Hive metastore URI (eg thrift://a.b.com:9083 )</td>
+</tr>
+<tr class="row-odd"><td><strong>hive.database</strong></td>
+<td>–</td>
+<td>Hive database name</td>
+</tr>
+<tr class="row-even"><td><strong>hive.table</strong></td>
+<td>–</td>
+<td>Hive table name</td>
+</tr>
+<tr class="row-odd"><td>hive.partition</td>
+<td>–</td>
+<td>Comma separated list of partition values identifying the partition to write
to. May contain escape
+sequences. E.g: If the table is partitioned by (continent: string, country
:string, time : string)
+then ‘Asia,India,2014-02-26-01-21’ will indicate
continent=Asia,country=India,time=2014-02-26-01-21</td>
+</tr>
+<tr class="row-even"><td>hive.txnsPerBatchAsk</td>
+<td>100</td>
+<td>Hive grants a <em>batch of transactions</em> instead of single
transactions to streaming clients like Flume.
+This setting configures the number of desired transactions per Transaction
Batch. Data from all
+transactions in a single batch end up in a single file. Flume will write a
maximum of batchSize events
+in each transaction in the batch. This setting in conjunction with batchSize
provides control over the
+size of each file. Note that eventually Hive will transparently compact these
files into larger files.</td>
+</tr>
+<tr class="row-odd"><td>heartBeatInterval</td>
+<td>240</td>
+<td>(In seconds) Interval between consecutive heartbeats sent to Hive to keep
unused transactions from expiring.
+Set this value to 0 to disable heartbeats.</td>
+</tr>
+<tr class="row-even"><td>autoCreatePartitions</td>
+<td>true</td>
+<td>Flume will automatically create the necessary Hive partitions to stream
to</td>
+</tr>
+<tr class="row-odd"><td>batchSize</td>
+<td>15000</td>
+<td>Max number of events written to Hive in a single Hive transaction</td>
+</tr>
+<tr class="row-even"><td>maxOpenConnections</td>
+<td>500</td>
+<td>Allow only this number of open connections. If this number is exceeded,
the least recently used connection is closed.</td>
+</tr>
+<tr class="row-odd"><td>callTimeout</td>
+<td>10000</td>
+<td>(In milliseconds) Timeout for Hive & HDFS I/O operations, such as
openTxn, write, commit, abort.</td>
+</tr>
+<tr class="row-even"><td><strong>serializer</strong></td>
+<td> </td>
+<td>The serializer is responsible for parsing out fields from the event and mapping
them to columns in the Hive table.
+Choice of serializer depends upon the format of the data in the event.
Supported serializers: DELIMITED and JSON</td>
+</tr>
+<tr class="row-odd"><td>roundUnit</td>
+<td>minute</td>
+<td>The unit of the round down value - <tt class="docutils literal"><span
class="pre">second</span></tt>, <tt class="docutils literal"><span
class="pre">minute</span></tt> or <tt class="docutils literal"><span
class="pre">hour</span></tt>.</td>
+</tr>
+<tr class="row-even"><td>roundValue</td>
+<td>1</td>
+<td>Rounded down to the highest multiple of this (in the unit configured using
hive.roundUnit), less than current time</td>
+</tr>
+<tr class="row-odd"><td>timeZone</td>
+<td>Local Time</td>
+<td>Name of the timezone that should be used for resolving the escape
sequences in partition, e.g. America/Los_Angeles.</td>
+</tr>
+<tr class="row-even"><td>useLocalTimeStamp</td>
+<td>false</td>
+<td>Use the local time (instead of the timestamp from the event header) while
replacing the escape sequences.</td>
+</tr>
+</tbody>
+</table>
+<p>Following serializers are provided for Hive sink:</p>
+<p><strong>JSON</strong>: Handles UTF-8 encoded JSON (strict syntax) events and
requires no configuration. Object names
+in the JSON are mapped directly to columns with the same name in the Hive
table.
+Internally uses org.apache.hive.hcatalog.data.JsonSerDe but is independent of
the Serde of the Hive table.
+This serializer requires HCatalog to be installed.</p>
+<p><strong>DELIMITED</strong>: Handles simple delimited textual events.
+Internally uses LazySimpleSerde but is independent of the Serde of the Hive
table.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="22%" />
+<col width="10%" />
+<col width="68%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td>serializer.delimiter</td>
+<td>,</td>
+<td>(Type: string) The field delimiter in the incoming data. To use special
+characters, surround them with double quotes like “\t”</td>
+</tr>
+<tr class="row-odd"><td><strong>serializer.fieldnames</strong></td>
+<td>–</td>
+<td>The mapping from input fields to columns in hive table. Specified as a
+comma separated list (no spaces) of hive table columns names, identifying
+the input fields in order of their occurrence. To skip fields leave the
+column name unspecified. E.g. ‘time,,ip,message’ indicates that the 1st,
3rd
+and 4th fields in the input map to the time, ip and message columns in the Hive
table.</td>
+</tr>
+<tr class="row-even"><td>serializer.serdeSeparator</td>
+<td>Ctrl-A</td>
+<td>(Type: character) Customizes the separator used by the underlying serde. There
+can be a gain in efficiency if the fields in serializer.fieldnames are in the
+same order as the table columns, serializer.delimiter is the same as
+serializer.serdeSeparator, and the number of fields in serializer.fieldnames
+is less than or equal to the number of table columns, as the fields in the incoming
+event body then do not need to be reordered to match the order of the table columns.
+Use single quotes for special characters like ‘\t’.
+Ensure input fields do not contain this character. NOTE: If
serializer.delimiter
+is a single character, preferably set this to the same character.</td>
+</tr>
+</tbody>
+</table>
+<p>The following are the escape sequences supported:</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="10%" />
+<col width="90%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Alias</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td>%{host}</td>
+<td>Substitute value of event header named “host”. Arbitrary
header names are supported.</td>
+</tr>
+<tr class="row-odd"><td>%t</td>
+<td>Unix time in milliseconds</td>
+</tr>
+<tr class="row-even"><td>%a</td>
+<td>locale’s short weekday name (Mon, Tue, ...)</td>
+</tr>
+<tr class="row-odd"><td>%A</td>
+<td>locale’s full weekday name (Monday, Tuesday, ...)</td>
+</tr>
+<tr class="row-even"><td>%b</td>
+<td>locale’s short month name (Jan, Feb, ...)</td>
+</tr>
+<tr class="row-odd"><td>%B</td>
+<td>locale’s long month name (January, February, ...)</td>
+</tr>
+<tr class="row-even"><td>%c</td>
+<td>locale’s date and time (Thu Mar 3 23:05:25 2005)</td>
+</tr>
+<tr class="row-odd"><td>%d</td>
+<td>day of month (01)</td>
+</tr>
+<tr class="row-even"><td>%D</td>
+<td>date; same as %m/%d/%y</td>
+</tr>
+<tr class="row-odd"><td>%H</td>
+<td>hour (00..23)</td>
+</tr>
+<tr class="row-even"><td>%I</td>
+<td>hour (01..12)</td>
+</tr>
+<tr class="row-odd"><td>%j</td>
+<td>day of year (001..366)</td>
</tr>
-<tr class="row-odd"><td>hdfs.round</td>
-<td>false</td>
-<td>Should the timestamp be rounded down (if true, affects all time based
escape sequences except %t)</td>
+<tr class="row-even"><td>%k</td>
+<td>hour ( 0..23)</td>
</tr>
-<tr class="row-even"><td>hdfs.roundValue</td>
-<td>1</td>
-<td>Rounded down to the highest multiple of this (in the unit configured using
<tt class="docutils literal"><span class="pre">hdfs.roundUnit</span></tt>),
less than current time.</td>
+<tr class="row-odd"><td>%m</td>
+<td>month (01..12)</td>
</tr>
-<tr class="row-odd"><td>hdfs.roundUnit</td>
-<td>second</td>
-<td>The unit of the round down value - <tt class="docutils literal"><span
class="pre">second</span></tt>, <tt class="docutils literal"><span
class="pre">minute</span></tt> or <tt class="docutils literal"><span
class="pre">hour</span></tt>.</td>
+<tr class="row-even"><td>%M</td>
+<td>minute (00..59)</td>
</tr>
-<tr class="row-even"><td>hdfs.timeZone</td>
-<td>Local Time</td>
-<td>Name of the timezone that should be used for resolving the directory path,
e.g. America/Los_Angeles.</td>
+<tr class="row-odd"><td>%p</td>
+<td>locale’s equivalent of am or pm</td>
</tr>
-<tr class="row-odd"><td>hdfs.useLocalTimeStamp</td>
-<td>false</td>
-<td>Use the local time (instead of the timestamp from the event header) while
replacing the escape sequences.</td>
+<tr class="row-even"><td>%s</td>
+<td>seconds since 1970-01-01 00:00:00 UTC</td>
</tr>
-<tr class="row-even"><td>hdfs.closeTries</td>
-<td>0</td>
-<td>Number of times the sink must try to close a file. If set to 1, this sink
will not re-try a failed close
-(due to, for example, NameNode or DataNode failure), and may leave the file in
an open state with a .tmp extension.
-If set to 0, the sink will try to close the file until the file is eventually
closed
-(there is no limit on the number of times it would try).</td>
+<tr class="row-odd"><td>%S</td>
+<td>second (00..59)</td>
</tr>
-<tr class="row-odd"><td>hdfs.retryInterval</td>
-<td>180</td>
-<td>Time in seconds between consecutive attempts to close a file. Each close
call costs multiple RPC round-trips to the Namenode,
-so setting this too low can cause a lot of load on the name node. If set to 0
or less, the sink will not
-attempt to close the file if the first attempt fails, and may leave the file
open or with a ”.tmp” extension.</td>
+<tr class="row-even"><td>%y</td>
+<td>last two digits of year (00..99)</td>
</tr>
-<tr class="row-even"><td>serializer</td>
-<td><tt class="docutils literal"><span class="pre">TEXT</span></tt></td>
-<td>Other possible options include <tt class="docutils literal"><span
class="pre">avro_event</span></tt> or the
-fully-qualified class name of an implementation of the
-<tt class="docutils literal"><span
class="pre">EventSerializer.Builder</span></tt> interface.</td>
+<tr class="row-odd"><td>%Y</td>
+<td>year (2010)</td>
</tr>
-<tr class="row-odd"><td>serializer.*</td>
-<td> </td>
-<td> </td>
+<tr class="row-even"><td>%z</td>
+<td>+hhmm numeric timezone (for example, -0400)</td>
</tr>
</tbody>
</table>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">For all of the time related escape sequences, a header with
the key
+“timestamp” must exist among the headers of the event (unless <tt
class="docutils literal"><span class="pre">useLocalTimeStamp</span></tt> is set
to <tt class="docutils literal"><span class="pre">true</span></tt>). One way to
add
+this automatically is to use the TimestampInterceptor.</p>
+</div>
+<p>Example Hive table :</p>
+<div class="highlight-properties"><pre>create table weblogs ( id int , msg
string )
+ partitioned by (continent string, country string, time string)
+ clustered by (id) into 5 buckets
+ stored as orc;</pre>
+</div>
<p>Example for agent named a1:</p>
<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.channels</span> <span class="o">=</span> <span class="s">c1</span>
+<span class="na">a1.channels.c1.type</span> <span class="o">=</span> <span
class="s">memory</span>
<span class="na">a1.sinks</span> <span class="o">=</span> <span
class="s">k1</span>
-<span class="na">a1.sinks.k1.type</span> <span class="o">=</span> <span
class="s">hdfs</span>
+<span class="na">a1.sinks.k1.type</span> <span class="o">=</span> <span
class="s">hive</span>
<span class="na">a1.sinks.k1.channel</span> <span class="o">=</span> <span
class="s">c1</span>
-<span class="na">a1.sinks.k1.hdfs.path</span> <span class="o">=</span> <span
class="s">/flume/events/%y-%m-%d/%H%M/%S</span>
-<span class="na">a1.sinks.k1.hdfs.filePrefix</span> <span class="o">=</span>
<span class="s">events-</span>
-<span class="na">a1.sinks.k1.hdfs.round</span> <span class="o">=</span> <span
class="s">true</span>
-<span class="na">a1.sinks.k1.hdfs.roundValue</span> <span class="o">=</span>
<span class="s">10</span>
-<span class="na">a1.sinks.k1.hdfs.roundUnit</span> <span class="o">=</span>
<span class="s">minute</span>
+<span class="na">a1.sinks.k1.hive.metastore</span> <span class="o">=</span>
<span class="s">thrift://127.0.0.1:9083</span>
+<span class="na">a1.sinks.k1.hive.database</span> <span class="o">=</span>
<span class="s">logsdb</span>
+<span class="na">a1.sinks.k1.hive.table</span> <span class="o">=</span> <span
class="s">weblogs</span>
+<span class="na">a1.sinks.k1.hive.partition</span> <span class="o">=</span>
<span class="s">asia,%{country},%y-%m-%d-%H-%M</span>
+<span class="na">a1.sinks.k1.useLocalTimeStamp</span> <span class="o">=</span>
<span class="s">false</span>
+<span class="na">a1.sinks.k1.round</span> <span class="o">=</span> <span
class="s">true</span>
+<span class="na">a1.sinks.k1.roundValue</span> <span class="o">=</span> <span
class="s">10</span>
+<span class="na">a1.sinks.k1.roundUnit</span> <span class="o">=</span> <span
class="s">minute</span>
+<span class="na">a1.sinks.k1.serializer</span> <span class="o">=</span> <span
class="s">DELIMITED</span>
+<span class="na">a1.sinks.k1.serializer.delimiter</span> <span
class="o">=</span> <span class="s">"\t"</span>
+<span class="na">a1.sinks.k1.serializer.serdeSeparator</span> <span
class="o">=</span> <span class="s">'\t'</span>
+<span class="na">a1.sinks.k1.serializer.fieldnames</span> <span
class="o">=</span><span class="s">id,,msg</span>
</pre></div>
</div>
<p>The above configuration will round down the timestamp to the last 10th
minute. For example, an event with
-timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become <tt
class="docutils literal"><span
class="pre">/flume/events/2012-06-12/1150/00</span></tt>.</p>
+timestamp header set to 11:54:34 AM, June 12, 2012 and ‘country’
header set to ‘india’ will evaluate to the
+partition
(continent=’asia’,country=’india’,time=‘2012-06-12-11-50’).
The serializer is configured to
+accept tab-separated input containing three fields and to skip the second
field.</p>
</div>
<div class="section" id="logger-sink">
<h4>Logger Sink<a class="headerlink" href="#logger-sink" title="Permalink to
this headline">¶</a></h4>
@@ -2426,9 +2909,9 @@ timestamp 11:54:34 AM, June 12, 2012 wil
Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="21%" />
+<col width="20%" />
<col width="10%" />
-<col width="69%" />
+<col width="70%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -2445,6 +2928,10 @@ Required properties are in <strong>bold<
<td>–</td>
<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">logger</span></tt></td>
</tr>
+<tr class="row-even"><td>maxBytesToLog</td>
+<td>16</td>
+<td>Maximum number of bytes of the Event body to log</td>
+</tr>
</tbody>
</table>
<p>Example for agent named a1:</p>
@@ -2560,13 +3047,18 @@ Required properties are in <strong>bold<
<p>This sink forms one half of Flume’s tiered collection support. Flume
events
sent to this sink are turned into Thrift events and sent to the configured
hostname / port pair. The events are taken from the configured Channel in
-batches of the configured batch size.
+batches of the configured batch size.</p>
+<p>The Thrift sink can be configured to start in secure mode by enabling kerberos
authentication.
+To communicate with a Thrift source started in secure mode, the Thrift sink
should also
+operate in secure mode. client-principal and client-keytab are the properties
used by the
+Thrift sink to authenticate to the kerberos KDC. The server-principal
represents the
+principal of the Thrift source this sink is configured to connect to in secure
mode.
Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="9%" />
+<col width="7%" />
<col width="2%" />
-<col width="89%" />
+<col width="91%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -2607,6 +3099,42 @@ Required properties are in <strong>bold<
<td>none</td>
<td>Amount of time (s) before the connection to the next hop is reset. This
will force the Thrift Sink to reconnect to the next hop. This will allow the
sink to connect to hosts behind a hardware load-balancer when new hosts are
added without having to restart the agent.</td>
</tr>
+<tr class="row-even"><td>ssl</td>
+<td>false</td>
+<td>Set to true to enable SSL for this ThriftSink. When configuring SSL, you
can optionally set a “truststore”,
“truststore-password” and “truststore-type”.</td>
+</tr>
+<tr class="row-odd"><td>truststore</td>
+<td>–</td>
+<td>The path to a custom Java truststore file. Flume uses the certificate
authority information in this file to determine whether the remote Thrift
Source’s SSL authentication credentials should be trusted. If not
specified, the default Java JSSE certificate authority files (typically
“jssecacerts” or “cacerts” in the Oracle JRE) will be
used.</td>
+</tr>
+<tr class="row-even"><td>truststore-password</td>
+<td>–</td>
+<td>The password for the specified truststore.</td>
+</tr>
+<tr class="row-odd"><td>truststore-type</td>
+<td>JKS</td>
+<td>The type of the Java truststore. This can be “JKS” or other
supported Java truststore type.</td>
+</tr>
+<tr class="row-even"><td>exclude-protocols</td>
+<td>SSLv3</td>
+<td>Space-separated list of SSL/TLS protocols to exclude</td>
+</tr>
+<tr class="row-odd"><td>kerberos</td>
+<td>false</td>
+<td>Set to true to enable kerberos authentication. In kerberos mode,
client-principal, client-keytab and server-principal are required for
successful authentication and communication with a kerberos-enabled Thrift
Source.</td>
+</tr>
+<tr class="row-even"><td>client-principal</td>
+<td>–</td>
+<td>The kerberos principal used by the Thrift Sink to authenticate to the
kerberos KDC.</td>
+</tr>
+<tr class="row-odd"><td>client-keytab</td>
+<td>–</td>
+<td>The keytab location used by the Thrift Sink in combination with the
client-principal to authenticate to the kerberos KDC.</td>
+</tr>
+<tr class="row-even"><td>server-principal</td>
+<td>–</td>
+<td>The kerberos principal of the Thrift Source to which the Thrift Sink is
configured to connect.</td>
+</tr>
</tbody>
</table>
<p>Example for agent named a1:</p>
@@ -3092,11 +3620,13 @@ the more powerful ElasticSearchIndexRequ
</tr>
<tr class="row-odd"><td>indexName</td>
<td>flume</td>
-<td>The name of the index which the date will be appended to. Example
‘flume’ -> ‘flume-yyyy-MM-dd’</td>
+<td>The name of the index which the date will be appended to. Example
‘flume’ -> ‘flume-yyyy-MM-dd’.
+Arbitrary header substitution is supported, e.g. %{header} replaces with the
value of the named event header</td>
</tr>
<tr class="row-even"><td>indexType</td>
<td>logs</td>
-<td>The type to index the document to, defaults to ‘log’</td>
+<td>The type to index the document to, defaults to ‘log’.
+Arbitrary header substitution is supported, e.g. %{header} replaces with the
value of the named event header</td>
</tr>
<tr class="row-odd"><td>clusterName</td>
<td>elasticsearch</td>
@@ -3125,6 +3655,12 @@ either class are accepted but ElasticSea
</tr>
</tbody>
</table>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">Header substitution is a handy way to use the value of an event
header to dynamically decide the indexName and indexType to use when storing
the event.
+Exercise caution when using this feature, as the event submitter then has
control of the indexName and indexType.
+Furthermore, if the elasticsearch REST client is used then the event submitter
has control of the URL path used.</p>
+</div>
<p>Example for agent named a1:</p>
<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.channels</span> <span class="o">=</span> <span class="s">c1</span>
<span class="na">a1.sinks</span> <span class="o">=</span> <span
class="s">k1</span>
@@ -3140,18 +3676,12 @@ either class are accepted but ElasticSea
</pre></div>
</div>
</div>
-<div class="section" id="kite-dataset-sink-experimental">
-<h4>Kite Dataset Sink (experimental)<a class="headerlink"
href="#kite-dataset-sink-experimental" title="Permalink to this
headline">¶</a></h4>
-<div class="admonition warning">
-<p class="first admonition-title">Warning</p>
-<p class="last">This source is experimental and may change between minor
versions of Flume.
-Use at your own risk.</p>
-</div>
-<p>Experimental sink that writes events to a <a class="reference external"
href="http://kitesdk.org/docs/current/kite-data/guide.html">Kite Dataset</a>.
+<div class="section" id="kite-dataset-sink">
+<h4>Kite Dataset Sink<a class="headerlink" href="#kite-dataset-sink"
title="Permalink to this headline">¶</a></h4>
+<p>Experimental sink that writes events to a <a class="reference external"
href="http://kitesdk.org/docs/current/guide/">Kite Dataset</a>.
This sink will deserialize the body of each incoming event and store the
-resulting record in a Kite Dataset. It determines target Dataset by opening a
-repository URI, <tt class="docutils literal"><span
class="pre">kite.repo.uri</span></tt>, and loading a Dataset by name,
-<tt class="docutils literal"><span
class="pre">kite.dataset.name</span></tt>.</p>
+resulting record in a Kite Dataset. It determines the target Dataset by loading
+a dataset by its URI.</p>
<p>The only supported serialization is avro, and the record schema must be
passed
in the event headers, using either <tt class="docutils literal"><span
class="pre">flume.avro.schema.literal</span></tt> with the JSON
schema representation or <tt class="docutils literal"><span
class="pre">flume.avro.schema.url</span></tt> with a URL where the schema
@@ -3164,9 +3694,9 @@ has been exceeded. However, this delay w
cases, the delay is negligible.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="26%" />
-<col width="8%" />
-<col width="66%" />
+<col width="28%" />
+<col width="7%" />
+<col width="65%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -3183,13 +3713,24 @@ cases, the delay is neglegible.</p>
<td>–</td>
<td>Must be org.apache.flume.sink.kite.DatasetSink</td>
</tr>
-<tr class="row-even"><td><strong>kite.repo.uri</strong></td>
+<tr class="row-even"><td><strong>kite.dataset.uri</strong></td>
+<td>–</td>
+<td>URI of the dataset to open</td>
+</tr>
+<tr class="row-odd"><td>kite.repo.uri</td>
+<td>–</td>
+<td>URI of the repository to open
+(deprecated; use kite.dataset.uri instead)</td>
+</tr>
+<tr class="row-even"><td>kite.dataset.namespace</td>
<td>–</td>
-<td>URI of the repository to open</td>
+<td>Namespace of the Dataset where records will be written
+(deprecated; use kite.dataset.uri instead)</td>
</tr>
-<tr class="row-odd"><td><strong>kite.dataset.name</strong></td>
+<tr class="row-odd"><td>kite.dataset.name</td>
<td>–</td>
-<td>Name of the Dataset where records will be written</td>
+<td>Name of the Dataset where records will be written
+(deprecated; use kite.dataset.uri instead)</td>
</tr>
<tr class="row-even"><td>kite.batchSize</td>
<td>100</td>
@@ -3199,15 +3740,56 @@ cases, the delay is neglegible.</p>
<td>30</td>
<td>Maximum wait time (seconds) before data files are released</td>
</tr>
-<tr class="row-even"><td>auth.kerberosPrincipal</td>
+<tr class="row-even"><td>kite.flushable.commitOnBatch</td>
+<td>true</td>
+<td>If <tt class="docutils literal"><span class="pre">true</span></tt>, the
Flume transaction will be committed and the
+writer will be flushed on each batch of <tt class="docutils literal"><span
class="pre">kite.batchSize</span></tt>
+records. This setting only applies to flushable datasets. When
+<tt class="docutils literal"><span class="pre">true</span></tt>, it’s
possible for temp files with committed data to be
+left in the dataset directory. These files need to be recovered
+by hand for the data to be visible to DatasetReaders.</td>
+</tr>
+<tr class="row-odd"><td>kite.syncable.syncOnBatch</td>
+<td>true</td>
+<td>Controls whether the sink will also sync data when committing
+the transaction. This setting only applies to syncable datasets.
+Syncing guarantees that data will be written to stable storage
+on the remote system, while flushing only guarantees that data
+has left Flume’s client buffers. When the
+<tt class="docutils literal"><span
class="pre">kite.flushable.commitOnBatch</span></tt> property is set to <tt
class="docutils literal"><span class="pre">false</span></tt>,
+this property must also be set to <tt class="docutils literal"><span
class="pre">false</span></tt>.</td>
+</tr>
+<tr class="row-even"><td>kite.entityParser</td>
+<td>avro</td>
+<td>Parser that turns Flume <tt class="docutils literal"><span
class="pre">Events</span></tt> into Kite entities.
+Valid values are <tt class="docutils literal"><span
class="pre">avro</span></tt> and the fully-qualified class name
+of an implementation of the <tt class="docutils literal"><span
class="pre">EntityParser.Builder</span></tt> interface.</td>
+</tr>
+<tr class="row-odd"><td>kite.failurePolicy</td>
+<td>retry</td>
+<td>Policy that handles non-recoverable errors such as a missing
+<tt class="docutils literal"><span class="pre">Schema</span></tt> in the <tt
class="docutils literal"><span class="pre">Event</span></tt> header. The
default value, <tt class="docutils literal"><span class="pre">retry</span></tt>,
+will fail the current batch and try again, which matches the old
+behavior. Other valid values are <tt class="docutils literal"><span
class="pre">save</span></tt>, which will write the
+raw <tt class="docutils literal"><span class="pre">Event</span></tt> to the
<tt class="docutils literal"><span
class="pre">kite.error.dataset.uri</span></tt> dataset, and the
+fully-qualified class name of an implementation of the
+<tt class="docutils literal"><span
class="pre">FailurePolicy.Builder</span></tt> interface.</td>
+</tr>
+<tr class="row-even"><td>kite.error.dataset.uri</td>
+<td>–</td>
+<td>URI of the dataset where failed events are saved when
+<tt class="docutils literal"><span class="pre">kite.failurePolicy</span></tt>
is set to <tt class="docutils literal"><span class="pre">save</span></tt>.
<strong>Required</strong> when
+the <tt class="docutils literal"><span
class="pre">kite.failurePolicy</span></tt> is set to <tt class="docutils
literal"><span class="pre">save</span></tt>.</td>
+</tr>
+<tr class="row-odd"><td>auth.kerberosPrincipal</td>
<td>–</td>
<td>Kerberos user principal for secure authentication to HDFS</td>
</tr>
-<tr class="row-odd"><td>auth.kerberosKeytab</td>
+<tr class="row-even"><td>auth.kerberosKeytab</td>
<td>–</td>
<td>Kerberos keytab location (local FS) for the principal</td>
</tr>
-<tr class="row-even"><td>auth.proxyUser</td>
+<tr class="row-odd"><td>auth.proxyUser</td>
<td>–</td>
<td>The effective user for HDFS actions, if different from
the kerberos principal</td>
@@ -3215,6 +3797,84 @@ the kerberos principal</td>
</tbody>
</table>
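+<p>As a hedged sketch, a Kite Dataset Sink might be configured with the newer
+<tt class="docutils literal"><span class="pre">kite.dataset.uri</span></tt> property
+instead of the deprecated repository/namespace/name triple (the URI below is
+hypothetical):</p>
+<div class="highlight-properties"><pre>a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
# hypothetical dataset URI; adjust to your repository layout
a1.sinks.k1.kite.dataset.uri = dataset:hdfs://namenode:8020/datasets/default/events
a1.sinks.k1.kite.batchSize = 100</pre>
+</div>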
</div>
+<div class="section" id="kafka-sink">
+<h4>Kafka Sink<a class="headerlink" href="#kafka-sink" title="Permalink to
this headline">¶</a></h4>
+<p>This is a Flume Sink implementation that can publish data to a
+<a class="reference external" href="http://kafka.apache.org/">Kafka</a> topic.
One of the objectives is to integrate Flume
+with Kafka so that pull-based processing systems can process the data coming
+through various Flume sources. This currently supports Kafka 0.8.x series of
releases.</p>
+<p>Required properties are marked in bold font.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="20%" />
+<col width="12%" />
+<col width="68%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>type</strong></td>
+<td>–</td>
+<td>Must be set to <tt class="docutils literal"><span
class="pre">org.apache.flume.sink.kafka.KafkaSink</span></tt></td>
+</tr>
+<tr class="row-odd"><td><strong>brokerList</strong></td>
+<td>–</td>
+<td>List of brokers Kafka-Sink will connect to, to get the list of topic
partitions.
+This can be a partial list of brokers, but we recommend at least two for HA.
+The format is a comma-separated list of hostname:port pairs.</td>
+</tr>
+<tr class="row-even"><td>topic</td>
+<td>default-flume-topic</td>
+<td>The topic in Kafka to which the messages will be published. If this
parameter is configured,
+messages will be published to this topic.
+If the event header contains a “topic” field, the event will be
published to that topic
+overriding the topic configured here.</td>
+</tr>
+<tr class="row-odd"><td>batchSize</td>
+<td>100</td>
+<td>How many messages to process in one batch. Larger batches improve
throughput while adding latency.</td>
+</tr>
+<tr class="row-even"><td>requiredAcks</td>
+<td>1</td>
+<td>How many replicas must acknowledge a message before it is considered
successfully written.
+Accepted values are 0 (never wait for acknowledgement), 1 (wait for leader
only), and -1 (wait for all replicas).
+Set this to -1 to avoid data loss in some cases of leader failure.</td>
+</tr>
+<tr class="row-odd"><td>Other Kafka Producer Properties</td>
+<td>–</td>
+<td>These properties are used to configure the Kafka Producer. Any producer
property supported
+by Kafka can be used. The only requirement is to prepend the property name
with the prefix <tt class="docutils literal"><span
class="pre">kafka.</span></tt>.
+For example: kafka.producer.type</td>
+</tr>
+</tbody>
+</table>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">Kafka Sink uses the <tt class="docutils literal"><span
class="pre">topic</span></tt> and <tt class="docutils literal"><span
class="pre">key</span></tt> properties from the FlumeEvent headers to send
events to Kafka.
+If <tt class="docutils literal"><span class="pre">topic</span></tt> exists in
the headers, the event will be sent to that specific topic, overriding the
topic configured for the Sink.
+If <tt class="docutils literal"><span class="pre">key</span></tt> exists in
+the headers, the key will be used by Kafka to partition the data among the
+topic partitions. Events with the same key will be sent to the same partition.
+If the key is null, events will be sent to random partitions.</p>
+</div>
+<p>An example configuration of a Kafka sink is given below. Properties starting
+with the prefix <tt class="docutils literal"><span
+class="pre">kafka.</span></tt> are used when instantiating
+the Kafka producer. The properties that are passed when creating the Kafka
+producer are not limited to the properties given in this example.
+It is also possible to include your custom properties here and access them
+inside the preprocessor through the Flume Context object passed in as a method
+argument.</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sinks.k1.type</span> <span class="o">=</span> <span
class="s">org.apache.flume.sink.kafka.KafkaSink</span>
+<span class="na">a1.sinks.k1.topic</span> <span class="o">=</span> <span
class="s">mytopic</span>
+<span class="na">a1.sinks.k1.brokerList</span> <span class="o">=</span> <span
class="s">localhost:9092</span>
+<span class="na">a1.sinks.k1.requiredAcks</span> <span class="o">=</span>
<span class="s">1</span>
+<span class="na">a1.sinks.k1.batchSize</span> <span class="o">=</span> <span
class="s">20</span>
+<span class="na">a1.sinks.k1.channel</span> <span class="o">=</span> <span
class="s">c1</span>
+</pre></div>
+</div>
+</div>
<div class="section" id="custom-sink">
<h4>Custom Sink<a class="headerlink" href="#custom-sink" title="Permalink to
this headline">¶</a></h4>
<p>A custom sink is your own implementation of the Sink interface. A custom
@@ -3412,14 +4072,100 @@ READ_COMMITTED, SERIALIZABLE, REPEATABLE
</pre></div>
</div>
</div>
+<div class="section" id="kafka-channel">
+<h4>Kafka Channel<a class="headerlink" href="#kafka-channel" title="Permalink
to this headline">¶</a></h4>
+<p>The events are stored in a Kafka cluster (which must be installed
+separately). Kafka provides high availability and
+replication, so if an agent or a Kafka broker crashes, the events are
+immediately available to other sinks.</p>
+<p>The Kafka channel can be used for multiple scenarios:</p>
+<ul class="simple">
+<li>With a Flume source and sink - it provides a reliable and highly available
+channel for events</li>
+<li>With a Flume source and interceptor but no sink - it allows writing Flume
+events into a Kafka topic, for use by other apps</li>
+<li>With a Flume sink but no source - it is a low-latency, fault-tolerant way to
+send events from Kafka to Flume sinks such as HDFS, HBase or Solr</li>
+</ul>
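+<p>As a sketch of the last scenario (agent, channel and topic names below are
+illustrative), a channel that drains a topic written by non-Flume producers
+into an HDFS sink might look like the following. Note that
+parseAsFlumeEvent is set to false because the producers are not Flume
+sources:</p>
+<div class="highlight-properties"><div class="highlight"><pre>a1.channels = c1
+a1.sinks = k1
+a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
+a1.channels.c1.brokerList = kafka-1:9092,kafka-2:9092
+a1.channels.c1.zookeeperConnect = zk-1:2181
+a1.channels.c1.topic = app-events
+a1.channels.c1.parseAsFlumeEvent = false
+a1.sinks.k1.type = hdfs
+a1.sinks.k1.channel = c1
+a1.sinks.k1.hdfs.path = /flume/kafka-events
+</pre></div>
+</div>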
+<p>Required properties are in <strong>bold</strong>.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="14%" />
+<col width="16%" />
+<col width="70%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name, needs to be <tt class="docutils literal"><span
class="pre">org.apache.flume.channel.kafka.KafkaChannel</span></tt></td>
+</tr>
+<tr class="row-odd"><td><strong>brokerList</strong></td>
+<td>–</td>
+<td>List of brokers in the Kafka cluster used by the channel.
+This can be a partial list of brokers, but we recommend at least two for HA.
+The format is a comma-separated list of hostname:port entries.</td>
+</tr>
+<tr class="row-even"><td><strong>zookeeperConnect</strong></td>
+<td>–</td>
+<td>URI of the ZooKeeper ensemble used by the Kafka cluster.
+The format is a comma-separated list of hostname:port entries. If a chroot is
+used, it is added once at the end.
+For example: zookeeper-1:2181,zookeeper-2:2182,zookeeper-3:2181/kafka</td>
+</tr>
+<tr class="row-odd"><td>topic</td>
+<td>flume-channel</td>
+<td>Kafka topic which the channel will use</td>
+</tr>
+<tr class="row-even"><td>groupId</td>
+<td>flume</td>
+<td>Consumer group ID the channel uses to register with Kafka.
+Multiple channels must use the same topic and group to ensure that when one
+agent fails, another can get the data.
+Note that having non-channel consumers with the same ID can lead to data
+loss.</td>
+</tr>
+<tr class="row-odd"><td>parseAsFlumeEvent</td>
+<td>true</td>
+<td>Set to true if the channel should expect Avro datums with the FlumeEvent
+schema. This should be true if a Flume source is writing to the channel,
+and false if other producers are writing into the topic that the channel is
+using.
+Flume source messages to Kafka can be parsed outside of Flume by using
+the org.apache.flume.source.avro.AvroFlumeEvent class provided by the
+flume-ng-sdk artifact.</td>
+</tr>
+<tr class="row-even"><td>readSmallestOffset</td>
+<td>false</td>
+<td>When set to true, the channel will read all data in the topic, starting
+from the oldest event; when false, it will read only events written after the
+channel started.
+When “parseAsFlumeEvent” is true, this will be false: the Flume
+source will start prior to the sinks, and this
+guarantees that events sent by the source before the sinks start will not be
+lost.</td>
+</tr>
+<tr class="row-odd"><td>Other Kafka Properties</td>
+<td>–</td>
+<td>These properties are used to configure the Kafka Producer and Consumer
used by the channel.
+Any property supported by Kafka can be used.
+The only requirement is to prepend the property name with the prefix <tt
class="docutils literal"><span class="pre">kafka.</span></tt>.
+For example: kafka.producer.type</td>
+</tr>
+</tbody>
+</table>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">Due to the way the channel is load balanced, there may be
+duplicate events when the agent first starts up.</p>
+</div>
+<p>Example for agent named a1:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.channels.channel1.type</span> <span class="o">=</span> <span
class="s">org.apache.flume.channel.kafka.KafkaChannel</span>
+<span class="na">a1.channels.channel1.capacity</span> <span class="o">=</span>
<span class="s">10000</span>
+<span class="na">a1.channels.channel1.transactionCapacity</span> <span
class="o">=</span> <span class="s">1000</span>
+<span class="na">a1.channels.channel1.brokerList</span><span
class="o">=</span><span class="s">kafka-2:9092,kafka-3:9092</span>
+<span class="na">a1.channels.channel1.topic</span><span
class="o">=</span><span class="s">channel1</span>
+<span class="na">a1.channels.channel1.zookeeperConnect</span><span
class="o">=</span><span class="s">kafka-1:2181</span>
+</pre></div>
+</div>
+</div>
<div class="section" id="file-channel">
<h4>File Channel<a class="headerlink" href="#file-channel" title="Permalink to
this headline">¶</a></h4>
<p>Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="21%" />
-<col width="14%" />
-<col width="65%" />
+<col width="20%" />
+<col width="13%" />
+<col width="67%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name Default</th>
@@ -3480,31 +4226,35 @@ READ_COMMITTED, SERIALIZABLE, REPEATABLE
<td>false</td>
<td>Expert: Replay without using queue</td>
</tr>
-<tr class="row-odd"><td>encryption.activeKey</td>
+<tr class="row-odd"><td>checkpointOnClose</td>
+<td>true</td>
+<td>Controls if a checkpoint is created when the channel is closed. Creating a
checkpoint on close speeds up subsequent startup of the file channel by
avoiding replay.</td>
+</tr>
+<tr class="row-even"><td>encryption.activeKey</td>
<td>–</td>
<td>Key name used to encrypt new data</td>
</tr>
-<tr class="row-even"><td>encryption.cipherProvider</td>
+<tr class="row-odd"><td>encryption.cipherProvider</td>
<td>–</td>
<td>Cipher provider type, supported types: AESCTRNOPADDING</td>
</tr>
-<tr class="row-odd"><td>encryption.keyProvider</td>
+<tr class="row-even"><td>encryption.keyProvider</td>
<td>–</td>
<td>Key provider type, supported types: JCEKSFILE</td>
</tr>
-<tr class="row-even"><td>encryption.keyProvider.keyStoreFile</td>
+<tr class="row-odd"><td>encryption.keyProvider.keyStoreFile</td>
<td>–</td>
<td>Path to the keystore file</td>
</tr>
-<tr class="row-odd"><td>encrpytion.keyProvider.keyStorePasswordFile</td>
+<tr class="row-even"><td>encryption.keyProvider.keyStorePasswordFile</td>
<td>–</td>
<td>Path to the keystore password file</td>
</tr>
-<tr class="row-even"><td>encryption.keyProvider.keys</td>
+<tr class="row-odd"><td>encryption.keyProvider.keys</td>
<td>–</td>
<td>List of all keys (e.g. history of the activeKey setting)</td>
</tr>
-<tr class="row-odd"><td>encyption.keyProvider.keys.*.passwordFile</td>
+<tr class="row-even"><td>encryption.keyProvider.keys.*.passwordFile</td>
<td>–</td>
<td>Path to the optional key password file</td>
</tr>
@@ -3910,7 +4660,12 @@ that so long as one is available events
<p>The failover mechanism works by relegating failed sinks to a pool where
they are assigned a cool down period, increasing with sequential failures
before they are retried. Once a sink successfully sends an event, it is
-restored to the live pool.</p>
+restored to the live pool. The sinks have a priority associated with them:
+the larger the number, the higher the priority. If a sink fails while sending
+an event, the sink with the next highest priority is tried next for sending
+events.
+For example, a sink with priority 100 is activated before a sink with priority
+80. If no priority is specified, the priority is determined based on the order
+in which
+the sinks are specified in the configuration.</p>
<p>To configure, set a sink groups processor to <tt class="docutils
literal"><span class="pre">failover</span></tt> and set
priorities for all individual sinks. All specified priorities must
be unique. Furthermore, upper limit to failover time can be set
@@ -3918,9 +4673,9 @@ be unique. Furthermore, upper limit to f
<p>Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
-<col width="26%" />
-<col width="9%" />
-<col width="66%" />
+<col width="23%" />
+<col width="8%" />
+<col width="70%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Property Name</th>
@@ -3939,11 +4694,12 @@ be unique. Furthermore, upper limit to f
</tr>
<tr
class="row-even"><td><strong>processor.priority.<sinkName></strong></td>
<td>–</td>
-<td><sinkName> must be one of the sink instances associated with the
current sink group</td>
+<td>Priority value. <sinkName> must be one of the sink instances
+associated with the current sink group.
+A sink with a higher priority value (larger absolute value) is activated
+earlier.</td>
</tr>
<tr class="row-odd"><td>processor.maxpenalty</td>
<td>30000</td>
-<td>(in millis)</td>
+<td>The maximum backoff period for the failed Sink (in millis)</td>
</tr>
</tbody>
</table>
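+<p>For example, a failover sink group for an agent named a1 (sink names are
+illustrative) could be configured as:</p>
+<div class="highlight-properties"><div class="highlight"><pre>a1.sinkgroups = g1
+a1.sinkgroups.g1.sinks = k1 k2
+a1.sinkgroups.g1.processor.type = failover
+a1.sinkgroups.g1.processor.priority.k1 = 5
+a1.sinkgroups.g1.processor.priority.k2 = 10
+a1.sinkgroups.g1.processor.maxpenalty = 10000
+</pre></div>
+</div>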
@@ -4347,6 +5103,61 @@ MorphlineInterceptor can also help to im
</pre></div>
</div>
</div>
+<div class="section" id="search-and-replace-interceptor">
+<h4>Search and Replace Interceptor<a class="headerlink"
href="#search-and-replace-interceptor" title="Permalink to this
headline">¶</a></h4>
+<p>This interceptor provides simple string-based search-and-replace
functionality
+based on Java regular expressions. Backtracking / group capture is also
available.
+This interceptor uses the same rules as in the Java Matcher.replaceAll()
method.</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="17%" />
+<col width="7%" />
+<col width="76%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Property Name</th>
+<th class="head">Default</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td><strong>type</strong></td>
+<td>–</td>
+<td>The component type name has to be <tt class="docutils literal"><span
class="pre">search_replace</span></tt></td>
+</tr>
+<tr class="row-odd"><td>searchPattern</td>
+<td>–</td>
+<td>The pattern to search for and replace.</td>
+</tr>
+<tr class="row-even"><td>replaceString</td>
+<td>–</td>
+<td>The replacement string.</td>
+</tr>
+<tr class="row-odd"><td>charset</td>
+<td>UTF-8</td>
+<td>The charset of the event body. Assumed by default to be UTF-8.</td>
+</tr>
+</tbody>
+</table>
+<p>Example configuration:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources.avroSrc.interceptors</span> <span class="o">=</span>
<span class="s">search-replace</span>
+<span class="na">a1.sources.avroSrc.interceptors.search-replace.type</span>
<span class="o">=</span> <span class="s">search_replace</span>
+
+<span class="c"># Remove leading alphanumeric characters in an event
body.</span>
+<span
class="na">a1.sources.avroSrc.interceptors.search-replace.searchPattern</span>
<span class="o">=</span> <span class="s">^[A-Za-z0-9_]+</span>
+<span
class="na">a1.sources.avroSrc.interceptors.search-replace.replaceString</span>
<span class="o">=</span>
+</pre></div>
+</div>
+<p>Another example:</p>
+<div class="highlight-properties"><div class="highlight"><pre><span
class="na">a1.sources.avroSrc.interceptors</span> <span class="o">=</span>
<span class="s">search-replace</span>
+<span class="na">a1.sources.avroSrc.interceptors.search-replace.type</span>
<span class="o">=</span> <span class="s">search_replace</span>
+
+<span class="c"># Use grouping operators to reorder and munge words on a
line.</span>
+<span
class="na">a1.sources.avroSrc.interceptors.search-replace.searchPattern</span>
<span class="o">=</span> <span class="s">The quick brown ([a-z]+) jumped over
the lazy ([a-z]+)</span>
+<span
class="na">a1.sources.avroSrc.interceptors.search-replace.replaceString</span>
<span class="o">=</span> <span class="s">The hungry $2 ate the careless
$1</span>
+</pre></div>
+</div>
+</div>
<div class="section" id="regex-filtering-interceptor">
<h4>Regex Filtering Interceptor<a class="headerlink"
href="#regex-filtering-interceptor" title="Permalink to this
headline">¶</a></h4>
<p>This interceptor filters events selectively by interpreting the event body
as text and matching the text against a configured regular expression.
@@ -4509,7 +5320,7 @@ polling rather than terminating.</p>
<h2>Log4J Appender<a class="headerlink" href="#log4j-appender"
title="Permalink to this headline">¶</a></h2>
<p>Appends Log4j events to a flume agent’s avro source. A client using
this
appender must have the flume-ng-sdk in the classpath (eg,
-flume-ng-sdk-1.5.2.jar).
+flume-ng-sdk-1.6.0.jar).
Required properties are in <strong>bold</strong>.</p>
<table border="1" class="docutils">
<colgroup>
@@ -4589,7 +5400,7 @@ then the schema will be included as a Fl
<h2>Load Balancing Log4J Appender<a class="headerlink"
href="#load-balancing-log4j-appender" title="Permalink to this
headline">¶</a></h2>
<p>Appends Log4j events to a list of flume agent’s avro source. A client
using this
appender must have the flume-ng-sdk in the classpath (eg,
-flume-ng-sdk-1.5.2.jar). This appender supports a round-robin and random
+flume-ng-sdk-1.6.0.jar). This appender supports a round-robin and random
scheme for performing the load balancing. It also supports a configurable
backoff
timeout so that down agents are removed temporarily from the set of hosts
Required properties are in <strong>bold</strong>.</p>
@@ -4674,15 +5485,27 @@ send the events.</td>
</div>
<div class="section" id="security">
<h2>Security<a class="headerlink" href="#security" title="Permalink to this
headline">¶</a></h2>
-<p>The HDFS sink supports Kerberos authentication if the underlying HDFS is
-running in secure mode. Please refer to the HDFS Sink section for
-configuring the HDFS sink Kerberos-related options.</p>
+<p>The HDFS sink, HBase sink, Thrift source, Thrift sink and Kite Dataset sink
all support
+Kerberos authentication. Please refer to the corresponding sections for
+configuring the Kerberos-related options.</p>
+<p>The Flume agent authenticates to the Kerberos KDC as a single principal,
+which is
+used by the different components that require Kerberos authentication. The
+principal and
+keytab configured for the Thrift source, Thrift sink, HDFS sink, HBase sink and
+Kite Dataset sink
+should be the same; otherwise the component will fail to start.</p>
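+<p>For example, a minimal Kerberos setup for an HDFS sink (the principal and
+keytab path below are illustrative) looks like:</p>
+<div class="highlight-properties"><div class="highlight"><pre>a1.sinks.k1.type = hdfs
+a1.sinks.k1.hdfs.path = /flume/events
+a1.sinks.k1.hdfs.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
+a1.sinks.k1.hdfs.kerberosKeytab = /etc/security/keytabs/flume.keytab
+</pre></div>
+</div>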
</div>
<div class="section" id="monitoring">
<h2>Monitoring<a class="headerlink" href="#monitoring" title="Permalink to
this headline">¶</a></h2>
<p>Monitoring in Flume is still a work in progress. Changes can happen very
often.
Several Flume components report metrics to the JMX platform MBean server. These
metrics can be queried using Jconsole.</p>
+<div class="section" id="jmx-reporting">
+<h3>JMX Reporting<a class="headerlink" href="#jmx-reporting" title="Permalink
to this headline">¶</a></h3>
+<p>JMX Reporting can be enabled by specifying JMX parameters in the JAVA_OPTS
+environment variable using
+flume-env.sh, for example:</p>
+<blockquote>
+<div>export JAVA_OPTS="-Dcom.sun.management.jmxremote
+-Dcom.sun.management.jmxremote.port=5445
+-Dcom.sun.management.jmxremote.authenticate=false
+-Dcom.sun.management.jmxremote.ssl=false"</div></blockquote>
+<p>NOTE: The sample above disables security. To enable security, please refer
+to <a class="reference external"
+href="http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html">http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html</a></p>
+</div>
<div class="section" id="ganglia-reporting">
<h3>Ganglia Reporting<a class="headerlink" href="#ganglia-reporting"
title="Permalink to this headline">¶</a></h3>
<p>Flume can also report these metrics to
@@ -4881,6 +5704,39 @@ metrics as long values.</p>
</div>
</div>
</div>
+<div class="section" id="tools">
+<h2>Tools<a class="headerlink" href="#tools" title="Permalink to this
headline">¶</a></h2>
+<div class="section" id="file-channel-integrity-tool">
+<h3>File Channel Integrity Tool<a class="headerlink"
href="#file-channel-integrity-tool" title="Permalink to this
headline">¶</a></h3>
+<p>The File Channel Integrity tool verifies the integrity of individual events
+in the File channel
+and removes corrupted events.</p>
+<p>The tool can be run as follows:</p>
+<div class="highlight-none"><div class="highlight"><pre>$ bin/flume-ng tool --conf ./conf FCINTEGRITYTOOL -l ./datadir
+</pre></div>
+</div>
+<p>where datadir is the comma-separated list of data directories to be
+verified.</p>
+<p>The following options are available:</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="25%" />
+<col width="75%" />
+</colgroup>
+<thead valign="bottom">
+<tr class="row-odd"><th class="head">Option Name</th>
+<th class="head">Description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr class="row-even"><td>h/help</td>
+<td>Displays help</td>
+</tr>
+<tr class="row-odd"><td><strong>l/dataDirs</strong></td>
+<td>Comma-separated list of data directories which the tool must verify</td>
+</tr>
+</tbody>
+</table>
+</div>
+</div>
<div class="section" id="topology-design-considerations">
<h2>Topology Design Considerations<a class="headerlink"
href="#topology-design-considerations" title="Permalink to this
headline">¶</a></h2>
<p>Flume is very flexible and allows a large range of possible deployment
@@ -5303,7 +6159,7 @@ can be leveraged to move the Flume agent
<h3><a href="index.html">This Page</a></h3>
<ul>
-<li><a class="reference internal" href="#">Flume 1.5.2 User Guide</a><ul>
+<li><a class="reference internal" href="#">Flume 1.6.0 User Guide</a><ul>
<li><a class="reference internal" href="#introduction">Introduction</a><ul>
<li><a class="reference internal" href="#overview">Overview</a></li>
<li><a class="reference internal" href="#system-requirements">System
Requirements</a></li>
@@ -5322,6 +6178,7 @@ can be leveraged to move the Flume agent
<li><a class="reference internal" href="#wiring-the-pieces-together">Wiring
the pieces together</a></li>
<li><a class="reference internal" href="#starting-an-agent">Starting an
agent</a></li>
<li><a class="reference internal" href="#a-simple-example">A simple
example</a></li>
+<li><a class="reference internal"
href="#zookeeper-based-configuration">Zookeeper based Configuration</a></li>
<li><a class="reference internal"
href="#installing-third-party-plugins">Installing third-party plugins</a><ul>
<li><a class="reference internal" href="#the-plugins-d-directory">The
plugins.d directory</a></li>
<li><a class="reference internal"
href="#directory-layout-for-plugins">Directory layout for plugins</a></li>
@@ -5354,8 +6211,7 @@ can be leveraged to move the Flume agent
<li><a class="reference internal" href="#converter">Converter</a></li>
</ul>
</li>
-<li><a class="reference internal" href="#spooling-directory-source">Spooling
Directory Source</a></li>
-<li><a class="reference internal"
href="#twitter-1-firehose-source-experimental">Twitter 1% firehose Source
(experimental)</a><ul>
+<li><a class="reference internal" href="#spooling-directory-source">Spooling
Directory Source</a><ul>
<li><a class="reference internal" href="#event-deserializers">Event
Deserializers</a><ul>
<li><a class="reference internal" href="#line">LINE</a></li>
<li><a class="reference internal" href="#avro">AVRO</a></li>
@@ -5364,6 +6220,8 @@ can be leveraged to move the Flume agent
</li>
</ul>
</li>
+<li><a class="reference internal"
href="#twitter-1-firehose-source-experimental">Twitter 1% firehose Source
(experimental)</a></li>
+<li><a class="reference internal" href="#kafka-source">Kafka Source</a></li>
<li><a class="reference internal" href="#netcat-source">NetCat Source</a></li>
<li><a class="reference internal" href="#sequence-generator-source">Sequence
Generator Source</a></li>
<li><a class="reference internal" href="#syslog-sources">Syslog Sources</a><ul>
@@ -5377,6 +6235,7 @@ can be leveraged to move the Flume agent
<li><a class="reference internal" href="#blobhandler">BlobHandler</a></li>
</ul>
</li>
+<li><a class="reference internal" href="#stress-source">Stress Source</a></li>
<li><a class="reference internal" href="#legacy-sources">Legacy Sources</a><ul>
<li><a class="reference internal" href="#avro-legacy-source">Avro Legacy
Source</a></li>
<li><a class="reference internal" href="#thrift-legacy-source">Thrift Legacy
Source</a></li>
@@ -5388,6 +6247,7 @@ can be leveraged to move the Flume agent
</li>
<li><a class="reference internal" href="#flume-sinks">Flume Sinks</a><ul>
<li><a class="reference internal" href="#hdfs-sink">HDFS Sink</a></li>
+<li><a class="reference internal" href="#hive-sink">Hive Sink</a></li>
<li><a class="reference internal" href="#logger-sink">Logger Sink</a></li>
<li><a class="reference internal" href="#avro-sink">Avro Sink</a></li>
<li><a class="reference internal" href="#thrift-sink">Thrift Sink</a></li>
@@ -5401,13 +6261,15 @@ can be leveraged to move the Flume agent
</li>
<li><a class="reference internal"
href="#morphlinesolrsink">MorphlineSolrSink</a></li>
<li><a class="reference internal"
href="#elasticsearchsink">ElasticSearchSink</a></li>
-<li><a class="reference internal" href="#kite-dataset-sink-experimental">Kite
Dataset Sink (experimental)</a></li>
+<li><a class="reference internal" href="#kite-dataset-sink">Kite Dataset
Sink</a></li>
+<li><a class="reference internal" href="#kafka-sink">Kafka Sink</a></li>
<li><a class="reference internal" href="#custom-sink">Custom Sink</a></li>
</ul>
</li>
<li><a class="reference internal" href="#flume-channels">Flume Channels</a><ul>
<li><a class="reference internal" href="#memory-channel">Memory
Channel</a></li>
<li><a class="reference internal" href="#jdbc-channel">JDBC Channel</a></li>
+<li><a class="reference internal" href="#kafka-channel">Kafka Channel</a></li>
<li><a class="reference internal" href="#file-channel">File Channel</a></li>
[... 29 lines stripped ...]