git commit: FLUME-2105. Add docs for MorphlineSolrSink.

mpercy Mon, 24 Jun 2013 14:20:37 -0700

Updated Branches:
  refs/heads/flume-1.4 807215715 -> 00f1236b9


FLUME-2105. Add docs for MorphlineSolrSink.

(Wolfgang Hoschek via Mike Percy)


Project: http://git-wip-us.apache.org/repos/asf/flume/repo
Commit: http://git-wip-us.apache.org/repos/asf/flume/commit/00f1236b
Tree: http://git-wip-us.apache.org/repos/asf/flume/tree/00f1236b
Diff: http://git-wip-us.apache.org/repos/asf/flume/diff/00f1236b

Branch: refs/heads/flume-1.4
Commit: 00f1236b9073844475e51a09d4f51430538eb6be
Parents: 8072157
Author: Mike Percy <[email protected]>
Authored: Mon Jun 24 14:19:32 2013 -0700
Committer: Mike Percy <[email protected]>
Committed: Mon Jun 24 14:19:32 2013 -0700

----------------------------------------------------------------------
 flume-ng-doc/sphinx/FlumeUserGuide.rst | 118 ++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flume/blob/00f1236b/flume-ng-doc/sphinx/FlumeUserGuide.rst
----------------------------------------------------------------------
diff --git a/flume-ng-doc/sphinx/FlumeUserGuide.rst 
b/flume-ng-doc/sphinx/FlumeUserGuide.rst
index 816a5ed..f244aea 100644
--- a/flume-ng-doc/sphinx/FlumeUserGuide.rst
+++ b/flume-ng-doc/sphinx/FlumeUserGuide.rst
@@ -1007,6 +1007,19 @@ deserializer.schemaType         HASH            How the 
schema is represented. B
                                                 inefficient compared to 
``HASH`` mode.
 ==============================  ==============  
======================================================================
 
+BlobDeserializer
+^^^^^^^^^^^^^^^^
+
+This deserializer reads a Binary Large Object (BLOB) per event, typically one 
BLOB per file. For example a PDF or JPG file. Note that this approach is not 
suitable for very large objects because the entire BLOB is buffered in RAM.
+
+==========================  ==================  
=======================================================================
+Property Name               Default             Description
+==========================  ==================  
=======================================================================
+**deserializer**            --                  The FQCN of this class: 
``org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder``
+deserializer.maxBlobLength  100000000           The maximum number of bytes to 
read and buffer for a given request
+==========================  ==================  
=======================================================================
+
+
 NetCat Source
 ~~~~~~~~~~~~~
 
@@ -1265,6 +1278,17 @@ for list of events can be created by:
 
   Type type = new TypeToken<List<JSONEvent>>() {}.getType();
 
+BlobHandler
+'''''''''''
+By default HTTPSource splits JSON input into Flume events. As an alternative, 
BlobHandler is a handler for HTTPSource that returns an event that contains the 
request parameters as well as the Binary Large Object (BLOB) uploaded with this 
request. For example a PDF or JPG file. Note that this approach is not suitable 
for very large objects because it buffers up the entire BLOB in RAM.
+
+=====================  ==================  
============================================================================
+Property Name          Default             Description
+=====================  ==================  
============================================================================
+**handler**            --                  The FQCN of this class: 
``org.apache.flume.sink.solr.morphline.BlobHandler``
+handler.maxBlobLength  100000000           The maximum number of bytes to read 
and buffer for a given request
+=====================  ==================  
============================================================================
+
 Legacy Sources
 ~~~~~~~~~~~~~~
 
@@ -1804,6 +1828,56 @@ Example for agent named a1:
   a1.sinks.k1.serializer = 
org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
   a1.sinks.k1.channel = c1
 
+MorphlineSolrSink
+~~~~~~~~~~~~~~~~~
+
+This sink extracts data from Flume events, transforms it, and loads it in 
near-real-time into Apache Solr servers, which in turn serve queries to end 
users or search applications.
+
+This sink is well suited for use cases that stream raw data into HDFS (via the 
HdfsSink) and simultaneously extract, transform and load the same data into 
Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary 
heterogeneous raw data from disparate data sources and turn it into a data 
model that is useful to Search applications.
+
+The ETL functionality is customizable using a `morphline configuration file 
<http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that 
defines a chain of transformation commands that pipe event records from one 
command to another. 
+
+Morphlines can be seen as an evolution of Unix pipelines where the data model 
is generalized to work with streams of generic records, including arbitrary 
binary payloads. A morphline command is a bit like a Flume Interceptor. 
Morphlines can be embedded into Hadoop components such as Flume.
+
+Commands to parse and transform a set of standard data formats such as log 
files, Avro, CSV, Text, HTML, XML, PDF, Word, Excel, etc. are provided out of 
the box, and additional custom commands and parsers for additional data formats 
can be added as morphline plugins. Any kind of data format can be indexed and 
any Solr documents for any kind of Solr schema can be generated, and any custom 
ETL logic can be registered and executed.
+
+Morphlines manipulate continuous streams of records. The data model can be 
described as follows: A record is a set of named fields where each field has an 
ordered list of one or more values. A value can be any Java Object. That is, a 
record is essentially a hash table where each hash table entry contains a 
String key and a list of Java Objects as values. (The implementation uses 
Guava's ``ArrayListMultimap``, which is a ``ListMultimap``). Note that a field 
can have multiple values and any two records need not use common field names. 
+
+This sink fills the body of the Flume event into the ``_attachment_body`` 
field of the morphline record, as well as copies the headers of the Flume event 
into record fields of the same name. The commands can then act on this data.
+
+Routing to a SolrCloud cluster is supported to improve scalability. Indexing 
load can be spread across a large number of MorphlineSolrSinks for improved 
scalability. Indexing load can be replicated across multiple MorphlineSolrSinks 
for high availability, for example using Flume features such as Load balancing 
Sink Processor. MorphlineInterceptor can also help to implement dynamic routing 
to multiple Solr collections (e.g. for multi-tenancy).
+
+The morphline and solr jars required for your environment must be placed in 
the lib directory of the Apache Flume installation. 
+
+The type is the FQCN: org.apache.flume.sink.solr.morphline.MorphlineSolrSink
+
+Required properties are in **bold**.
+
+===================  
=======================================================================  
========================
+Property Name        Default                                                   
               Description
+===================  
=======================================================================  
========================
+**channel**          --
+**type**             --                                                        
               The component type name, needs to be 
``org.apache.flume.sink.solr.morphline.MorphlineSolrSink``
+**morphlineFile**    --                                                        
               The relative or absolute path on the local file system to the 
morphline configuration file. Example: ``/etc/flume-ng/conf/morphline.conf``
+morphlineId          null                                                      
               Optional name used to identify a morphline if there are multiple 
morphlines in a morphline config file
+batchSize            1000                                                      
               The maximum number of events to take per flume transaction.
+batchDurationMillis  1000                                                      
               The maximum duration per flume transaction (ms). The transaction 
commits after this duration or when batchSize is exceeded, whichever comes 
first.
+handlerClass         org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl 
               The FQCN of a class implementing 
org.apache.flume.sink.solr.morphline.MorphlineHandler
+===================  
=======================================================================  
========================
+
+Example for agent named a1:
+
+.. code-block:: properties
+
+  a1.channels = c1
+  a1.sinks = k1
+  a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
+  a1.sinks.k1.channel = c1
+  a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
+  # a1.sinks.k1.morphlineId = morphline1
+  # a1.sinks.k1.batchSize = 1000
+  # a1.sinks.k1.batchDurationMillis = 1000
+
 ElasticSearchSink
 ~~~~~~~~~~~~~~~~~
 
@@ -2502,6 +2576,50 @@ Example for agent named a1:
   a1.sources.r1.interceptors.i1.key = datacenter
   a1.sources.r1.interceptors.i1.value = NEW_YORK
 
+UUID Interceptor
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This interceptor sets a universally unique identifier on all events that are 
intercepted. An example UUID is ``b5755073-77a9-43c1-8fad-b7a586fc1b97``, which 
represents a 128-bit value.
+
+Consider using UUIDInterceptor to automatically assign a UUID to an event if 
no application level unique key for the event is available. It can be important 
to assign UUIDs to events as soon as they enter the Flume network; that is, in 
the first Flume Source of the flow. This enables subsequent deduplication of 
events in the face of replication and redelivery in a Flume network that is 
designed for high availability and high performance. If an application level 
key is available, this is preferable over an auto-generated UUID because it 
enables subsequent updates and deletes of event in data stores using said well 
known application level key.
+
+================  =======  
========================================================================
+Property Name     Default  Description
+================  =======  
========================================================================
+**type**          --       The component type name has to be 
``org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder``
+headerName        id       The name of the Flume header to modify
+preserveExisting  true     If the UUID header already exists, should it be 
preserved - true or false
+prefix            ""       The prefix string constant to prepend to each 
generated UUID
+================  =======  
========================================================================
+
+Morphline Interceptor
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This interceptor filters the events through a `morphline configuration file 
<http://cloudera.github.io/cdk/docs/0.4.0/cdk-morphlines/index.html>`_ that 
defines a chain of transformation commands that pipe records from one command 
to another.
+For example the morphline can ignore certain events or alter or insert certain 
event headers via regular expression based pattern matching, or it can 
auto-detect and set a MIME type via Apache Tika on events that are intercepted. 
For example, this kind of packet sniffing can be used for content based dynamic 
routing in a Flume topology.
+MorphlineInterceptor can also help to implement dynamic routing to multiple 
Apache Solr collections (e.g. for multi-tenancy).
+
+Currently, there is a restriction in that the morphline of an interceptor must 
not generate more than one output record for each input event. This interceptor 
is not intended for heavy duty ETL processing - if you need this consider 
moving ETL processing from the Flume Source to a Flume Sink, e.g. to a 
MorphlineSolrSink.
+
+Required properties are in **bold**.
+
+=================  =======  
========================================================================
+Property Name      Default  Description
+=================  =======  
========================================================================
+**type**           --       The component type name has to be 
``org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder``
+**morphlineFile**  --       The relative or absolute path on the local file 
system to the morphline configuration file. Example: 
``/etc/flume-ng/conf/morphline.conf``
+morphlineId        null     Optional name used to identify a morphline if 
there are multiple morphlines in a morphline config file
+=================  =======  
========================================================================
+
+Sample flume.conf file:
+
+.. code-block:: properties
+
+  a1.sources.avroSrc.interceptors = morphlineinterceptor
+  a1.sources.avroSrc.interceptors.morphlineinterceptor.type = 
org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
+  a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = 
/etc/flume-ng/conf/morphline.conf
+  a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1
+
 Regex Filtering Interceptor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

git commit: FLUME-2105. Add docs for MorphlineSolrSink.

Reply via email to