Repository: falcon
Updated Branches:
  refs/heads/master 5abb15575 -> 66c6e4e85
FALCON-1806 : Update documentation for Import and Export. Contributed by Venkatesan Ramachandran


Project: http://git-wip-us.apache.org/repos/asf/falcon/repo
Commit: http://git-wip-us.apache.org/repos/asf/falcon/commit/66c6e4e8
Tree: http://git-wip-us.apache.org/repos/asf/falcon/tree/66c6e4e8
Diff: http://git-wip-us.apache.org/repos/asf/falcon/diff/66c6e4e8

Branch: refs/heads/master
Commit: 66c6e4e8542ee778b2aa8b60f24baa6f6576609a
Parents: 5abb155
Author: bvellanki <[email protected]>
Authored: Thu Feb 4 15:02:17 2016 -0800
Committer: bvellanki <[email protected]>
Committed: Thu Feb 4 15:02:17 2016 -0800

----------------------------------------------------------------------
 CHANGES.txt                                   |   2 +
 docs/src/site/twiki/FalconDocumentation.twiki |   5 +
 docs/src/site/twiki/ImportExport.twiki        | 242 +++++++++++++++++++++
 3 files changed, 249 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/falcon/blob/66c6e4e8/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index afefbc8..fafe6d5 100755
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -26,6 +26,8 @@ Trunk
     FALCON-1770 Update README file (Ajay Yadava)

   BUG FIXES
+    FALCON-1806 Update documentation for Import and Export. (Venkatesan Ramachandran via Balu Vellanki)
+
     FALCON-1787 Ooozie pig-action.xml requires hive sharelib for HCatalog use(Sowmya Ramesh via Ajay Yadava)

     FALCON-1792 Upgrade hadoop.version to 2.6.2 (Venkatesan Ramachandran via Peeyush Bishnoi)


http://git-wip-us.apache.org/repos/asf/falcon/blob/66c6e4e8/docs/src/site/twiki/FalconDocumentation.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki b/docs/src/site/twiki/FalconDocumentation.twiki
index 21b22ff..122435a 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -18,6 +18,7 @@
    * <a href="#Email_Notification">Email Notification</a>
    * <a href="#Backwards_Compatibility">Backwards Compatibility Instructions</a>
    * <a href="#Proxyuser_support">Proxyuser support</a>
+   * <a href="#ImportExport">Data Import and Export</a>

 ---++ Architecture

@@ -768,5 +769,9 @@ the wildcard value * may be used to allow impersonation from any host or of any
 -doAs option via CLI or doAs query parameter can be appended if using API to enable impersonation.

+---++ ImportExport
+
+Data Import and Export is detailed in [[ImportExport][Data Import and Export]].
+

http://git-wip-us.apache.org/repos/asf/falcon/blob/66c6e4e8/docs/src/site/twiki/ImportExport.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/ImportExport.twiki b/docs/src/site/twiki/ImportExport.twiki
new file mode 100644
index 0000000..b0ce7ff
--- /dev/null
+++ b/docs/src/site/twiki/ImportExport.twiki
@@ -0,0 +1,242 @@
+---+ Falcon Data Import and Export
+
+
+---++ Overview
+
+Falcon provides constructs to periodically bring raw data from external data sources (such as databases, drop boxes, etc.)
+onto Hadoop and to push derived data computed on Hadoop onto external data sources.
+
+As of this release, Falcon supports only relational databases (e.g. Oracle, MySQL) via JDBC as external data sources.
+Future releases will add support for other external data sources.
+
+
+---++ Prerequisites
+
+The following are the prerequisites for importing data from and exporting data to databases.
+
+ * *Sqoop 1.4.6+*
+ * *Oozie 4.2.0+*
+ * *Appropriate database connector*
+
+
+*Note:* Falcon uses Sqoop for the import/export operations. Sqoop requires the appropriate database driver to connect to
+the relational database. Please refer to the Sqoop documentation for any Sqoop-related questions. Please make sure
+the database driver jar is copied into the Oozie share lib for Sqoop.
+
+<verbatim>
+For example, in order to import from and export to MySQL, please make sure the latest MySQL connector
+mysql-connector-java-5.1.31.jar is copied into Oozie's Sqoop share lib
+
+/user/oozie/share/lib/{lib-dir}/sqoop/mysql-connector-java-5.1.31.jar
+
+where the {lib-dir} value varies across Oozie deployments.
+
+</verbatim>
+
+---++ Usage
+---+++ Entity Definition and Setup
+ * *Datasource Entity*
+   The Datasource entity abstracts the connection and credential details of an external data source. The Datasource entity
+   supports read and write interfaces with specific credentials. The default credential will be used if the read
+   or write interface does not have its own credentials. In general, the Datasource entity will be defined by
+   a system administrator. Please refer to the datasource XSD for more details.
+
+   The following example defines a Datasource entity for a MySQL database. The import operation will use
+   the read interface with the url "jdbc:mysql://dbhost/test", user name "import_usr" and password text "sqoop",
+   whereas the export operation will use the write interface with the url "jdbc:mysql://dbhost/test", user
+   name "export_usr" and a password stored in an HDFS file at the location "/user/ambari-qa/password-store/password_write_user".
+
+   The default credential will be used if either the read or write interface does not provide its own
+   credentials. The default credential specifies the password using the password alias feature available via the hadoop
+   credential functionality. Users can create a password alias using the "hadoop credential create <alias> -provider
+   <provider-path>" command, where <alias> is a string and <provider-path> is an HDFS jceks file. At runtime,
+   the specified alias will be used to look up the password stored encrypted in the jceks HDFS file specified under
+   the providerPath element.
+
+   The available read and write interfaces enable database administrators to segregate read and write workloads.
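+
+   For illustration, the password alias used by the default credential above could be created along the following
+   lines. The alias name and the jceks store location here are only examples; the provider URI should point at the
+   same HDFS jceks file that the providerPath element of the Datasource entity refers to.
+
+   <verbatim>
+   ## create an encrypted password alias in an HDFS-backed jceks store (illustrative values)
+   hadoop credential create sqoop.password.alias -provider jceks://hdfs/user/ambari-qa/sqoop_password.jceks
+   </verbatim>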
+
+   <verbatim>
+
+   File: mysql-database.xml
+
+   <?xml version="1.0" encoding="UTF-8"?>
+   <datasource colo="west-coast" description="MySQL database on west coast" type="mysql" name="mysql-db" xmlns="uri:falcon:datasource:0.1">
+       <tags>[email protected], [email protected]</tags>
+       <interfaces>
+           <!-- ***** read interface ***** -->
+           <interface type="readonly" endpoint="jdbc:mysql://dbhost/test">
+               <credential type="password-text">
+                   <userName>import_usr</userName>
+                   <passwordText>sqoop</passwordText>
+               </credential>
+           </interface>
+
+           <!-- ***** write interface ***** -->
+           <interface type="write" endpoint="jdbc:mysql://dbhost/test">
+               <credential type="password-file">
+                   <userName>export_usr</userName>
+                   <passwordFile>/user/ambari-qa/password-store/password_write_user</passwordFile>
+               </credential>
+           </interface>
+
+           <!-- *** default credential *** -->
+           <credential type="password-alias">
+               <userName>sqoop2_user</userName>
+               <passwordAlias>
+                   <alias>sqoop.password.alias</alias>
+                   <providerPath>hdfs://namenode:8020/user/ambari-qa/sqoop_password.jceks</providerPath>
+               </passwordAlias>
+           </credential>
+
+       </interfaces>
+
+       <driver>
+           <clazz>com.mysql.jdbc.Driver</clazz>
+           <jar>/user/oozie/share/lib/lib_20150721010816/sqoop/mysql-connector-java-5.1.31.jar</jar>
+       </driver>
+   </datasource>
+   </verbatim>
+
+ * *Feed Entity*
+   The Feed entity now enables users to define IMPORT and EXPORT policies in addition to RETENTION and REPLICATION.
+   The IMPORT and EXPORT policies refer to an already defined Datasource entity for connection and credential
+   details and take the table name to operate on from the policy. Please refer to the feed entity XSD for details.
+
+   The following example defines a Feed entity with IMPORT and EXPORT policies. Both the IMPORT and EXPORT operations
+   refer to the datasource entity "mysql-db". The IMPORT operation will use the read interface and credentials, while
+   the EXPORT operation will use the write interface and credentials. A feed instance is created every hour,
+   since the frequency of the Feed is hours(1), and Feed instances are deleted after 90 days because of the
+   retention policy.
+
+
+   <verbatim>
+
+   File: customer_email_feed.xml
+
+   <?xml version="1.0" encoding="UTF-8"?>
+   <!--
+    A feed representing hourly customer email data retained for 90 days
+    -->
+   <feed description="Raw customer email feed" name="customer_feed" xmlns="uri:falcon:feed:0.1">
+       <tags>externalSystem=USWestEmailServers,classification=secure</tags>
+       <groups>DataImportPipeline</groups>
+       <frequency>hours(1)</frequency>
+       <late-arrival cut-off="hours(4)"/>
+       <clusters>
+           <cluster name="primaryCluster" type="source">
+               <validity start="2015-12-15T00:00Z" end="2016-03-31T00:00Z"/>
+               <retention limit="days(90)" action="delete"/>
+               <import>
+                   <source name="mysql-db" tableName="simple">
+                       <extract type="full">
+                           <mergepolicy>snapshot</mergepolicy>
+                       </extract>
+                       <fields>
+                           <includes>
+                               <field>id</field>
+                               <field>name</field>
+                           </includes>
+                       </fields>
+                   </source>
+                   <arguments>
+                       <argument name="--split-by" value="id"/>
+                       <argument name="--num-mappers" value="2"/>
+                   </arguments>
+               </import>
+               <export>
+                   <target name="mysql-db" tableName="simple_export">
+                       <load type="insert"/>
+                       <fields>
+                           <includes>
+                               <field>id</field>
+                               <field>name</field>
+                           </includes>
+                       </fields>
+                   </target>
+                   <arguments>
+                       <argument name="--update-key" value="id"/>
+                   </arguments>
+               </export>
+           </cluster>
+       </clusters>
+
+       <locations>
+           <location type="data" path="/user/ambari-qa/falcon/demo/primary/importfeed/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}"/>
+           <location type="stats" path="/none"/>
+           <location type="meta" path="/none"/>
+       </locations>
+
+       <ACL owner="ambari-qa" group="users" permission="0755"/>
+       <schema location="/none" provider="none"/>
+
+   </feed>
+   </verbatim>
+
+ * *Import policy*
+   The import policy uses the datasource entity specified in the "source" element to connect to the database. The tableName
+   specified should exist in the source datasource.
+
+   The extraction type specifies whether to pull data from the external datasource in "full" every time or "incrementally".
+   The mergepolicy specifies how to organize the data on Hadoop (snapshot or append, i.e. time-series partitions).
+   The valid combinations are:
+      * [full, snapshot] - data is extracted in full and dumped into the feed instance location.
+      * [incremental, append] - data is extracted incrementally using the key specified in the *deltacolumn*
+        and added as a partition to the feed instance location.
+      * [incremental, snapshot] - data is extracted incrementally and merged with the data already existing on Hadoop to
+        produce one latest feed instance. *This feature is not currently supported.* The use case for this feature is
+        to efficiently import very large dimension tables that have updates and inserts onto Hadoop and make them available
+        as a snapshot with the latest updates to consumers.
+
+   The following example defines an incremental extraction with append organization:
+
+   <verbatim>
+        <import>
+            <source name="mysql-db" tableName="simple">
+                <extract type="incremental">
+                    <deltacolumn>modified_time</deltacolumn>
+                    <mergepolicy>append</mergepolicy>
+                </extract>
+                <fields>
+                    <includes>
+                        <field>id</field>
+                        <field>name</field>
+                    </includes>
+                </fields>
+            </source>
+            <arguments>
+                <argument name="--split-by" value="id"/>
+                <argument name="--num-mappers" value="2"/>
+            </arguments>
+        </import>
+   </verbatim>
+
+
+   The fields option controls which fields get imported. By default, all fields get imported. The "includes" option
+   brings in only the specified fields; the "excludes" option brings in all fields other than those specified.
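+
+   For example, an import source that brings in every column except a few sensitive ones could be written as below.
+   This is only a sketch: the excluded field names are illustrative, and it assumes the "excludes" element nests
+   "field" elements the same way "includes" does in the examples above (please refer to the feed entity XSD).
+
+   <verbatim>
+        <source name="mysql-db" tableName="simple">
+            <extract type="full">
+                <mergepolicy>snapshot</mergepolicy>
+            </extract>
+            <fields>
+                <!-- illustrative column names; excludes is assumed to mirror the includes syntax -->
+                <excludes>
+                    <field>ssn</field>
+                    <field>credit_card_number</field>
+                </excludes>
+            </fields>
+        </source>
+   </verbatim>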
+
+   The arguments section enables passing in any extra arguments needed for finer control over the underlying implementation --
+   in this case, Sqoop.
+
+ * *Export policy*
+   The export policy, like the import policy, uses the datasource for connecting to the database. The load type specifies
+   whether to insert or only update data in the external table. The fields option behaves the same way as in the import
+   policy. The tableName specified should exist in the external datasource.
+
+---+++ Operation
+   Once the Datasource entity and the Feed entity with import and export policies are defined, users can submit and schedule
+   the import and export operations via the CLI and REST API as below:
+
+   <verbatim>
+
+    ## submit the mysql-db datasource defined in the file mysql-database.xml
+    falcon entity -submit -type datasource -file mysql-database.xml
+
+    ## submit the customer_feed specified in customer_email_feed.xml
+    falcon entity -submit -type feed -file customer_email_feed.xml
+
+    ## schedule the customer_feed
+    falcon entity -schedule -type feed -name customer_feed
+
+   </verbatim>
+
+   Falcon will create the corresponding Oozie bundles with coordinators and workflows for the import and export operations.
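+
+   Once scheduled, the import and export runs can be monitored like any other feed instances. The command below is one
+   way to do such a status check; the time window is illustrative and the exact instance options may vary with the
+   Falcon version.
+
+   <verbatim>
+
+    ## check the status of customer_feed instances for an illustrative time window
+    falcon instance -type feed -name customer_feed -status -start "2015-12-15T00:00Z" -end "2015-12-16T00:00Z"
+
+   </verbatim>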