falcon git commit: FALCON-1562 Documentation for enabling native scheduler in falcon

pallavi Tue, 15 Dec 2015 04:19:35 -0800

Repository: falcon
Updated Branches:
  refs/heads/master 7f4ff1a37 -> 8be383ebe



FALCON-1562 Documentation for enabling native scheduler in falcon


Project: http://git-wip-us.apache.org/repos/asf/falcon/repo
Commit: http://git-wip-us.apache.org/repos/asf/falcon/commit/8be383eb
Tree: http://git-wip-us.apache.org/repos/asf/falcon/tree/8be383eb
Diff: http://git-wip-us.apache.org/repos/asf/falcon/diff/8be383eb

Branch: refs/heads/master
Commit: 8be383ebe43b37deb2f24fcbdb2e6e230c8deed9
Parents: 7f4ff1a
Author: Pallavi Rao <[email protected]>
Authored: Tue Dec 15 17:48:32 2015 +0530
Committer: Pallavi Rao <[email protected]>
Committed: Tue Dec 15 17:48:32 2015 +0530

----------------------------------------------------------------------
 CHANGES.txt                                     |   2 +
 docs/src/site/twiki/Configuration.twiki         |   4 +
 docs/src/site/twiki/FalconDocumentation.twiki   |   2 +
 docs/src/site/twiki/FalconNativeScheduler.twiki | 163 +++++++++++++++++++
 docs/src/site/twiki/falconcli/Schedule.twiki    |   4 +-
 src/conf/startup.properties                     |   8 +-
 6 files changed, 176 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index 86a5c9a..0f773de 100755
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -9,6 +9,8 @@ Trunk (Unreleased)
   INCOMPATIBLE CHANGES
 
   NEW FEATURES
+    FALCON-1562 Documentation for enabling native scheduler in falcon (Pallavi 
Rao)
+
     FALCON-1512 Implement touch feature for native scheduler (Pallavi Rao)
 
     FALCON-1233 Support co-existence of Oozie scheduler (coord) and Falcon 
native scheduler (Pallavi Rao)

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/Configuration.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Configuration.twiki 
b/docs/src/site/twiki/Configuration.twiki
index 74da49a..0df094f 100644
--- a/docs/src/site/twiki/Configuration.twiki
+++ b/docs/src/site/twiki/Configuration.twiki
@@ -100,6 +100,10 @@ Some Falcon features such as late data handling, retries, 
metadata service, depe
    * Copy notification related properties in oozie/conf/oozie-site.xml to 
oozie-site.xml of the Oozie installation.  Restart Oozie so changes get 
reflected.  
 
 *NOTE : If you disable Falcon post-processing JMS notification and not enable 
Oozie JMS notification, features such as failure retry, late data handling and 
metadata service will be disabled for all entities on the server.*
+
+---+++Enabling Falcon Native Scheudler
+You can either choose to schedule entities using Oozie's coordinator or using 
Falcon's native scheduler. To be able to schedule entities natively on Falcon, 
you will need to add some additional properties to 
<verbatim>$FALCON_HOME/conf/startup.properties</verbatim> before starting the 
Falcon Server. For details on the same, refer to 
[[FalconNativeScheduler][Falcon Native Scheduler]]
+
 ---+++Adding Extension Libraries
 
 Library extensions allows users to add custom libraries to entity lifecycles 
such as feed retention, feed replication

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/FalconDocumentation.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki 
b/docs/src/site/twiki/FalconDocumentation.twiki
index f384a42..95f388a 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -37,6 +37,8 @@ Falcon system has picked Oozie as the default scheduler. 
However the system is o
 other schedulers. Lot of the data processing in hadoop requires scheduling to 
be based on both data availability
 as well as time. Oozie currently supports these capabilities off the shelf and 
hence the choice.
 
+While the use of Oozie works reasonably well, there are scenarios where Oozie 
scheduling is proving to be a limiting factor. In its current form, Falcon 
relies on Oozie for both scheduling and for workflow execution, due to which 
the scheduling is limited to time based/cron based scheduling with additional 
gating conditions on data availability. Also, this imposes restrictions on 
datasets being periodic/cyclic in nature. In order to offer better scheduling 
capabilities, Falcon comes with its own native scheduler. Refer to 
[[FalconNativeScheduler][Falcon Native Scheduler]] for details.
+
 ---+++ Control flow
 Though the actual responsibility of the workflow is with the scheduler 
(Oozie), Falcon remains in the
 execution path, by subscribing to messages that each of the workflow may 
generate. When Falcon generates a

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/FalconNativeScheduler.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconNativeScheduler.twiki 
b/docs/src/site/twiki/FalconNativeScheduler.twiki
new file mode 100644
index 0000000..9403ae7
--- /dev/null
+++ b/docs/src/site/twiki/FalconNativeScheduler.twiki
@@ -0,0 +1,163 @@
+---+ Falcon Native Scheduler
+
+---++ Overview
+Falcon has been using Oozie as its scheduling engine.  While the use of Oozie 
works reasonably well, there are scenarios where Oozie scheduling is proving to 
be a limiting factor. In its current form, Falcon relies on Oozie for both 
scheduling and for workflow execution, due to which the scheduling is limited 
to time based/cron based scheduling with additional gating conditions on data 
availability. Also, this imposes restrictions on datasets being periodic in 
nature. In order to offer better scheduling capabilities, Falcon comes with its 
own native scheduler. 
+
+---++ Capabilities
+The native scheduler will offer the capabilities offered by Oozie co-ordinator 
and more. The native scheduler will be built and released over the next few 
releases of Falcon giving users an opportunity to use it and provide feedback.
+
+Currently, the native scheduler offers the following capabilities:
+   1. Submit and schedule a Falcon process that runs periodically (without 
data dependency) - It could be a PIG script, oozie workflow, Hive (all the 
engine types currently supported).
+   1. Monitor/Query/Modify the scheduled process - All applicable entity APIs 
and instance APIs should work as it does now.  Falcon provides data management 
functions for feeds declaratively. It allows users to represent feed locations 
as time-based partition directories on HDFS containing files.
+
+*NOTE: Execution order is FIFO. LIFO and LAST_ONLY are not supported yet.*
+
+In the near future, Falcon scheduler will provide feature parity with Oozie 
scheduler and in subsequent releases will provide the following features:
+   * Periodic, cron-based, calendar-based scheduling.
+   * Data availability based scheduling.
+   * External trigger/notification based scheduling.
+   * Support for periodic/a-periodic datasets.
+   * Support for optional/mandatory datasets. Option to specify 
minumum/maximum/exactly-N instances of data to consume.
+   * Handle dependencies across entities during re-run.
+
+---++ Configuring Native Scheduler
+You can enable native scheduler by making changes to 
__$FALCON_HOME/conf/startup.properties__ as follows. You will need to restart 
Falcon Server for the changes to take effect.
+<verbatim>
+*.dag.engine.impl=org.apache.falcon.workflow.engine.OozieDAGEngine
+*.application.services=org.apache.falcon.security.AuthenticationInitializationService,\
+                        
org.apache.falcon.workflow.WorkflowJobEndNotificationService, \
+                        org.apache.falcon.service.ProcessSubscriberService,\
+                        org.apache.falcon.service.FeedSLAMonitoringService,\
+                        org.apache.falcon.service.LifecyclePolicyMap,\
+                        
org.apache.falcon.state.store.service.FalconJPAService,\
+                        org.apache.falcon.entity.store.ConfigurationStore,\
+                        org.apache.falcon.rerun.service.RetryService,\
+                        org.apache.falcon.rerun.service.LateRunService,\
+                        org.apache.falcon.metadata.MetadataMappingService,\
+                        org.apache.falcon.service.LogCleanupService,\
+                        org.apache.falcon.service.GroupsService,\
+                        org.apache.falcon.service.ProxyUserService,\
+                        
org.apache.falcon.notification.service.impl.JobCompletionService,\
+                        
org.apache.falcon.notification.service.impl.SchedulerService,\
+                        
org.apache.falcon.notification.service.impl.AlarmService,\
+                        
org.apache.falcon.notification.service.impl.DataAvailabilityService,\
+                        org.apache.falcon.execution.FalconExecutionService
+</verbatim>
+
+---+++ Making the Native Scheduler the default scheduler
+To ensure backward compatibility, even when the native scheduler is enabled, 
the default scheduler is still Oozie. This means users will be scheduling 
entities on Oozie scheduler, by default. They will need to explicitly specify 
the scheduler as native, if they wish to schedule entities using native 
scheduler. 
+
+<a href="#Scheduling_new_entities_on_Native_Scheduler">This section</a> has 
more details on how to schedule on either of the schedulers. 
+
+If you wish to make the Falcon Native Scheduler your default scheduler and 
remove Oozie as the scheduler, set the following property in 
__$FALCON_HOME/conf/startup.properties__
+<verbatim>
+## If you wish to use Falcon native scheduler as your default scheduler, set 
the workflow engine to FalconWorkflowEngine instead of OozieWorkflowEngine. ##
+*.workflow.engine.impl=org.apache.falcon.workflow.engine.FalconWorkflowEngine
+</verbatim>
+
+---+++ Configuring the state store for Native Scheduler
+
+Falcon Server needs to maintain state of the entities and instances in a 
persistent store for the system to be recoverable. Since Prism only federates, 
it does not need to maintain any state information. Following properties need 
to be set in startup.properties of Falcon Servers:
+<verbatim>
+######### StateStore Properties #####
+*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
+*.falcon.statestore.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
+*.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db
+*.falcon.statestore.jdbc.username=sa
+*.falcon.statestore.jdbc.password=
+*.falcon.statestore.connection.data.source=org.apache.commons.dbcp.BasicDataSource
+# Maximum number of active connections that can be allocated from this pool at 
the same time.
+*.falcon.statestore.pool.max.active.conn=10
+*.falcon.statestore.connection.properties=
+# Indicates the interval (in milliseconds) between eviction runs.
+*.falcon.statestore.validate.db.connection.eviction.interval=300000
+## The number of objects to examine during each run of the idle object evictor 
thread.
+*.falcon.statestore.validate.db.connection.eviction.num=10
+## Creates Falcon DB.
+## If set to true, it creates the DB schema if it does not exist. If the DB 
schema exists is a NOP.
+## If set to false, it does not create the DB schema. If the DB schema does 
not exist it fails start up.
+*.falcon.statestore.create.db.schema=true
+</verbatim> 
+
+The _*.falcon.statestore.jdbc.url_ property in startup.properties determines 
the DB and data location. All other properties are common across RDBMS.
+
+*NOTE : Although multiple Falcon Servers can share a DB (not applicable for 
Derby DB), it is recommended that you have different DBs for different Falcon 
Servers for better performance.*
+
+You will need to create the state DB and tables before starting the Falcon 
Server. To create tables, a tool comes bundled with the Falcon installation. 
You can use the _falcon-db.sh_ script to create tables in the DB. The script 
needs to be run only for Falcon Servers and can be run by any user that has 
execute permission on the script. The script picks up the DB connection details 
from __$FALCON_HOME/conf/startup.properties__. Ensure that you have granted the 
right privileges to the user mentioned in _startup.properties_, so the tables 
can be created.  
+
+You can use the help command to get details on the sub-commands supported:
+<verbatim>
+./bin/falcon-db.sh help
+Hadoop home is set, adding libraries from 
'/Users/pallavi.rao/falcon/hadoop-2.6.0/bin/hadoop classpath' into falcon 
classpath
+usage: 
+      Falcon DB initialization tool currently supports Derby DB/ Mysql
+
+      falcondb help : Display usage for all commands or specified command
+
+      falcondb version : Show Falcon DB version information
+
+      falcondb create <OPTIONS> : Create Falcon DB schema
+                      -run             Confirmation option regarding DB schema 
creation/upgrade
+                      -sqlfile <arg>   Generate SQL script instead of 
creating/upgrading the DB
+                                       schema
+
+      falcondb upgrade <OPTIONS> : Upgrade Falcon DB schema
+                       -run             Confirmation option regarding DB 
schema creation/upgrade
+                       -sqlfile <arg>   Generate SQL script instead of 
creating/upgrading the DB
+                                        schema
+
+</verbatim>
+Currently, MySQL and Derby are supported as state stores. We may extend 
support to other DBs in the future. Falcon has been tested against MySQL v5.5.
+
+---++++ Using Derby as the State Store
+Using Derby is ideal for QA and staging setup. Falcon comes bundled with a 
Derby connector and no explicit setup is required (although you can set it up) 
in terms creating the DB or tables.
+For example,
+ <verbatim> *.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db;create=true 
</verbatim>
+
+ tells Falcon to use the Derby JDBC connector, with data directory, 
$FALCON_HOME/data/ and DB name 'falcon'. If _create=true_ is specified, you 
will not need to create a DB up front; a database will be created if it does 
not exist.
+
+---++++ Using MySQL as the State Store
+The jdbc.url property in startup.properties determines the DB and data 
location.
+For example,
+ <verbatim> *.falcon.statestore.jdbc.url=jdbc:mysql://localhost:3306/falcon 
</verbatim>
+
+ tells Falcon to use the MySQL JDBC connector, which is accessible 
@localhost:3306, with DB name 'falcon'.
+
+---++ Scheduling new entities on Native Scheduler
+To schedule an entity (currently only process is supported) using the native 
scheduler, you need to specify the scheduler in the schedule command as shown 
below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule 
-properties falcon.scheduler:native
+</verbatim>
+
+If Oozie is configured as the default scheduler, you can skip the scheduler 
option or explicitly set it to _oozie_, as shown below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule
+OR
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule 
-properties falcon.scheduler:oozie
+</verbatim>
+
+If the native scheduler is configured as the default scheduler, then, you can 
omit the scheduler option, as shown below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule 
+</verbatim>
+
+---++ Migrating entities from Oozie Scheduler to Native Scheduler
+Currently, user will have to delete and re-create entities in order to move 
across schedulers. Attempting to schedule an already scheduled entity on a 
different scheduler will result in an error. Note that the history of instances 
prior to scheduling on native scheduler will not be available via the instance 
APIs. However, user can retrieve that information using metadata APIs. Native 
scheduler must be enabled before migrating entities to native scheduler.
+
+<a href="#Configuring_Native_Scheduler">Configuring Native Scheduler</a> has 
more details on how to enable native scheduler.
+
+---+++ Migrating from Oozie to Native Scheduler
+   * Delete the entity (process). 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -name <process name> 
-delete </verbatim>
+   * Submit the entity (process) with start time from where the Oozie 
scheduler left off. 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -submit <path to 
process xml> </verbatim>
+   * Schedule the entity on native scheduler. 
+<verbatim> $FALCON_HOME/bin/falcon entity -type process -name <process name> 
-schedule -properties falcon.scheduler:native </verbatim>
+
+---+++ Reverting to Oozie from Native Scheduler
+   * Delete the entity (process). 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -name <process name> 
-delete </verbatim>
+   * Submit the entity (process) with start time from where the Native 
scheduler left off. 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -submit <path to 
process xml> </verbatim>
+   * Schedule the entity on the default scheduler (Oozie).
+ <verbatim> $FALCON_HOME/bin/falcon entity -type process -name <process name> 
-schedule </verbatim>

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/falconcli/Schedule.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/falconcli/Schedule.twiki 
b/docs/src/site/twiki/falconcli/Schedule.twiki
index 63aa9c1..c4422e7 100644
--- a/docs/src/site/twiki/falconcli/Schedule.twiki
+++ b/docs/src/site/twiki/falconcli/Schedule.twiki
@@ -13,10 +13,10 @@ Optional Args :
 
 -doAs <username>
 
--properties <<key1:val1,...,keyN:valN>>. Specifying 'falcon.scheduler:native' 
as a property will schedule the entity on the the native scheduler of Falcon. 
Else, it will default to the engine specified in startup.properties.
+-properties <<key1:val1,...,keyN:valN>>. Specifying 'falcon.scheduler:native' 
as a property will schedule the entity on the the native scheduler of Falcon. 
Else, it will default to the engine specified in startup.properties. For 
details on Native scheduler, refer to [[FalconNativeScheduler][Falcon Native 
Scheduler]]
 
 Examples:
 
  $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule
 
- $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule 
-properties falcon.scheduler:native
\ No newline at end of file
+ $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule 
-properties falcon.scheduler:native

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/src/conf/startup.properties
----------------------------------------------------------------------
diff --git a/src/conf/startup.properties b/src/conf/startup.properties
index 95d792b..ef0a2d5 100644
--- a/src/conf/startup.properties
+++ b/src/conf/startup.properties
@@ -51,10 +51,7 @@
                         org.apache.falcon.service.LogCleanupService,\
                         org.apache.falcon.service.GroupsService,\
                         org.apache.falcon.service.ProxyUserService
-## If you wish to use Falcon native scheduler uncomment out below  application 
services and
-# comment out above application
-#
-#services ##
+## If you wish to use Falcon native scheduler uncomment out below  application 
services and comment out above application services ##
 
#*.application.services=org.apache.falcon.security.AuthenticationInitializationService,\
 #                        
org.apache.falcon.workflow.WorkflowJobEndNotificationService, \
 #                        org.apache.falcon.service.ProcessSubscriberService,\
@@ -282,12 +279,13 @@ 
prism.configstore.listeners=org.apache.falcon.entity.v0.EntityGraph,\
 #*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
 #*.falcon.statestore.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
 ## Falcon currently supports derby and mysql, change url based on DB.
-#*.falcon.statestore.jdbc.url=jdbc:derby:data/statestore.db;create=true
+#*.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db;create=true
 #*.falcon.statestore.jdbc.username=sa
 #*.falcon.statestore.jdbc.password=
 
#*.falcon.statestore.connection.data.source=org.apache.commons.dbcp.BasicDataSource
 ## Maximum number of active connections that can be allocated from this pool 
at the same time.
 #*.falcon.statestore.pool.max.active.conn=10
+## Any additional connection properties that need to be used, specified as 
comma separated key=value pairs.
 #*.falcon.statestore.connection.properties=
 ## Indicates the interval (in milliseconds) between eviction runs.
 #*.falcon.statestore.validate.db.connection.eviction.interval=300000

falcon git commit: FALCON-1562 Documentation for enabling native scheduler in falcon

Reply via email to