http://git-wip-us.apache.org/repos/asf/oozie/blob/6a6f2199/docs/src/site/twiki/WorkflowFunctionalSpec.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/WorkflowFunctionalSpec.twiki b/docs/src/site/twiki/WorkflowFunctionalSpec.twiki deleted file mode 100644 index 463635b..0000000 --- a/docs/src/site/twiki/WorkflowFunctionalSpec.twiki +++ /dev/null @@ -1,5457 +0,0 @@ - - -[::Go back to Oozie Documentation Index::](index.html) - ------ - -# Oozie Specification, a Hadoop Workflow System -**<center>(v5.0)</center>** - -The goal of this document is to define a workflow engine system specialized in coordinating the execution of Hadoop -Map/Reduce and Pig jobs. - - -<!-- MACRO{toc|fromDepth=1|toDepth=4} --> - -## Changelog - - -**2016FEB19** - - * 3.2.7 Updated notes on System.exit(int n) behavior - -**2015APR29** - - * 3.2.1.4 Added notes about Java action retries - * 3.2.7 Added notes about Java action retries - -**2014MAY08** - - * 3.2.2.4 Added support for fully qualified job-xml path - -**2013JUL03** - - * Appendix A, Added new workflow schema 0.5 and SLA schema 0.2 - -**2012AUG30** - - * 4.2.2 Added two EL functions (replaceAll and appendAll) - -**2012JUL26** - - * Appendix A, updated XML schema 0.4 to include `parameters` element - * 4.1 Updated to mention about `parameters` element as of schema 0.4 - -**2012JUL23** - - * Appendix A, updated XML schema 0.4 (Fs action) - * 3.2.4 Updated to mention that a `name-node`, a `job-xml`, and a `configuration` element are allowed in the Fs action as of -schema 0.4 - -**2012JUN19** - - * Appendix A, added XML schema 0.4 - * 3.2.2.4 Updated to mention that multiple `job-xml` elements are allowed as of schema 0.4 - * 3.2.3 Updated to mention that multiple `job-xml` elements are allowed as of schema 0.4 - -**2011AUG17** - - * 3.2.4 fs 'chmod' xml closing element typo in Example corrected - -**2011AUG12** - - * 3.2.4 fs 'move' action characteristics updated, to allow for 
consistent source and target paths and existing target path only if directory - * 18, Update the doc for user-retry of workflow action. - -**2011FEB19** - - * 10, Update the doc to rerun from the failed node. - -**2010OCT31** - - * 17, Added new section on Shared Libraries - -**2010APR27** - - * 3.2.3 Added new "arguments" tag to PIG actions - * 3.2.5 SSH actions are deprecated in Oozie schema 0.1 and removed in Oozie schema 0.2 - * Appendix A, Added schema version 0.2 - - -**2009OCT20** - - * Appendix A, updated XML schema - - -**2009SEP15** - - * 3.2.6 Removing support for sub-workflow in a different Oozie instance (removing the 'oozie' element) - - -**2009SEP07** - - * 3.2.2.3 Added Map Reduce Pipes specifications. - * 3.2.2.4 Map-Reduce Examples. Previously was 3.2.2.3. - - -**2009SEP02** - - * 10 Added missing skip nodes property name. - * 3.2.1.4 Reworded action recovery explanation. - - -**2009AUG26** - - * 3.2.9 Added `java` action type - * 3.1.4 Example uses EL constant to refer to counter group/name - - -**2009JUN09** - - * 12.2.4 Added build version resource to admin end-point - * 3.2.6 Added flag to propagate workflow configuration to sub-workflows - * 10 Added behavior for workflow job parameters given in the rerun - * 11.3.4 workflows info returns pagination information - - -**2009MAY18** - - * 3.1.4 decision node, 'default' element, 'name' attribute changed to 'to' - * 3.1.5 fork node, 'transition' element changed to 'start', 'to' attribute change to 'path' - * 3.1.5 join node, 'transition' element remove, added 'to' attribute to 'join' element - * 3.2.1.4 Rewording on action recovery section - * 3.2.2 map-reduce action, added 'job-tracker', 'name-node' actions, 'file', 'file' and 'archive' elements - * 3.2.2.1 map-reduce action, remove from 'streaming' element 'file', 'file' and 'archive' elements - * 3.2.2.2 map-reduce action, reorganized streaming section - * 3.2.3 pig action, removed information about implementation (SSH), changed elements names 
- * 3.2.4 fs action, removed 'fs-uri' and 'user-name' elements, file system URI is now specified in path, user is propagated - * 3.2.6 sub-workflow action, renamed elements 'oozie-url' to 'oozie' and 'workflow-app' to 'app-path' - * 4 Properties that are valid Java identifiers can be used as ${NAME} - * 4.1 Renamed default properties file from 'configuration.xml' to 'default-configuration.xml' - * 4.2 Changes in EL Constants and Functions - * 5 Updated notification behavior and tokens - * 6 Changed user propagation behavior - * 7 Changed application packaging from ZIP to HDFS directory - * Removed application lifecycle and self containment model sections - * 10 Changed workflow job recovery, simplified recovery behavior - * 11 Detailed Web Services API - * 12 Updated Client API section - * 15 Updated Action Executor API section - * Appendix A XML namespace updated to 'uri:oozie:workflow:0.1' - * Appendix A Updated XML schema to changes in map-reduce/pig/fs/ssh actions - * Appendix B Updated workflow example to schema changes - - -**2009MAR25** - - * Changing all references of HWS to Oozie (project name) - * Typos, XML Formatting - * XML Schema URI correction - - -**2009MAR09** - - * Changed `CREATED` job state to `PREP` to have same states as Hadoop - * Renamed 'hadoop-workflow' element to 'workflow-app' - * Decision syntax changed to be 'switch/case' with no transition indirection - * Action nodes common root element 'action', with the action type as sub-element (using a single built-in XML schema) - * Action nodes have 2 explicit transitions 'ok to' and 'error to' enforced by XML schema - * Renamed 'fail' action element to 'kill' - * Renamed 'hadoop' action element to 'map-reduce' - * Renamed 'hdfs' action element to 'fs' - * Updated all XML snippets and examples - * Made user propagation simpler and consistent - * Added Oozie XML schema to Appendix A - * Added workflow example to Appendix B - - -**2009FEB22** - - * Opened [JIRA 
HADOOP-5303](https://issues.apache.org/jira/browse/HADOOP-5303)


**2012DEC27**

 * Added information on dropping hcatalog table partitions in prepare block
 * Added hcatalog EL functions section

## 0 Definitions

**Action:** An execution/computation task (Map-Reduce job, Pig job, a shell command). It can also be referred to as a
task or an 'action node'.

**Workflow:** A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). "control dependency"
from one action to another means that the second action can't run until the first action has completed.

**Workflow Definition:** A programmatic description of a workflow that can be executed.

**Workflow Definition Language:** The language used to define a Workflow Definition.

**Workflow Job:** An executable instance of a workflow definition.

**Workflow Engine:** A system that executes workflow jobs. It can also be referred to as a DAG engine.

## 1 Specification Highlights

A Workflow application is a DAG that coordinates the following types of actions: Hadoop, Pig, and
sub-workflows.

Flow control operations within the workflow applications can be done using decision, fork and join nodes. Cycles in
workflows are not supported.

Actions and decisions can be parameterized with job properties, action output (i.e. Hadoop counters) and file
information (file exists, file size, etc). Formal parameters are expressed in the workflow
definition as `${VAR}` variables.

A Workflow application is a ZIP file that contains the workflow definition (an XML file) and all the necessary files to
run all the actions: JAR files for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Pig
scripts, and other resource files.

Before running a workflow job, the corresponding workflow application must be deployed in Oozie.

Deploying a workflow application and running workflow jobs can be done via command line tools, a WS API and a Java API.
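As a sketch of the command-line route mentioned above (the host, application path, user name, and the `firstJobReducers` parameter are illustrative; `oozie.wf.application.path` and `user.name` are the standard submission properties):


```
# job.properties -- resolves the workflow's formal parameters
user.name=tucu
oozie.wf.application.path=hdfs://bar:8020/user/tucu/my-wf-app
firstJobReducers=10

# submit and start the workflow job:
#   oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```

The `oozie job -run` command submits the job and immediately starts it; `-submit` would leave it in `PREP` state instead.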
Monitoring the system and workflow jobs can be done via a web console, command line tools, a WS API and a Java API.

When submitting a workflow job, a set of properties resolving all the formal parameters in the workflow definitions
must be provided. This set of properties is a Hadoop configuration.

Possible states for a workflow job are: `PREP`, `RUNNING`, `SUSPENDED`, `SUCCEEDED`, `KILLED` and `FAILED`.

In the case of an action start failure in a workflow job, depending on the type of failure, Oozie will attempt
automatic retries, request a manual retry, or fail the workflow job.

Oozie can make HTTP callback notifications on action start/end/failure events and workflow end/failure events.

In the case of workflow job failure, the workflow job can be resubmitted, skipping previously completed actions.
Before doing a resubmission the workflow application could be updated with a patch to fix a problem in the workflow
application code.

<a name="WorkflowDefinition"></a>
## 2 Workflow Definition

A workflow definition is a DAG with control flow nodes (start, end, decision, fork, join, kill) and action nodes
(map-reduce, pig, etc.). Nodes are connected by transition arrows.

The workflow definition language is XML based and it is called hPDL (Hadoop Process Definition Language).

Refer to Appendix A for the [Oozie Workflow Definition XML Schema](WorkflowFunctionalSpec.html#OozieWFSchema). Appendix
B has [Workflow Definition Examples](WorkflowFunctionalSpec.html#OozieWFExamples).

### 2.1 Cycles in Workflow Definitions

Oozie does not support cycles in workflow definitions; workflow definitions must be a strict DAG.

At workflow application deployment time, if Oozie detects a cycle in the workflow definition it must fail the
deployment.
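To make the pieces concrete, a minimal hPDL definition tying a start node, one action, a kill node and an end node together might look like this (node names, the `fs` action and its path are illustrative):


```
<workflow-app name="minimal-wf" xmlns="uri:oozie:workflow:1.0">
    <start to="myAction"/>
    <action name="myAction">
        <fs>
            <mkdir path="hdfs://bar:8020/user/tucu/scratch"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The individual node types used here are specified in section 3.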
## 3 Workflow Nodes

Workflow nodes are classified into control flow nodes and action nodes:

 * **Control flow nodes:** nodes that control the start and end of the workflow and workflow job execution path.
 * **Action nodes:** nodes that trigger the execution of a computation/processing task.

Node names and transitions must conform to the following pattern `[a-zA-Z][\-_a-zA-Z0-9]*`, of up to 20 characters
long.

### 3.1 Control Flow Nodes

Control flow nodes define the beginning and the end of a workflow (the `start`, `end` and `kill` nodes) and provide a
mechanism to control the workflow execution path (the `decision`, `fork` and `join` nodes).

<a name="StartNode"></a>
#### 3.1.1 Start Control Node

The `start` node is the entry point for a workflow job; it indicates the first workflow node the workflow job must
transition to.

When a workflow is started, it automatically transitions to the node specified in the `start`.

A workflow definition must have one `start` node.

**Syntax:**


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <start to="[NODE-NAME]"/>
    ...
</workflow-app>
```

The `to` attribute is the name of the first workflow node to execute.

**Example:**


```
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
    ...
    <start to="firstHadoopJob"/>
    ...
</workflow-app>
```

<a name="EndNode"></a>
#### 3.1.2 End Control Node

The `end` node is the end of a workflow job; it indicates that the workflow job has completed successfully.

When a workflow job reaches the `end` node it finishes successfully (SUCCEEDED).

If one or more actions started by the workflow job are executing when the `end` node is reached, the actions will be
killed. In this scenario the workflow job is still considered as successfully run.

A workflow definition must have one `end` node.

**Syntax:**


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <end name="[NODE-NAME]"/>
    ...
</workflow-app>
```

The `name` attribute is the name of the `end` node; transitioning to this node ends the workflow job.

**Example:**


```
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
    ...
    <end name="end"/>
</workflow-app>
```

<a name="KillNode"></a>
#### 3.1.3 Kill Control Node

The `kill` node allows a workflow job to kill itself.

When a workflow job reaches the `kill` node it finishes in error (KILLED).

If one or more actions started by the workflow job are executing when the `kill` node is reached, the actions will be
killed.

A workflow definition may have zero or more `kill` nodes.

**Syntax:**


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <kill name="[NODE-NAME]">
        <message>[MESSAGE-TO-LOG]</message>
    </kill>
    ...
</workflow-app>
```

The `name` attribute in the `kill` node is the name of the Kill action node.

The content of the `message` element will be logged as the kill reason for the workflow job.

A `kill` node does not have transition elements because it ends the workflow job, as `KILLED`.

**Example:**


```
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
    ...
    <kill name="killBecauseNoInput">
        <message>Input unavailable</message>
    </kill>
    ...
</workflow-app>
```

<a name="DecisionNode"></a>
#### 3.1.4 Decision Control Node

A `decision` node enables a workflow to make a selection on the execution path to follow.

The behavior of a `decision` node can be seen as a switch-case statement.

A `decision` node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated
in order of appearance until one of them evaluates to `true` and the corresponding transition is taken. If none of the
predicates evaluates to `true`, the `default` transition is taken.
Predicates are JSP Expression Language (EL) expressions (refer to section 4.2 of this document) that resolve into a
boolean value, `true` or `false`. For example:


```
    ${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}
```

**Syntax:**


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <decision name="[NODE-NAME]">
        <switch>
            <case to="[NODE_NAME]">[PREDICATE]</case>
            ...
            <case to="[NODE_NAME]">[PREDICATE]</case>
            <default to="[NODE_NAME]"/>
        </switch>
    </decision>
    ...
</workflow-app>
```

The `name` attribute in the `decision` node is the name of the decision node.

Each `case` element contains a predicate and a transition name. The predicate ELs are evaluated
in order until one returns `true` and the corresponding transition is taken.

The `default` element indicates the transition to take if none of the predicates evaluates
to `true`.

All decision nodes must have a `default` element to avoid bringing the workflow into an error
state if none of the predicates evaluates to `true`.

**Example:**


```
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
    ...
    <decision name="mydecision">
        <switch>
            <case to="reconsolidatejob">
                ${fs:fileSize(secondjobOutputDir) gt 10 * GB}
            </case>
            <case to="rexpandjob">
                ${fs:fileSize(secondjobOutputDir) lt 100 * MB}
            </case>
            <case to="recomputejob">
                ${ hadoop:counters('secondjob')[RECORDS][REDUCE_OUT] lt 1000000 }
            </case>
            <default to="end"/>
        </switch>
    </decision>
    ...
</workflow-app>
```

<a name="ForkJoinNodes"></a>
#### 3.1.5 Fork and Join Control Nodes

A `fork` node splits one path of execution into multiple concurrent paths of execution.

A `join` node waits until every concurrent execution path of a previous `fork` node arrives at it.

The `fork` and `join` nodes must be used in pairs. The `join` node assumes concurrent execution paths are children of
the same `fork` node.
**Syntax:**


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <fork name="[FORK-NODE-NAME]">
        <path start="[NODE-NAME]" />
        ...
        <path start="[NODE-NAME]" />
    </fork>
    ...
    <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]" />
    ...
</workflow-app>
```

The `name` attribute in the `fork` node is the name of the workflow fork node. The `start` attribute in the `path`
elements in the `fork` node indicates the name of the workflow node that will be part of the concurrent execution paths.

The `name` attribute in the `join` node is the name of the workflow join node. The `to` attribute in the `join` node
indicates the name of the workflow node that will be executed after all concurrent execution paths of the corresponding
fork arrive at the join node.

**Example:**


```
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
    ...
    <fork name="forking">
        <path start="firstparalleljob"/>
        <path start="secondparalleljob"/>
    </fork>
    <action name="firstparalleljob">
        <map-reduce>
            <resource-manager>foo:8032</resource-manager>
            <name-node>bar:8020</name-node>
            <job-xml>job1.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <action name="secondparalleljob">
        <map-reduce>
            <resource-manager>foo:8032</resource-manager>
            <name-node>bar:8020</name-node>
            <job-xml>job2.xml</job-xml>
        </map-reduce>
        <ok to="joining"/>
        <error to="kill"/>
    </action>
    <join name="joining" to="nextaction"/>
    ...
</workflow-app>
```

By default, Oozie performs some validation that any forking in a workflow is valid and won't lead to any incorrect behavior or
instability. However, if Oozie is preventing a workflow from being submitted and you are very certain that it should work, you can
disable forkjoin validation so that Oozie will accept the workflow.
To disable this validation just for a specific workflow, set `oozie.wf.validate.ForkJoin` to `false` in the
job.properties file. To disable this validation for all workflows, set `oozie.validate.ForkJoin` to `false` in the
oozie-site.xml file. Validation is performed only when both of these properties are `true` (their logical AND): it is
disabled if either or both are set to `false`, and enabled only if both are set to `true` (or not specified).

<a name="ActionNodes"></a>
### 3.2 Workflow Action Nodes

Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.

#### 3.2.1 Action Basis

The following sub-sections define common behavior and capabilities for all action types.

##### 3.2.1.1 Action Computation/Processing Is Always Remote

All computation/processing tasks triggered by an action node are remote to Oozie. No workflow application specific
computation/processing task is executed within Oozie.

##### 3.2.1.2 Actions Are Asynchronous

All computation/processing tasks triggered by an action node are executed asynchronously by Oozie. For most types of
computation/processing tasks triggered by a workflow action, the workflow job has to wait until the
computation/processing task completes before transitioning to the following node in the workflow.

The exception is the `fs` action, which is handled as a synchronous action.

Oozie can detect completion of computation/processing tasks by two different means, callbacks and polling.

When a computation/processing task is started by Oozie, Oozie provides a unique callback URL to the task; the task
should invoke the given URL to notify its completion.

In cases where the task fails to invoke the callback URL for any reason (e.g. a transient network failure) or when
the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll computation/processing
tasks for completion.
##### 3.2.1.3 Actions Have 2 Transitions, `ok` and `error`

If a computation/processing task (triggered by a workflow) completes successfully, it transitions to `ok`.

If a computation/processing task (triggered by a workflow) fails to complete successfully, it transitions to `error`.

If a computation/processing task exits in error, the computation/processing task must provide `error-code` and
`error-message` information to Oozie. This information can be used from `decision` nodes to implement fine-grained
error handling at the workflow application level.

Each action type must clearly define all the error codes it can produce.

##### 3.2.1.4 Action Recovery

Oozie provides recovery capabilities when starting or ending actions.

Once an action starts successfully, Oozie will not retry starting the action if the action fails during its execution.
The assumption is that the external system (i.e. Hadoop) executing the action has enough resilience to recover jobs
once they have started (i.e. Hadoop task retries).

Java actions are a special case with regard to retries. Although Oozie itself does not retry Java actions
should they fail after they have successfully started, Hadoop itself can cause the action to be restarted due to a
map task retry on the map task running the Java application. See the Java Action section below for more detail.

For failures that occur prior to the start of the job, Oozie will have different recovery strategies depending on the
nature of the failure.

If the failure is of a transient nature, Oozie will perform retries after a pre-defined time interval. The number of
retries and the time interval for a type of action must be pre-configured at the Oozie level. Workflow jobs can
override such configuration.

Examples of transient failures are network problems or a remote system being temporarily unavailable.
If the failure is of a non-transient nature, Oozie will suspend the workflow job until a manual or programmatic
intervention resumes the workflow job and the action start or end is retried. It is the responsibility of an
administrator or an external managing system to perform any necessary cleanup before resuming the workflow job.

If the failure is an error and a retry will not resolve the problem, Oozie will perform the error transition for the
action.

<a name="MapReduceAction"></a>
#### 3.2.2 Map-Reduce Action

The `map-reduce` action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or
streaming jobs.

A `map-reduce` action can be configured to perform file system cleanup and directory creation before starting the
map/reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop
checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry
without cleanup of the job output directory would fail).

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the
workflow execution path.

The counters of the Hadoop job and the job exit status (`FAILED`, `KILLED` or `SUCCEEDED`) must be available to the
workflow job after the Hadoop job ends. This information can be used from within decision nodes and other actions
configurations.

The `map-reduce` action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop
map/reduce job.

Hadoop JobConf properties can be specified as part of

 * the `config-default.xml` or
 * JobConf XML file bundled with the workflow application or
 * \<global\> tag in workflow definition or
 * Inline `map-reduce` action configuration or
 * An implementation of OozieActionConfigurator specified by the \<config-class\> tag in workflow definition.
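One of the sources above, the \<global\> element, lets common JobConf properties be declared once for all actions in the workflow rather than repeated per action. A sketch (the queue property and its value are illustrative):


```
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>default</value>
            </property>
        </configuration>
    </global>
    ...
</workflow-app>
```

Per-action `configuration` values still take precedence over the global ones for that action.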
The configuration properties are loaded in the following order: `streaming`, `job-xml`, `configuration`,
and `config-class`; the precedence order is that later values override earlier values.

Streaming and inline property values can be parameterized (templatized) using EL expressions.

The Hadoop `mapred.job.tracker` and `fs.default.name` properties must not be present in the job-xml and inline
configuration.

<a name="FilesArchives"></a>
##### 3.2.2.1 Adding Files and Archives for the Job

The `file` and `archive` elements make files and archives available to map-reduce jobs. If the specified path is
relative, it is assumed the file or archive is within the application directory, in the corresponding sub-path.
If the path is absolute, the file or archive is expected to be at the given absolute path.

Files specified with the `file` element will be symbolic links in the home directory of the task.

If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running
directory, thus available to the task JVM.

To force a symlink for a file on the task running directory, use a '#' followed by the symlink name. For example
'mycat.sh#cat'.

Refer to the Hadoop distributed cache documentation for more details on files and archives.

##### 3.2.2.2 Configuring the MapReduce action with Java code

Java code can be used to further configure the MapReduce action. This can be useful if you already have "driver" code for your
MapReduce action, if you're more familiar with MapReduce's Java API, if there's some configuration that requires logic, or some
configuration that's difficult to do in straight XML (e.g. Avro).

Create a class that implements the org.apache.oozie.action.hadoop.OozieActionConfigurator interface from the "oozie-sharelib-oozie"
artifact. It contains a single method that receives a `JobConf` as an argument.
Any configuration properties set on this `JobConf` -will be used by the MapReduce action. - -The OozieActionConfigurator has this signature: - -``` -public interface OozieActionConfigurator { - public void configure(JobConf actionConf) throws OozieActionConfiguratorException; -} -``` -where `actionConf` is the `JobConf` you can update. If you need to throw an Exception, you can wrap it in -an `OozieActionConfiguratorException`, also in the "oozie-sharelib-oozie" artifact. - -For example: - -``` -package com.example; - -import org.apache.hadoop.fs.Path; -import org.apache.hadoop.mapred.FileInputFormat; -import org.apache.hadoop.mapred.FileOutputFormat; -import org.apache.hadoop.mapred.JobConf; -import org.apache.oozie.action.hadoop.OozieActionConfigurator; -import org.apache.oozie.action.hadoop.OozieActionConfiguratorException; -import org.apache.oozie.example.SampleMapper; -import org.apache.oozie.example.SampleReducer; - -public class MyConfigClass implements OozieActionConfigurator { - - @Override - public void configure(JobConf actionConf) throws OozieActionConfiguratorException { - if (actionConf.getUser() == null) { - throw new OozieActionConfiguratorException("No user set"); - } - actionConf.setMapperClass(SampleMapper.class); - actionConf.setReducerClass(SampleReducer.class); - FileInputFormat.setInputPaths(actionConf, new Path("/user/" + actionConf.getUser() + "/input-data")); - FileOutputFormat.setOutputPath(actionConf, new Path("/user/" + actionConf.getUser() + "/output")); - ... - } -} -``` - -To use your config class in your MapReduce action, simply compile it into a jar, make the jar available to your action, and specify -the class name in the `config-class` element (this requires at least schema 0.5): - -``` -<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0"> - ... - <action name="[NODE-NAME]"> - <map-reduce> - ... 
        <job-xml>[JOB-XML-FILE]</job-xml>
        <configuration>
            <property>
                <name>[PROPERTY-NAME]</name>
                <value>[PROPERTY-VALUE]</value>
            </property>
            ...
        </configuration>
        <config-class>com.example.MyConfigClass</config-class>
        ...
        </map-reduce>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
```

Another example of this can be found in the "map-reduce" example that comes with Oozie.

A useful tip: The initial `JobConf` passed to the `configure` method includes all of the properties listed in the `configuration`
section of the MR action in a workflow. If you need to pass any information to your OozieActionConfigurator, you can simply put
it there.

<a name="StreamingMapReduceAction"></a>
##### 3.2.2.3 Streaming

Streaming information can be specified in the `streaming` element.

The `mapper` and `reducer` elements are used to specify the executable/script to be used as mapper and reducer.

User defined scripts must be bundled with the workflow application and they must be declared in the `files` element of
the streaming configuration. If they are not declared in the `files` element of the configuration, it is assumed they
will be available (and in the command PATH) on the Hadoop slave machines.

Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using
the `file` and `archive` elements described in the previous section.

The Mapper/Reducer can be overridden by the `mapred.mapper.class` or `mapred.reducer.class` properties in the `job-xml`
file or `configuration` elements.

<a name="PipesMapReduceAction"></a>
##### 3.2.2.4 Pipes

Pipes information can be specified in the `pipes` element.

A subset of the command line options which can be used while using the Hadoop Pipes Submitter can be specified
via the elements `map`, `reduce`, `inputformat`, `partitioner`, `writer`, and `program`.
The `program` element is used to specify the executable/script to be used.

User defined programs must be bundled with the workflow application.

Some pipes jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using
the `file` and `archive` elements described in the previous section.

Pipes properties can be overridden by specifying them in the `job-xml` file or `configuration` element.

##### 3.2.2.5 Syntax


```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <action name="[NODE-NAME]">
        <map-reduce>
            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <streaming>
                <mapper>[MAPPER-PROCESS]</mapper>
                <reducer>[REDUCER-PROCESS]</reducer>
                <record-reader>[RECORD-READER-CLASS]</record-reader>
                <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
                ...
                <env>[NAME=VALUE]</env>
                ...
            </streaming>
            <!-- Either streaming or pipes can be specified for an action, not both -->
            <pipes>
                <map>[MAPPER]</map>
                <reduce>[REDUCER]</reduce>
                <inputformat>[INPUTFORMAT]</inputformat>
                <partitioner>[PARTITIONER]</partitioner>
                <writer>[OUTPUTFORMAT]</writer>
                <program>[EXECUTABLE]</program>
            </pipes>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <config-class>com.example.MyConfigClass</config-class>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </map-reduce>

        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
```

The `prepare` element, if present, indicates a list of paths to delete before starting the job. This should be used
exclusively for directory cleanup or for dropping hcatalog tables or table partitions for the job to be executed.
The delete operation
will be performed in the `fs.default.name` filesystem for hdfs URIs. The format for specifying an hcatalog table URI is
hcat://[metastore server]:[port]/[database name]/[table name] and the format for specifying an hcatalog table partition URI is
`hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value]`.
In the case of an hcatalog URI, the hive-site.xml needs to be shipped using the `file` tag and the hcatalog and hive jars
need to be placed in the workflow lib directory or specified using the `archive` tag.

The `job-xml` element, if present, must refer to a Hadoop JobConf `job.xml` file bundled in the workflow application.
By default the `job.xml` file is taken from the workflow application namenode, regardless of the namenode specified for the action.
To specify a `job.xml` on another namenode use a fully qualified file path.
The `job-xml` element is optional and, as of schema 0.4, multiple `job-xml` elements are allowed in order to specify multiple Hadoop JobConf `job.xml` files.

The `configuration` element, if present, contains JobConf properties for the Hadoop job.

Properties specified in the `configuration` element override properties specified in the file specified in the
`job-xml` element.

As of schema 0.5, the `config-class` element, if present, contains a class that implements OozieActionConfigurator that can be used
to further configure the MapReduce job.

Properties specified in the `config-class` class override properties specified in the `configuration` element.

External Stats can be turned on/off by specifying the property _oozie.action.external.stats.write_ as _true_ or _false_ in the configuration element of workflow.xml. The default value for this property is _false_.

The `file` element, if present, must specify the target symbolic link for binaries by separating the original file and target with a # (file#target-sym-link). This is not required for libraries.
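Putting the URI formats above together, a `prepare` block that drops an hcatalog table partition might look like this (the metastore host, port, database, table, and partition keys are illustrative):


```
<prepare>
    <delete path="hcat://metastore.example.com:9083/my_db/my_table/dt=2012-12-27;region=us"/>
</prepare>
```

Remember that for such a URI the hive-site.xml must be shipped via the `file` element and the hcatalog/hive jars made available as described above.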
-
-The `mapper` and `reducer` processes for streaming jobs should specify the executable command with URL encoding, e.g. '%' should be replaced by '%25'.
-
-**Example:**
-
-
-```
-<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="myfirstHadoopJob">
-        <map-reduce>
-            <resource-manager>foo:8032</resource-manager>
-            <name-node>bar:8020</name-node>
-            <prepare>
-                <delete path="hdfs://foo:8020/usr/tucu/output-data"/>
-            </prepare>
-            <job-xml>/myfirstjob.xml</job-xml>
-            <configuration>
-                <property>
-                    <name>mapred.input.dir</name>
-                    <value>/usr/tucu/input-data</value>
-                </property>
-                <property>
-                    <name>mapred.output.dir</name>
-                    <value>/usr/tucu/output-data</value>
-                </property>
-                <property>
-                    <name>mapred.reduce.tasks</name>
-                    <value>${firstJobReducers}</value>
-                </property>
-                <property>
-                    <name>oozie.action.external.stats.write</name>
-                    <value>true</value>
-                </property>
-            </configuration>
-        </map-reduce>
-        <ok to="myNextAction"/>
-        <error to="errorCleanup"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-In the above example, the number of Reducers to be used by the Map/Reduce job has to be specified as a parameter of
-the workflow job configuration when creating the workflow job.
-
-**Streaming Example:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
-    ...
- <action name="firstjob"> - <map-reduce> - <resource-manager>foo:8032</resource-manager> - <name-node>bar:8020</name-node> - <prepare> - <delete path="${output}"/> - </prepare> - <streaming> - <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper> - <reducer>/bin/bash testarchive/bin/reducer.sh</reducer> - </streaming> - <configuration> - <property> - <name>mapred.input.dir</name> - <value>${input}</value> - </property> - <property> - <name>mapred.output.dir</name> - <value>${output}</value> - </property> - <property> - <name>stream.num.map.output.key.fields</name> - <value>3</value> - </property> - </configuration> - <file>/users/blabla/testfile.sh#testfile</file> - <archive>/users/blabla/testarchive.jar#testarchive</archive> - </map-reduce> - <ok to="end"/> - <error to="kill"/> - </action> - ... -</workflow-app> -``` - - -**Pipes Example:** - - -``` -<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0"> - ... - <action name="firstjob"> - <map-reduce> - <resource-manager>foo:8032</resource-manager> - <name-node>bar:8020</name-node> - <prepare> - <delete path="${output}"/> - </prepare> - <pipes> - <program>bin/wordcount-simple#wordcount-simple</program> - </pipes> - <configuration> - <property> - <name>mapred.input.dir</name> - <value>${input}</value> - </property> - <property> - <name>mapred.output.dir</name> - <value>${output}</value> - </property> - </configuration> - <archive>/users/blabla/testarchive.jar#testarchive</archive> - </map-reduce> - <ok to="end"/> - <error to="kill"/> - </action> - ... -</workflow-app> -``` - - -<a name="PigAction"></a> -#### 3.2.3 Pig Action - -The `pig` action starts a Pig job. - -The workflow job will wait until the pig job completes before continuing to the next action. - -The `pig` action has to be configured with the resource-manager, name-node, pig script and the necessary parameters and -configuration to run the Pig job. 
-
-A `pig` action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before
-starting the Pig job. This capability enables Oozie to retry a Pig job in the situation of a transient failure (Pig
-creates temporary directories for intermediate data, thus a retry without cleanup would fail).
-
-Hadoop JobConf properties can be specified as part of
-
- * the `config-default.xml` or
- * JobConf XML file bundled with the workflow application or
- * \<global\> tag in workflow definition or
- * Inline `pig` action configuration.
-
-The configuration properties are loaded in the above order, i.e. `job-xml` and then `configuration`, and
-later values override earlier values.
-
-Inline property values can be parameterized (templatized) using EL expressions.
-
-The YARN `yarn.resourcemanager.address` and HDFS `fs.default.name` properties must not be present in the job-xml and inline
-configuration.
-
-As with Hadoop map-reduce jobs, it is possible to add files and archives to be available to the Pig job, refer to
-section [#FilesArchives][Adding Files and Archives for the Job].
-
-
-**Syntax for Pig actions in Oozie schema 1.0:**
-
-```
-<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="[NODE-NAME]">
-        <pig>
-            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
-            <name-node>[NAME-NODE]</name-node>
-            <prepare>
-                <delete path="[PATH]"/>
-                ...
-                <mkdir path="[PATH]"/>
-                ...
-            </prepare>
-            <job-xml>[JOB-XML-FILE]</job-xml>
-            <configuration>
-                <property>
-                    <name>[PROPERTY-NAME]</name>
-                    <value>[PROPERTY-VALUE]</value>
-                </property>
-                ...
-            </configuration>
-            <script>[PIG-SCRIPT]</script>
-            <param>[PARAM-VALUE]</param>
-            ...
-            <param>[PARAM-VALUE]</param>
-            <argument>[ARGUMENT-VALUE]</argument>
-            ...
-            <argument>[ARGUMENT-VALUE]</argument>
-            <file>[FILE-PATH]</file>
-            ...
-            <archive>[FILE-PATH]</archive>
-            ...
- </pig> - <ok to="[NODE-NAME]"/> - <error to="[NODE-NAME]"/> - </action> - ... -</workflow-app> -``` - -**Syntax for Pig actions in Oozie schema 0.2:** - -``` -<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.2"> - ... - <action name="[NODE-NAME]"> - <pig> - <job-tracker>[JOB-TRACKER]</job-tracker> - <name-node>[NAME-NODE]</name-node> - <prepare> - <delete path="[PATH]"/> - ... - <mkdir path="[PATH]"/> - ... - </prepare> - <job-xml>[JOB-XML-FILE]</job-xml> - <configuration> - <property> - <name>[PROPERTY-NAME]</name> - <value>[PROPERTY-VALUE]</value> - </property> - ... - </configuration> - <script>[PIG-SCRIPT]</script> - <param>[PARAM-VALUE]</param> - ... - <param>[PARAM-VALUE]</param> - <argument>[ARGUMENT-VALUE]</argument> - ... - <argument>[ARGUMENT-VALUE]</argument> - <file>[FILE-PATH]</file> - ... - <archive>[FILE-PATH]</archive> - ... - </pig> - <ok to="[NODE-NAME]"/> - <error to="[NODE-NAME]"/> - </action> - ... -</workflow-app> -``` - -**Syntax for Pig actions in Oozie schema 0.1:** - - -``` -<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1"> - ... - <action name="[NODE-NAME]"> - <pig> - <job-tracker>[JOB-TRACKER]</job-tracker> - <name-node>[NAME-NODE]</name-node> - <prepare> - <delete path="[PATH]"/> - ... - <mkdir path="[PATH]"/> - ... - </prepare> - <job-xml>[JOB-XML-FILE]</job-xml> - <configuration> - <property> - <name>[PROPERTY-NAME]</name> - <value>[PROPERTY-VALUE]</value> - </property> - ... - </configuration> - <script>[PIG-SCRIPT]</script> - <param>[PARAM-VALUE]</param> - ... - <param>[PARAM-VALUE]</param> - <file>[FILE-PATH]</file> - ... - <archive>[FILE-PATH]</archive> - ... - </pig> - <ok to="[NODE-NAME]"/> - <error to="[NODE-NAME]"/> - </action> - ... -</workflow-app> -``` - -The `prepare` element, if present, indicates a list of paths to delete before starting the job. This should be used -exclusively for directory cleanup or dropping of hcatalog table or table partitions for the job to be executed. 
The delete operation
-will be performed in the `fs.default.name` filesystem for hdfs URIs. The format for specifying an hcatalog table URI is
-hcat://[metastore server]:[port]/[database name]/[table name] and the format for specifying an hcatalog table partition URI is
-`hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value]`.
-In the case of an hcatalog URI, the hive-site.xml needs to be shipped using the `file` tag and the hcatalog and hive jars
-need to be placed in the workflow lib directory or specified using the `archive` tag.
-
-The `job-xml` element, if present, must refer to a Hadoop JobConf `job.xml` file bundled in the workflow application.
-The `job-xml` element is optional and, as of schema 0.4, multiple `job-xml` elements are allowed in order to specify multiple Hadoop JobConf `job.xml` files.
-
-The `configuration` element, if present, contains JobConf properties for the underlying Hadoop jobs.
-
-Properties specified in the `configuration` element override properties specified in the file specified in the
- `job-xml` element.
-
-External Stats can be turned on/off by specifying the property _oozie.action.external.stats.write_ as _true_ or _false_ in the configuration element of workflow.xml. The default value for this property is _false_.
-
-The inline and job-xml configuration properties are passed to the Hadoop jobs submitted by the Pig runtime.
-
-The `script` element contains the pig script to execute. The pig script can be templatized with variables of the
-form `${VARIABLE}`. The values of these variables can then be specified using the `param` elements.
-
-NOTE: Oozie will perform the parameter substitution before firing the pig job. This is different from the
-[parameter substitution mechanism provided by Pig](http://wiki.apache.org/pig/ParameterSubstitution), which has a
-few limitations.
-
-The `param` elements, if present, contain parameters to be passed to the pig script.
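-
-For example, a script line such as `A = LOAD '${INPUT}';` inside a pig script could have its `${INPUT}` variable bound
-through a `param` element as in the following sketch (the script name and value are hypothetical):
-
-
-```
-<pig>
-    ...
-    <script>/mypigscript.pig</script>
-    <param>INPUT=/usr/tucu/input-data</param>
-</pig>
-```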
-
-**In Oozie schema 0.2:**
-The `argument` elements, if present, contain arguments to be passed to the pig script.
-
-
-All the above elements can be parameterized (templatized) using EL expressions.
-
-**Example for Oozie schema 0.2:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2">
-    ...
-    <action name="myfirstpigjob">
-        <pig>
-            <job-tracker>foo:8021</job-tracker>
-            <name-node>bar:8020</name-node>
-            <prepare>
-                <delete path="${jobOutput}"/>
-            </prepare>
-            <configuration>
-                <property>
-                    <name>mapred.compress.map.output</name>
-                    <value>true</value>
-                </property>
-                <property>
-                    <name>oozie.action.external.stats.write</name>
-                    <value>true</value>
-                </property>
-            </configuration>
-            <script>/mypigscript.pig</script>
-            <argument>-param</argument>
-            <argument>INPUT=${inputDir}</argument>
-            <argument>-param</argument>
-            <argument>OUTPUT=${outputDir}/pig-output3</argument>
-        </pig>
-        <ok to="myotherjob"/>
-        <error to="errorcleanup"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-
-**Example for Oozie schema 0.1:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
-    ...
-    <action name="myfirstpigjob">
-        <pig>
-            <job-tracker>foo:8021</job-tracker>
-            <name-node>bar:8020</name-node>
-            <prepare>
-                <delete path="${jobOutput}"/>
-            </prepare>
-            <configuration>
-                <property>
-                    <name>mapred.compress.map.output</name>
-                    <value>true</value>
-                </property>
-            </configuration>
-            <script>/mypigscript.pig</script>
-            <param>InputDir=/home/tucu/input-data</param>
-            <param>OutputDir=${jobOutput}</param>
-        </pig>
-        <ok to="myotherjob"/>
-        <error to="errorcleanup"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-<a name="FsAction"></a>
-#### 3.2.4 Fs (HDFS) action
-
-The `fs` action allows manipulating files and directories in HDFS from a workflow application. The supported commands
-are `move`, `delete`, `mkdir`, `chmod`, `touchz`, `setrep` and `chgrp`.
-
-The FS commands are executed synchronously from within the FS action; the workflow job will wait until the specified
-file commands are completed before continuing to the next action.
-
-Path names specified in the `fs` action can be parameterized (templatized) using EL expressions.
-Path names should be specified as absolute paths. In the case of the `move`, `delete`, `chmod` and `chgrp` commands, a glob pattern can also be specified instead of an absolute path.
-For `move`, a glob pattern can only be specified for the source path, not the target.
-
-Each file path must specify the file system URI; for move operations, the target must not specify the file system URI.
-
-**IMPORTANT:** For the purposes of copying files within a cluster it is recommended to refer to the `distcp` action
-instead. Refer to the [`distcp`](DG_DistCpActionExtension.html) action to copy files within a cluster.
-
-**IMPORTANT:** The commands within an `fs` action do not execute atomically; if an `fs` action fails halfway through the
-commands being executed, successfully executed commands are not rolled back. Before executing any command, the `fs`
-action checks that source paths exist and target paths don't (the constraint regarding targets is relaxed for the `move` action; see below for details), thus failing before executing any command.
-The validity of all paths specified in one `fs` action is therefore evaluated before any of the file operations are
-executed, so there is less chance of an error occurring while the `fs` action executes.
-
-**Syntax:**
-
-
-```
-<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="[NODE-NAME]">
-        <fs>
-            <delete path='[PATH]' skip-trash='[true/false]'/>
-            ...
-            <mkdir path='[PATH]'/>
-            ...
-            <move source='[SOURCE-PATH]' target='[TARGET-PATH]'/>
-            ...
-            <chmod path='[PATH]' permissions='[PERMISSIONS]' dir-files='false' />
-            ...
-            <touchz path='[PATH]' />
-            ...
-            <chgrp path='[PATH]' group='[GROUP]' dir-files='false' />
-            ...
-            <setrep path='[PATH]' replication-factor='2'/>
-        </fs>
-        <ok to="[NODE-NAME]"/>
-        <error to="[NODE-NAME]"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-The `delete` command deletes the specified path; if it is a directory, it recursively deletes all its content and then
-deletes the directory. By default it skips the trash. Deleted paths can be moved to the trash instead by setting the value of skip-trash to
-'false'. The `delete` command can also be used to drop hcat tables/partitions. This is the only FS command which supports HCatalog URIs as well.
-For example:
-
-```
-<delete path='hcat://[metastore server]:[port]/[database name]/[table name]'/>
-OR
-<delete path='hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value];...'/>
-```
-
-The `mkdir` command creates the specified directory; it creates all missing directories in the path. If the directory
-already exists, it is a no-op.
-
-In the `move` command the `source` path must exist. The following scenarios are addressed for a `move`:
-
- * The file system URI (e.g. `hdfs://{nameNode}`) can be skipped in the `target` path. It is understood to be the same as that of the source. But if the target path does contain the system URI, it cannot be different than that of the source.
- * The parent directory of the `target` path must exist.
- * For the `target` path, if it is a file, then it must not already exist.
- * However, if the `target` path is an already existing directory, the `move` action will place your `source` as a child of the `target` directory.
-
-The `chmod` command changes the permissions for the specified path. Permissions can be specified using the Unix symbolic
-representation (e.g. -rwxrw-rw-) or an octal representation (755).
-When doing a `chmod` command on a directory, by default the command is applied to the directory and the files one level
-within the directory.
To apply the `chmod` command to the directory, without affecting the files within it,
-the `dir-files` attribute must be set to `false`. To apply the `chmod` command
-recursively to all levels within a directory, put a `recursive` element inside the \<chmod\> element.
-
-The `touchz` command creates a zero length file in the specified path if none exists. If one already exists, then touchz will perform a touch operation.
-Touchz works only for absolute paths.
-
-The `chgrp` command changes the group for the specified path.
-When doing a `chgrp` command on a directory, by default the command is applied to the directory and the files one level
-within the directory. To apply the `chgrp` command to the directory, without affecting the files within it,
-the `dir-files` attribute must be set to `false`.
-To apply the `chgrp` command recursively to all levels within a directory, put a `recursive` element inside the \<chgrp\> element.
-
-The `setrep` command changes the replication factor of HDFS files. Changing the replication factor of directories or
-symlinks is not supported; this command requires the `replication-factor` argument.
-
-**Example:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="hdfscommands">
-        <fs>
-            <delete path='hdfs://foo:8020/usr/tucu/temp-data'/>
-            <mkdir path='archives/${wf:id()}'/>
-            <move source='${jobInput}' target='archives/${wf:id()}/processed-input'/>
-            <chmod path='${jobOutput}' permissions='-rwxrw-rw-' dir-files='true'><recursive/></chmod>
-            <chgrp path='${jobOutput}' group='testgroup' dir-files='true'><recursive/></chgrp>
-            <setrep path='archives/${wf:id()}/[FILE-NAME]' replication-factor='2'/>
-        </fs>
-        <ok to="myotherjob"/>
-        <error to="errorcleanup"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-In the above example, a directory named after the workflow job ID is created and the input of the job, passed as a
-workflow configuration parameter, is archived under the previously created directory.
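-
-As noted earlier, `move`, `delete`, `chmod` and `chgrp` also accept glob patterns; a sketch (the paths below are
-hypothetical):
-
-
-```
-<fs>
-    <!-- delete all part files produced by a previous run -->
-    <delete path='hdfs://foo:8020/usr/tucu/temp-data/part-*'/>
-    <!-- move every matching source file into an existing target directory; the target itself must not be a glob -->
-    <move source='hdfs://foo:8020/usr/tucu/staging/2013-*' target='hdfs://foo:8020/usr/tucu/archive'/>
-</fs>
-```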
- -As of schema 0.4, if a `name-node` element is specified, then it is not necessary for any of the paths to start with the file system -URI as it is taken from the `name-node` element. This is also true if the name-node is specified in the global section -(see [Global Configurations](WorkflowFunctionalSpec.html#GlobalConfigurations)) - -As of schema 0.4, zero or more `job-xml` elements can be specified; these must refer to Hadoop JobConf `job.xml` formatted files -bundled in the workflow application. They can be used to set additional properties for the FileSystem instance. - -As of schema 0.4, if a `configuration` element is specified, then it will also be used to set additional JobConf properties for the -FileSystem instance. Properties specified in the `configuration` element override properties specified in the files specified -by any `job-xml` elements. - -**Example:** - - -``` -<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4"> - ... - <action name="hdfscommands"> - <fs> - <name-node>hdfs://foo:8020</name-node> - <job-xml>fs-info.xml</job-xml> - <configuration> - <property> - <name>some.property</name> - <value>some.value</value> - </property> - </configuration> - <delete path='/usr/tucu/temp-data'/> - </fs> - <ok to="myotherjob"/> - <error to="errorcleanup"/> - </action> - ... -</workflow-app> -``` - -<a name="SubWorkflowAction"></a> -#### 3.2.5 Sub-workflow Action - -The `sub-workflow` action runs a child workflow job, the child workflow job can be in the same Oozie system or in -another Oozie system. - -The parent workflow job will wait until the child workflow job has completed. - -There can be several sub-workflows defined within a single workflow, each under its own action element. - -**Syntax:** - - -``` -<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0"> - ... 
-    <action name="[NODE-NAME]">
-        <sub-workflow>
-            <app-path>[WF-APPLICATION-PATH]</app-path>
-            <propagate-configuration/>
-            <configuration>
-                <property>
-                    <name>[PROPERTY-NAME]</name>
-                    <value>[PROPERTY-VALUE]</value>
-                </property>
-                ...
-            </configuration>
-        </sub-workflow>
-        <ok to="[NODE-NAME]"/>
-        <error to="[NODE-NAME]"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-The child workflow job runs in the same Oozie system instance where the parent workflow job is running.
-
-The `app-path` element specifies the path to the workflow application of the child workflow job.
-
-The `propagate-configuration` flag, if present, indicates that the workflow job configuration should be propagated to
-the child workflow.
-
-The `configuration` section can be used to specify the job properties that are required to run the child workflow job.
-
-The configuration of the `sub-workflow` action can be parameterized (templatized) using EL expressions.
-
-**Example:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="a">
-        <sub-workflow>
-            <app-path>child-wf</app-path>
-            <configuration>
-                <property>
-                    <name>input.dir</name>
-                    <value>${wf:id()}/second-mr-output</value>
-                </property>
-            </configuration>
-        </sub-workflow>
-        <ok to="end"/>
-        <error to="kill"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-In the above example, the workflow definition with the name `child-wf` will be run on the same Oozie instance where the
-parent workflow job is running. The specified workflow application must be already deployed on the target Oozie instance.
-
-A configuration parameter `input.dir` is passed as a job property to the child workflow job.
-
-The subworkflow can inherit the lib jars from the parent workflow by setting `oozie.subworkflow.classpath.inheritance` to true
-in oozie-site.xml or on a per-job basis by setting `oozie.wf.subworkflow.classpath.inheritance` to true in a job.properties file.
-If both are specified, `oozie.wf.subworkflow.classpath.inheritance` has priority. If the subworkflow and the parent have
-conflicting jars, the subworkflow's jar has priority. By default, `oozie.wf.subworkflow.classpath.inheritance` is set to false.
-
-To prevent errant workflows from starting infinitely recursive subworkflows, `oozie.action.subworkflow.max.depth` can be specified
-in oozie-site.xml to set the maximum depth of subworkflow calls. For example, if set to 3, then a workflow can start subwf1, which
-can start subwf2, which can start subwf3; but if subwf3 tries to start subwf4, then the action will fail. The default is 50.
-
-<a name="JavaAction"></a>
-#### 3.2.6 Java Action
-
-The `java` action will execute the `public static void main(String[] args)` method of the specified main Java class.
-
-Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task.
-
-The workflow job will wait until the java application completes its execution before continuing to the next action.
-
-The `java` action has to be configured with the resource-manager, name-node, main Java class, JVM options and arguments.
-
-To indicate an `ok` action transition, the main Java class must complete the `main` method invocation gracefully.
-
-To indicate an `error` action transition, the main Java class must throw an exception.
-
-The main Java class can call `System.exit(int n)`. Exit code zero is regarded as OK, while non-zero exit codes will
-cause the `java` action to do an `error` transition and exit.
-
-A `java` action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before
-starting the Java application. This capability enables Oozie to retry a Java application in the situation of a transient
-or non-transient failure (this can be used to clean up any temporary data which may have been created by the Java
-application in case of failure).
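-
-Such a cleanup might look like the following sketch, assuming the Java application writes its temporary data under a
-hypothetical `${tempData}` path:
-
-
-```
-<java>
-    ...
-    <prepare>
-        <!-- remove leftovers from a previous, failed attempt before retrying -->
-        <delete path="${tempData}"/>
-    </prepare>
-    ...
-</java>
-```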
-
-A `java` action can create a Hadoop configuration for interacting with a cluster (e.g. launching a map-reduce job).
-Oozie prepares a Hadoop configuration file which includes the environment's site configuration files (e.g. hdfs-site.xml,
-mapred-site.xml, etc) plus the properties added to the `<configuration>` section of the `java` action. The Hadoop configuration
-file is made available as a local file to the Java application in its running directory. It can be added to the `java` action's
-Hadoop configuration by referencing the system property `oozie.action.conf.xml`. For example:
-
-
-```
-// loading action conf prepared by Oozie
-Configuration actionConf = new Configuration(false);
-actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));
-```
-
-If `oozie.action.conf.xml` is not added, then the job will pick up the mapred-default properties and this may result
-in unexpected behaviour. For repeated configuration properties, later values override earlier ones.
-
-Inline property values can be parameterized (templatized) using EL expressions.
-
-The YARN `yarn.resourcemanager.address` (`resource-manager`) and HDFS `fs.default.name` (`name-node`) properties must not be present
-in the `job-xml` and in the inline configuration.
-
-As with `map-reduce` and `pig` actions, it is possible to add files and archives to be available to the Java
-application. Refer to section [#FilesArchives][Adding Files and Archives for the Job].
-
-The `capture-output` element can be used to propagate values back into the Oozie context, which can then be accessed via
-EL functions. These values need to be written out as a java properties format file.
The filename is obtained via a System
-property specified by the constant `oozie.action.output.properties`.
-
-**IMPORTANT:** In order for a Java action to succeed on a secure cluster, it must propagate the Hadoop delegation token like in the
-following code snippet (this is benign on non-secure clusters):
-
-```
-// propagate delegation related props from launcher job to MR job
-if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
-    jobConf.set("mapreduce.job.credentials.binary", System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
-}
-```
-
-**IMPORTANT:** Because the Java application is run from within a Map-Reduce job, from Hadoop 0.20 onwards a queue must
-be assigned to it. The queue name must be specified as a configuration property.
-
-**IMPORTANT:** The Java application from a Java action is executed in a single map task. If the task is abnormally terminated,
-such as due to a TaskTracker restart (e.g. during cluster maintenance), the task will be retried via the normal Hadoop task
-retry mechanism. To avoid workflow failure, the application should be written in a fashion that is resilient to such retries,
-for example by detecting and deleting incomplete outputs or picking back up from complete outputs. Furthermore, if a Java action
-spawns asynchronous activity outside the JVM of the action itself (such as by launching additional MapReduce jobs), the
-application must consider the possibility of collisions with activity spawned by the new instance.
-
-**Syntax:**
-
-
-```
-<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="[NODE-NAME]">
-        <java>
-            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
-            <name-node>[NAME-NODE]</name-node>
-            <prepare>
-                <delete path="[PATH]"/>
-                ...
-                <mkdir path="[PATH]"/>
-                ...
-            </prepare>
-            <job-xml>[JOB-XML]</job-xml>
-            <configuration>
-                <property>
-                    <name>[PROPERTY-NAME]</name>
-                    <value>[PROPERTY-VALUE]</value>
-                </property>
-                ...
-            </configuration>
-            <main-class>[MAIN-CLASS]</main-class>
-            <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
-            <arg>ARGUMENT</arg>
-            ...
-            <file>[FILE-PATH]</file>
-            ...
-            <archive>[FILE-PATH]</archive>
-            ...
-            <capture-output />
-        </java>
-        <ok to="[NODE-NAME]"/>
-        <error to="[NODE-NAME]"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-The `prepare` element, if present, indicates a list of paths to delete before starting the Java application. This should
-be used exclusively for directory cleanup or dropping of hcatalog table or table partitions for the Java application to be executed.
-In the case of `delete`, a glob pattern can be used to specify the path.
-The format for specifying an hcatalog table URI is
-hcat://[metastore server]:[port]/[database name]/[table name] and the format for specifying an hcatalog table partition URI is
-`hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value]`.
-In the case of an hcatalog URI, the hive-site.xml needs to be shipped using the `file` tag and the hcatalog and hive jars
-need to be placed in the workflow lib directory or specified using the `archive` tag.
-
-The `java-opts` and `java-opt` elements, if present, contain the command line parameters which are to be used to start the JVM that
-will execute the Java application. Using this element is equivalent to using the `mapred.child.java.opts`
-or `mapreduce.map.java.opts` configuration properties, with the advantage that Oozie will append to these properties instead of
-simply setting them (e.g. if you have one of these properties specified in mapred-site.xml, setting it again in
-the `configuration` element will override it, but using `java-opts` or `java-opt` will instead append to it, preserving the original
-value). You can have either one `java-opts`, multiple `java-opt`, or neither; you cannot use both at the same time. In any case,
-Oozie will set both `mapred.child.java.opts` and `mapreduce.map.java.opts` to the same value based on combining them.
In other
-words, after Oozie is finished:
-
-```
-mapred.child.java.opts    <-- "<mapred.child.java.opts> <mapreduce.map.java.opts> <java-opt...|java-opts>"
-mapreduce.map.java.opts   <-- "<mapred.child.java.opts> <mapreduce.map.java.opts> <java-opt...|java-opts>"
-```
-In the case that parameters are repeated, the latest instance of the parameter is used by Java. This means that `java-opt`
-(or `java-opts`) has the highest priority, followed by `mapreduce.map.java.opts`, and finally `mapred.child.java.opts`. When
-multiple `java-opt` are specified, they are included in top-to-bottom order, where the bottom has the highest priority.
-
-The `arg` elements, if present, contain arguments for the main function. The value of each `arg` element is considered
-a single argument and they are passed to the `main` method in the same order.
-
-All the above elements can be parameterized (templatized) using EL expressions.
-
-**Example:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="myfirstjavajob">
-        <java>
-            <resource-manager>foo:8032</resource-manager>
-            <name-node>bar:8020</name-node>
-            <prepare>
-                <delete path="${jobOutput}"/>
-            </prepare>
-            <configuration>
-                <property>
-                    <name>mapred.queue.name</name>
-                    <value>default</value>
-                </property>
-            </configuration>
-            <main-class>org.apache.oozie.MyFirstMainClass</main-class>
-            <java-opts>-Dblah</java-opts>
-            <arg>argument1</arg>
-            <arg>argument2</arg>
-        </java>
-        <ok to="myotherjob"/>
-        <error to="errorcleanup"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-##### 3.2.6.1 Overriding an action's Main class
-
-This feature is useful for developers to change the Main classes without having to recompile or redeploy Oozie.
-
-For most actions (not just the Java action), you can override the Main class it uses by specifying the following `configuration`
-property and making sure that your class is included in the workflow's classpath.
If you specify this in the Java action,
-the main-class element has priority.
-
-
-```
-<property>
-    <name>oozie.launcher.action.main.class</name>
-    <value>org.my.CustomMain</value>
-</property>
-```
-
-**Note:** Most actions typically pass information to their corresponding Main in specific ways; you should look at the action's
-existing Main to see how it works before creating your own. In fact, it's probably simplest to just subclass the existing Main and
-add/modify/overwrite any behavior you want to change.
-
-<a name="GitAction"></a>
-#### 3.2.7 Git action
-
-The `git` action allows one to clone a Git repository into HDFS. The supported options are `git-uri`, `branch`, `key-path`
-and `destination-uri`.
-
-The `git clone` action is executed asynchronously by one of the YARN containers assigned to run on the cluster. If an SSH key is
-specified, it will be created on the file system in a YARN container's local directory, relying on the YARN NodeManager to remove the
-file after the action has run.
-
-Path names specified in the `git` action can be parameterized (templatized) using EL expressions,
-e.g. `${wf:user()}`. Path names should be specified as absolute paths. Each file path must specify the file system URI.
-
-**Syntax:**
-
-
-```
-<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
-    ...
-    <action name="[NODE-NAME]">
-        <git>
-            <git-uri>[SOURCE-URI]</git-uri>
-            ...
-            <branch>[BRANCH]</branch>
-            ...
-            <key-path>[HDFS-PATH]</key-path>
-            ...
-            <destination-uri>[HDFS-PATH]</destination-uri>
-        </git>
-        <ok to="[NODE-NAME]"/>
-        <error to="[NODE-NAME]"/>
-    </action>
-    ...
-</workflow-app>
-```
-
-**Example:**
-
-
-```
-<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
-    ...
-    <action name="clone_oozie">
-        <git>
-            <git-uri>https://github.com/apache/oozie</git-uri>
-            <destination-uri>hdfs://my_git_repo_directory</destination-uri>
-        </git>
-        <ok to="myotherjob"/>
-        <error to="errorcleanup"/>
-    </action>
-    ...
</workflow-app>
```

In the above example, a Git repository on e.g. GitHub.com is cloned to the HDFS directory `my_git_repo_directory`, which must not
already exist on the filesystem. Note that repository addresses outside of GitHub.com but accessible to the YARN container
running the Git action may also be used.

If a `name-node` element is specified, then it is not necessary for any of the paths to start with the file system URI, as it is
taken from the `name-node` element.

The `resource-manager` (Oozie 5.x) element has to be specified to name the YARN ResourceManager address.

If any of the paths need to be served from another HDFS namenode, its address has to be part of
that filesystem URI prefix:

```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <action name="[NODE-NAME]">
        <git>
            ...
            <name-node>hdfs://name-node.first.company.com:8020</name-node>
            ...
            <key-path>hdfs://name-node.second.company.com:8020/[HDFS-PATH]</key-path>
            ...
        </git>
        ...
    </action>
    ...
</workflow-app>
```

This is also true if the name-node is specified in the global section (see
[Global Configurations](WorkflowFunctionalSpec.html#GlobalConfigurations)).

Be aware that `key-path` might point to a secure object store location other than the current `fs.defaultFS`. In that case,
appropriate file permissions are still necessary (readable by the submitting user), credentials provided, etc.

As of workflow schema 1.0, zero or more `job-xml` elements can be specified; these must refer to Hadoop JobConf `job.xml` formatted
files bundled in the workflow application. They can be used to set additional properties for the `FileSystem` instance.

As of workflow schema 1.0, if a `configuration` element is specified, then it will also be used to set additional `JobConf`
properties for the `FileSystem` instance.
Properties specified in the `configuration` element override properties specified in the files referenced by
any `job-xml` elements.

**Example:**

```
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
    ...
    <action name="[NODE-NAME]">
        <git>
            ...
            <name-node>hdfs://foo:8020</name-node>
            <job-xml>fs-info.xml</job-xml>
            <configuration>
                <property>
                    <name>some.property</name>
                    <value>some.value</value>
                </property>
            </configuration>
        </git>
        ...
    </action>
    ...
</workflow-app>
```

<a name="WorkflowParameterization"></a>
## 4 Parameterization of Workflows

Workflow definitions can be parameterized.

When a workflow node is executed by Oozie, all the EL expressions are resolved into concrete values.

The parameterization of workflow definitions is done using JSP Expression Language syntax from the
[JSP 2.0 Specification (JSP.2.3)](http://jcp.org/aboutJava/communityprocess/final/jsr152/index.html), allowing not only
variables as parameters but also functions and complex expressions.

EL expressions can be used in the configuration values of action and decision nodes. They can be used in XML element
and attribute values.

They cannot be used in XML element and attribute names. They cannot be used in the name of a node and they cannot be
used within the `transition` elements of a node.

<a name="WorkflowProperties"></a>
### 4.1 Workflow Job Properties (or Parameters)

When a workflow job is submitted to Oozie, the submitter may specify as many workflow job properties as required
(similar to Hadoop JobConf properties).

Workflow applications may define default values for the workflow job or action parameters. They must be defined in a
`config-default.xml` file bundled with the workflow application archive (refer to section '7 Workflow
Applications Packaging'). Job or action properties specified in the workflow definition have precedence over the default values.
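To illustrate the defaulting mechanism described above, a minimal `config-default.xml` bundled at the root of the workflow
application directory could look like the following sketch (the property names here are illustrative, not prescribed by Oozie):

```
<configuration>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
    <property>
        <name>outputDir</name>
        <value>out-dir</value>
    </property>
</configuration>
```

Values supplied in the job properties at submission time, or set in the workflow definition itself, take precedence over these
defaults.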

Properties that are valid Java identifiers, `[A-Za-z_][0-9A-Za-z_]*`, are available as `${NAME}`
variables within the workflow definition.

Properties that are not valid Java identifiers, for example `job.tracker`, are available via the `String
wf:conf(String name)` function. Valid identifier properties are available via this function as well.

Using properties that are valid Java identifiers results in a more readable and compact definition.

**Example:**

Parameterized Workflow definition:


```
<workflow-app name='hello-wf' xmlns="uri:oozie:workflow:1.0">
    ...
    <action name='firstjob'>
        <map-reduce>
            <resource-manager>${resourceManager}</resource-manager>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.foo.FirstMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.foo.FirstReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='secondjob'/>
        <error to='killcleanup'/>
    </action>
    ...
</workflow-app>
```

When submitting a workflow job for the workflow definition above, the following workflow job properties must be specified:

   * `resourceManager`
   * `nameNode`
   * `inputDir`
   * `outputDir`

If the above properties are not specified, the job will fail.

As of schema 0.4, a list of formal parameters can be provided which will allow Oozie to verify, at submission time, that said
properties are actually specified (i.e. before the job is executed and fails). Default values can also be provided.

**Example:**

The previous parameterized workflow definition with formal parameters:


```
<workflow-app name='hello-wf' xmlns="uri:oozie:workflow:1.0">
    <parameters>
        <property>
            <name>inputDir</name>
        </property>
        <property>
            <name>outputDir</name>
            <value>out-dir</value>
        </property>
    </parameters>
    ...
    <action name='firstjob'>
        <map-reduce>
            <resource-manager>${resourceManager}</resource-manager>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.foo.FirstMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.foo.FirstReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='secondjob'/>
        <error to='killcleanup'/>
    </action>
    ...
</workflow-app>
```

In the above example, if `inputDir` is not specified, Oozie will print an error message instead of submitting the job. If
`outputDir` is not specified, Oozie will use the default value, `out-dir`.

<a name="WorkflowELSupport"></a>
### 4.2 Expression Language Functions

Besides allowing the use of workflow job properties to parameterize workflow jobs, Oozie provides a set of built-in
EL functions that enable a more complex parameterization of workflow action nodes as well as the predicates in
decision nodes.

#### 4.2.1 Basic EL Constants

   * **KB:** 1024, one kilobyte.
   * **MB:** 1024 * KB, one megabyte.
   * **GB:** 1024 * MB, one gigabyte.
   * **TB:** 1024 * GB, one terabyte.
   * **PB:** 1024 * TB, one petabyte.

All the above constants are of type `long`.

#### 4.2.2 Basic EL Functions

**String firstNotNull(String value1, String value2)**

It returns the first not-`null` value, or `null` if both are `null`.

Note that if the output of this function is `null` and it is used as a string, the EL library converts it to an empty
string. This is the common behavior when using `firstNotNull()` in node configuration sections.

**String concat(String s1, String s2)**

It returns the concatenation of 2 strings. A string with `null` value is considered as an empty string.

**String replaceAll(String src, String regex, String replacement)**

It replaces each substring of `src` that matches the regular expression `regex` with the `replacement` string and returns the
resulting string. A `regex` string with `null` value is considered as no change. A `replacement` string with `null` value is
considered as an empty string.

**String appendAll(String src, String append, String delimiter)**

It appends the `append` string to each of the substrings obtained by splitting `src` on `delimiter`.
E.g. `appendAll("/a/b/,/c/b/,/c/d/", "ADD", ",")` will return `/a/b/ADD,/c/b/ADD,/c/d/ADD`. An `append` string with `null` value
is considered as an empty string. A `delimiter` string with value `null` is considered as no append in the string.

**String trim(String s)**

It returns the trimmed value of the given string. A string with `null` value is considered as an empty string.

**String urlEncode(String s)**

It returns the URL UTF-8 encoded value of the given string. A string with `null` value is considered as an empty string.

**String timestamp()**

It returns the current datetime in ISO8601 format, down to minutes (yyyy-MM-ddTHH:mmZ), in Oozie's processing timezone,
e.g. 1997-07-16T19:20Z

**String toJsonStr(Map<String, String>)** (since Oozie 3.3)

It returns an XML encoded JSON representation of a Map<String, String>. This function is useful to encode as
a single property the complete action-data of an action, **wf:actionData(String actionName)**, in order to pass
it in full to another action.
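As an illustration of the `appendAll()` semantics described above, here is a minimal Java sketch. This is not Oozie's actual
implementation; the class name and null-handling details follow the textual description and are assumptions for this example:

```java
public class AppendAllSketch {

    // Mimics appendAll(src, append, delimiter): appends the suffix to each
    // delimiter-separated piece of src, keeping the delimiter between pieces.
    public static String appendAll(String src, String append, String delimiter) {
        if (append == null) {
            append = "";   // a null append string is treated as an empty string
        }
        if (delimiter == null) {
            return src;    // a null delimiter means no append is performed
        }
        // Quote the delimiter so it is treated literally, not as a regex;
        // limit -1 keeps trailing empty substrings.
        String[] parts = src.split(java.util.regex.Pattern.quote(delimiter), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append(delimiter);
            }
            out.append(parts[i]).append(append);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example from the text above.
        System.out.println(appendAll("/a/b/,/c/b/,/c/d/", "ADD", ","));
        // prints "/a/b/ADD,/c/b/ADD,/c/d/ADD"
    }
}
```

The quoting step matters because `String.split` takes a regular expression, while the EL function's delimiter is a plain string.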

**String toPropertiesStr(Map<String, String>)** (since Oozie 3.3)

It returns an XML encoded Properties representation of a Map<String, String>. This function is useful to encode as
a single property the complete action-data of an action, **wf:actionData(String actionName)**, in order to pass
it in full to another action.

**String toConfigurationStr(Map<String, String>)** (since Oozie 3.3)

It returns an XML encoded Configuration representation of a Map<String, String>. This function is useful to encode as
a single property the complete action-data of an action, **wf:actionData(String actionName)**, in order to pass
it in full to another action.

#### 4.2.3 Workflow EL Functions

**String wf:id()**

It returns the workflow job ID for the current workflow job.

**String wf:name()**

It returns the workflow application name for the current workflow job.

**String wf:appPath()**

It returns the workflow application path for the current workflow job.

**String wf:conf(String name)**

It returns the value of the workflow job configuration property for the current workflow job, or an empty string if
undefined.

**String wf:user()**

It returns the user name that started the current workflow job.

**String wf:group()**

It returns the group/ACL for the current workflow job.

**String wf:callback(String stateVar)**

It returns the callback URL for the current workflow action node; `stateVar` can be a valid exit state (`OK` or
`ERROR`) for the action or a token to be replaced with the exit state by the remote system executing the task.

**String wf:transition(String node)**

It returns the transition taken by the specified workflow action node, or an empty string if the action has not been
executed or has not completed yet.

**String wf:lastErrorNode()**

It returns the name of the last workflow action node that exited with an `ERROR` exit state, or an empty string if no
action has exited with `ERROR` state in the current workflow job.

**String wf:errorCode(String node)**

It returns the error code for the specified action node, or an empty string if the action node has not exited
with `ERROR` state.

Each type of action node must define its complete error code list.

**String wf:errorMessage(String node)**

It returns the error message for the specified action node, or an empty string if the action node has not exited
with `ERROR` state.

The error message can be useful for debugging and notification purposes.

**int wf:run()**

It returns the run number for the current workflow job, normally `0` unless the workflow job is re-run, in which case it
indicates the current run.

**Map<String, String> wf:actionData(String node)**

This function is only applicable to action nodes that produce output data on completion.

The output data is in a Java Properties format and via this EL function it is available as a `Map<String, String>`.

**String wf:actionExternalId(String node)**

It returns the external Id for an action node, or an empty string if the action has not been executed or has not
completed yet.

**String wf:actionTrackerUri(String node)**

It returns the tracker URI for an action node, or an empty string if the action has not been executed or has not
completed yet.

**String wf:actionExternalStatus(String node)**

It returns the external status for an action node, or an empty string if the action has not been executed or has
not completed yet.

#### 4.2.4 Hadoop EL Constants

   * **RECORDS:** Hadoop record counters group name.
   * **MAP_IN:** Hadoop mapper input records counter name.
   * **MAP_OUT:** Hadoop mapper output records counter name.
   * **REDUCE_IN:** Hadoop reducer input records counter name.
   * **REDUCE_OUT:** Hadoop reducer output records counter name.
   * **GROUPS:** Hadoop mapper/reducer record groups counter name.

#### 4.2.5 Hadoop EL Functions

<a name="HadoopCountersEL"></a>
**Map < String, Map < String, Long > > hadoop:counters(String node)**

It returns the counters for a job submitted by a Hadoop action node. It returns `0` if the Hadoop job has not
started yet and for undefined counters.

The outer Map key is a counter group name. The inner Map value is a Map of counter names and counter values.

The Hadoop EL constants defined in the previous section provide access to the Hadoop built-in record counters.

This function can also be used to access specific action statistics information. Examples of action stats and their usage
through EL functions (referenced in workflow xml) are given below.

**Example of MR action stats:**

```
{
    "ACTION_TYPE": "MAP_REDUCE",
    "org.apache.hadoop.mapred.JobInProgress$Counter": {
        "TOTAL_LAUNCHED_REDUCES": 1,
        "TOTAL_LAUNCHED_MAPS": 1,
        "DATA_LOCAL_MAPS": 1
    },
    "FileSystemCounters": {
        "FILE_BYTES_READ": 1746,
        "HDFS_BYTES_READ": 1409,
        "FILE_BYTES_WRITTEN": 3524,
        "HDFS_BYTES_WRITTEN": 1547
    },
    "org.apache.hadoop.mapred.Task$Counter": {
        "REDUCE_INPUT_GROUPS": 33,
        "COMBINE_OUTPUT_RECORDS": 0,
        "MAP_INPUT_RECORDS": 33,
        "REDUCE_SHUFFLE_BYTES": 0,
        "REDUCE_OUTPUT_RECORDS": 33,
        "SPILLED_RECORDS": 66,
        "MAP_OUTPUT_BYTES": 1674,
        "MAP_INPUT_BYTES": 1409,
        "MAP_OUTPUT_RECORDS": 33,
        "COMBINE_INPUT_RECORDS": 0,
        "REDUCE_INPUT_RECORDS": 33
    }
}
```

Below is a workflow that shows how to access specific information from the MR stats using the hadoop:counters() EL function.
-**Workflow xml:** - -``` -<workflow-app xmlns="uri:oozie:workflow:1.0" name="map-reduce-wf"> - <start to="mr-node"/> - <action name="mr-node"> - <map-reduce> - <resource-manager>${resourceManager}</resource-manager> - <name-node>${nameNode}</name-node> - <prepare> - <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}"/> - </prepare> - <configuration> - <property> - <name>mapred.job.queue.name</name> - <value>${queueName}</value> - </property> - <property> - <name>mapred.mapper.class</name> - <value>org.apache.oozie.example.SampleMapper</value> - </property> - <property> - <name>mapred.reducer.class</name> - <value>org.apache.oozie.example.SampleReducer</value> - </property> - <property> - <name>mapred.map.tasks</name> - <value>1</value> - </property> - <property> - <name>mapred.input.dir</name> - <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value> - </property> - <property> - <name>mapred.output.dir</name> - <value>/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}</value> - </property> - <property> - <name>oozie.action.external.stats.write</name> - <value>true</value> - </property> - </configuration> - </map-reduce> - <ok to="java1"/> - <error to="fail"/> - </action> - <action name="java1"> - <java> - <resource-manager>${resourceManager}</resource-manager> - <name-node>${nameNode}</name-node> - <configuration> - <property> - <name>mapred.job.queue.name</name> - <value>${queueName}</value> - </property> - </configuration> - <main-class>MyTest</main-class> - <arg> ${hadoop:counters("mr-node")["FileSystemCounters"]["FILE_BYTES_READ"]}</arg> - <capture-output/> - </java> - <ok to="end" /> - <error to="fail" /> - </action> - <kill name="fail"> - <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> - </kill> - <end name="end"
<TRUNCATED>