Added:
websites/staging/oozie/trunk/content/docs/5.2.0/WorkflowFunctionalSpec.html
==============================================================================
--- websites/staging/oozie/trunk/content/docs/5.2.0/WorkflowFunctionalSpec.html
(added)
+++ websites/staging/oozie/trunk/content/docs/5.2.0/WorkflowFunctionalSpec.html
Fri Dec 6 09:10:49 2019
@@ -0,0 +1,4960 @@
+<!DOCTYPE html>
+<!--
+ | Generated by Apache Maven Doxia at 2019-12-05
+ | Rendered using Apache Maven Fluido Skin 1.4
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+ <head>
+ <meta charset="UTF-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta name="Date-Revision-yyyymmdd" content="20191205" />
+ <meta http-equiv="Content-Language" content="en" />
+ <title>Oozie – Oozie Specification, a Hadoop Workflow System</title>
+ <link rel="stylesheet" href="./css/apache-maven-fluido-1.4.min.css" />
+ <link rel="stylesheet" href="./css/site.css" />
+ <link rel="stylesheet" href="./css/print.css" media="print" />
+
+
+ <script type="text/javascript"
src="./js/apache-maven-fluido-1.4.min.js"></script>
+
+
+ </head>
+ <body class="topBarDisabled">
+
+
+
+ <div class="container-fluid">
+ <div id="banner">
+ <div class="pull-left">
+ <a href="https://oozie.apache.org/"
id="bannerLeft">
+
<img src="https://oozie.apache.org/images/oozie_200x.png"
alt="Oozie"/>
+ </a>
+ </div>
+ <div class="pull-right"> </div>
+ <div class="clear"><hr/></div>
+ </div>
+
+ <div id="breadcrumbs">
+ <ul class="breadcrumb">
+
+
+ <li class="">
+ <a href="http://www.apache.org/" class="externalLink"
title="Apache">
+ Apache</a>
+ <span class="divider">/</span>
+ </li>
+ <li class="">
+ <a href="../../" title="Oozie">
+ Oozie</a>
+ <span class="divider">/</span>
+ </li>
+ <li class="">
+ <a href="../" title="docs">
+ docs</a>
+ <span class="divider">/</span>
+ </li>
+ <li class="">
+ <a href="./" title="5.2.0">
+ 5.2.0</a>
+ <span class="divider">/</span>
+ </li>
+ <li class="active "></li>
+
+
+
+ <li id="publishDate" class="pull-right"><span
class="divider">|</span> Last Published: 2019-12-05</li>
+ <li id="projectVersion" class="pull-right">
+ Version: 5.2.0
+ </li>
+
+ </ul>
+ </div>
+
+
+ <div class="row-fluid">
+ <div id="leftColumn" class="span2">
+ <div class="well sidebar-nav">
+
+
+ <ul class="nav nav-list">
+ </ul>
+
+
+
+ <hr />
+
+ <div id="poweredBy">
+ <div class="clear"></div>
+ <div class="clear"></div>
+ <div class="clear"></div>
+ <div class="clear"></div>
+ <a href="http://maven.apache.org/" title="Built
by Maven" class="poweredBy">
+ <img class="builtBy" alt="Built by Maven"
src="./images/logos/maven-feather.png" />
+ </a>
+ </div>
+ </div>
+ </div>
+
+
+ <div id="bodyColumn" class="span10" >
+
+ <p><a href="index.html">::Go back to Oozie Documentation
Index::</a></p><hr />
+<h1>Oozie Specification, a Hadoop Workflow System</h1>
+<p>The goal of this document is to define a workflow engine system specialized
in coordinating the execution of Hadoop Map/Reduce and Pig jobs.</p>
+<ul>
+<li><a href="#Changelog">Changelog</a></li>
+<li><a href="#a0_Definitions">0 Definitions</a></li>
+<li><a href="#a1_Specification_Highlights">1 Specification Highlights</a></li>
+<li><a href="#a2_Workflow_Definition">2 Workflow Definition</a>
+<ul>
+<li><a href="#a2.1_Cycles_in_Workflow_Definitions">2.1 Cycles in Workflow
Definitions</a></li></ul></li>
+<li><a href="#a3_Workflow_Nodes">3 Workflow Nodes</a>
+<ul>
+<li><a href="#a3.1_Control_Flow_Nodes">3.1 Control Flow Nodes</a>
+<ul>
+<li><a href="#a3.1.1_Start_Control_Node">3.1.1 Start Control Node</a></li>
+<li><a href="#a3.1.2_End_Control_Node">3.1.2 End Control Node</a></li>
+<li><a href="#a3.1.3_Kill_Control_Node">3.1.3 Kill Control Node</a></li>
+<li><a href="#a3.1.4_Decision_Control_Node">3.1.4 Decision Control
Node</a></li>
+<li><a href="#a3.1.5_Fork_and_Join_Control_Nodes">3.1.5 Fork and Join Control
Nodes</a></li></ul></li>
+<li><a href="#a3.2_Workflow_Action_Nodes">3.2 Workflow Action Nodes</a>
+<ul>
+<li><a href="#a3.2.1_Action_Basis">3.2.1 Action Basis</a>
+<ul>
+<li><a href="#a3.2.1.1_Action_ComputationProcessing_Is_Always_Remote">3.2.1.1
Action Computation/Processing Is Always Remote</a></li>
+<li><a href="#a3.2.1.2_Actions_Are_Asynchronous">3.2.1.2 Actions Are
Asynchronous</a></li>
+<li><a href="#a3.2.1.3_Actions_Have_2_Transitions_ok_and_error">3.2.1.3
Actions Have 2 Transitions, ok and error</a></li>
+<li><a href="#a3.2.1.4_Action_Recovery">3.2.1.4 Action
Recovery</a></li></ul></li>
+<li><a href="#a3.2.2_Map-Reduce_Action">3.2.2 Map-Reduce Action</a>
+<ul>
+<li><a href="#a3.2.2.1_Adding_Files_and_Archives_for_the_Job">3.2.2.1 Adding
Files and Archives for the Job</a></li>
+<li><a
href="#a3.2.2.2_Configuring_the_MapReduce_action_with_Java_code">3.2.2.2
Configuring the MapReduce action with Java code</a></li>
+<li><a href="#a3.2.2.3_Streaming">3.2.2.3 Streaming</a></li>
+<li><a href="#a3.2.2.4_Pipes">3.2.2.4 Pipes</a></li>
+<li><a href="#a3.2.2.5_Syntax">3.2.2.5 Syntax</a></li></ul></li>
+<li><a href="#a3.2.3_Pig_Action">3.2.3 Pig Action</a></li>
+<li><a href="#a3.2.4_Fs_HDFS_action">3.2.4 Fs (HDFS) action</a></li>
+<li><a href="#a3.2.5_Sub-workflow_Action">3.2.5 Sub-workflow Action</a></li>
+<li><a href="#a3.2.6_Java_Action">3.2.6 Java Action</a>
+<ul>
+<li><a href="#a3.2.6.1_Overriding_an_actions_Main_class">3.2.6.1 Overriding an
action’s Main class</a></li></ul></li></ul></li></ul></li>
+<li><a href="#a4_Parameterization_of_Workflows">4 Parameterization of
Workflows</a>
+<ul>
+<li><a href="#a4.1_Workflow_Job_Properties_or_Parameters">4.1 Workflow Job
Properties (or Parameters)</a></li>
+<li><a href="#a4.2_Expression_Language_Functions">4.2 Expression Language
Functions</a>
+<ul>
+<li><a href="#a4.2.1_Basic_EL_Constants">4.2.1 Basic EL Constants</a></li>
+<li><a href="#a4.2.2_Basic_EL_Functions">4.2.2 Basic EL Functions</a></li>
+<li><a href="#a4.2.3_Workflow_EL_Functions">4.2.3 Workflow EL
Functions</a></li>
+<li><a href="#a4.2.4_Hadoop_EL_Constants">4.2.4 Hadoop EL Constants</a></li>
+<li><a href="#a4.2.5_Hadoop_EL_Functions">4.2.5 Hadoop EL Functions</a></li>
+<li><a href="#a4.2.6_Hadoop_Jobs_EL_Function">4.2.6 Hadoop Jobs EL
Function</a></li>
+<li><a href="#a4.2.7_HDFS_EL_Functions">4.2.7 HDFS EL Functions</a></li>
+<li><a href="#a4.2.8_HCatalog_EL_Functions">4.2.8 HCatalog EL
Functions</a></li></ul></li></ul></li>
+<li><a href="#a5_Workflow_Notifications">5 Workflow Notifications</a>
+<ul>
+<li><a href="#a5.1_Workflow_Job_Status_Notification">5.1 Workflow Job Status
Notification</a></li>
+<li><a href="#a5.2_Node_Start_and_End_Notifications">5.2 Node Start and End
Notifications</a></li></ul></li>
+<li><a href="#a6_User_Propagation">6 User Propagation</a></li>
+<li><a href="#a7_Workflow_Application_Deployment">7 Workflow Application
Deployment</a></li>
+<li><a href="#a8_External_Data_Assumptions">8 External Data
Assumptions</a></li>
+<li><a href="#a9_Workflow_Jobs_Lifecycle">9 Workflow Jobs Lifecycle</a>
+<ul>
+<li><a href="#a9.1_Workflow_Job_Lifecycle">9.1 Workflow Job Lifecycle</a></li>
+<li><a href="#a9.2_Workflow_Action_Lifecycle">9.2 Workflow Action
Lifecycle</a></li></ul></li>
+<li><a href="#a10_Workflow_Jobs_Recovery_re-run">10 Workflow Jobs Recovery
(re-run)</a></li>
+<li><a href="#a11_Oozie_Web_Services_API">11 Oozie Web Services API</a></li>
+<li><a href="#a12_Client_API">12 Client API</a></li>
+<li><a href="#a13_Command_Line_Tools">13 Command Line Tools</a></li>
+<li><a href="#a14_Web_UI_Console">14 Web UI Console</a></li>
+<li><a href="#a15_Customizing_Oozie_with_Extensions">15 Customizing Oozie with
Extensions</a></li>
+<li><a href="#a16_Workflow_Jobs_Priority">16 Workflow Jobs Priority</a></li>
+<li><a
href="#a17_HDFS_Share_Libraries_for_Workflow_Applications_since_Oozie_2.3">17
HDFS Share Libraries for Workflow Applications (since Oozie 2.3)</a>
+<ul>
+<li><a href="#a17.1_Action_Share_Library_Override_since_Oozie_3.3">17.1 Action
Share Library Override (since Oozie 3.3)</a></li>
+<li><a href="#a17.2_Action_Share_Library_Exclude_since_Oozie_5.2">17.2 Action
Share Library Exclude (since Oozie 5.2)</a></li></ul></li>
+<li><a href="#a18_User-Retry_for_Workflow_Actions_since_Oozie_3.1">18
User-Retry for Workflow Actions (since Oozie 3.1)</a></li>
+<li><a href="#a19_Global_Configurations">19 Global Configurations</a></li>
+<li><a href="#a20_Suspend_On_Nodes">20 Suspend On Nodes</a></li>
+<li><a href="#Appendixes">Appendixes</a>
+<ul>
+<li><a href="#Appendix_A_Oozie_Workflow_and_Common_XML_Schemas">Appendix A,
Oozie Workflow and Common XML Schemas</a>
+<ul>
+<li><a href="#Oozie_Workflow_Schema_Version_1.0">Oozie Workflow Schema Version
1.0</a></li>
+<li><a href="#Oozie_Common_Schema_Version_1.0">Oozie Common Schema Version
1.0</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.5">Oozie Workflow Schema Version
0.5</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.4.5">Oozie Workflow Schema
Version 0.4.5</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.4">Oozie Workflow Schema Version
0.4</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.3">Oozie Workflow Schema Version
0.3</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.2.5">Oozie Workflow Schema
Version 0.2.5</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.2">Oozie Workflow Schema Version
0.2</a></li>
+<li><a href="#Oozie_SLA_Version_0.2">Oozie SLA Version 0.2</a></li>
+<li><a href="#Oozie_SLA_Version_0.1">Oozie SLA Version 0.1</a></li>
+<li><a href="#Oozie_Workflow_Schema_Version_0.1">Oozie Workflow Schema Version
0.1</a></li></ul></li>
+<li><a href="#Appendix_B_Workflow_Examples">Appendix B, Workflow Examples</a>
+<ul>
+<li><a href="#Fork_and_Join_Example">Fork and Join
Example</a></li></ul></li></ul></li></ul>
+
+<div class="section">
+<h2><a name="Changelog"></a>Changelog</h2>
+<p><b>2016FEB19</b></p>
+<ul>
+
+<li>3.2.7 Updated notes on System.exit(int n) behavior</li>
+</ul>
+<p><b>2015APR29</b></p>
+<ul>
+
+<li>3.2.1.4 Added notes about Java action retries</li>
+<li>3.2.7 Added notes about Java action retries</li>
+</ul>
+<p><b>2014MAY08</b></p>
+<ul>
+
+<li>3.2.2.4 Added support for fully qualified job-xml path</li>
+</ul>
+<p><b>2013JUL03</b></p>
+<ul>
+
+<li>Appendix A, Added new workflow schema 0.5 and SLA schema 0.2</li>
+</ul>
+<p><b>2012AUG30</b></p>
+<ul>
+
+<li>4.2.2 Added two EL functions (replaceAll and appendAll)</li>
+</ul>
+<p><b>2012JUL26</b></p>
+<ul>
+
+<li>Appendix A, updated XML schema 0.4 to include <tt>parameters</tt>
element</li>
+<li>4.1 Updated to mention about <tt>parameters</tt> element as of schema
0.4</li>
+</ul>
+<p><b>2012JUL23</b></p>
+<ul>
+
+<li>Appendix A, updated XML schema 0.4 (Fs action)</li>
+<li>3.2.4 Updated to mention that a <tt>name-node</tt>, a <tt>job-xml</tt>,
and a <tt>configuration</tt> element are allowed in the Fs action as of schema
0.4</li>
+</ul>
+<p><b>2012JUN19</b></p>
+<ul>
+
+<li>Appendix A, added XML schema 0.4</li>
+<li>3.2.2.4 Updated to mention that multiple <tt>job-xml</tt> elements are
allowed as of schema 0.4</li>
+<li>3.2.3 Updated to mention that multiple <tt>job-xml</tt> elements are
allowed as of schema 0.4</li>
+</ul>
+<p><b>2011AUG17</b></p>
+<ul>
+
+<li>3.2.4 fs ‘chmod’ xml closing element typo in Example
corrected</li>
+</ul>
+<p><b>2011AUG12</b></p>
+<ul>
+
+<li>3.2.4 fs ‘move’ action characteristics updated, to allow for
consistent source and target paths and existing target path only if
directory</li>
+<li>18, Update the doc for user-retry of workflow action.</li>
+</ul>
+<p><b>2011FEB19</b></p>
+<ul>
+
+<li>10, Update the doc to rerun from the failed node.</li>
+</ul>
+<p><b>2010OCT31</b></p>
+<ul>
+
+<li>17, Added new section on Shared Libraries</li>
+</ul>
+<p><b>2010APR27</b></p>
+<ul>
+
+<li>3.2.3 Added new “arguments” tag to PIG actions</li>
+<li>3.2.5 SSH actions are deprecated in Oozie schema 0.1 and removed in Oozie
schema 0.2</li>
+<li>Appendix A, Added schema version 0.2</li>
+</ul>
+<p><b>2009OCT20</b></p>
+<ul>
+
+<li>Appendix A, updated XML schema</li>
+</ul>
+<p><b>2009SEP15</b></p>
+<ul>
+
+<li>3.2.6 Removing support for sub-workflow in a different Oozie instance
(removing the ‘oozie’ element)</li>
+</ul>
+<p><b>2009SEP07</b></p>
+<ul>
+
+<li>3.2.2.3 Added Map Reduce Pipes specifications.</li>
+<li>3.2.2.4 Map-Reduce Examples. Previously was 3.2.2.3.</li>
+</ul>
+<p><b>2009SEP02</b></p>
+<ul>
+
+<li>10 Added missing skip nodes property name.</li>
+<li>3.2.1.4 Reworded action recovery explanation.</li>
+</ul>
+<p><b>2009AUG26</b></p>
+<ul>
+
+<li>3.2.9 Added <tt>java</tt> action type</li>
+<li>3.1.4 Example uses EL constant to refer to counter group/name</li>
+</ul>
+<p><b>2009JUN09</b></p>
+<ul>
+
+<li>12.2.4 Added build version resource to admin end-point</li>
+<li>3.2.6 Added flag to propagate workflow configuration to sub-workflows</li>
+<li>10 Added behavior for workflow job parameters given in the rerun</li>
+<li>11.3.4 workflows info returns pagination information</li>
+</ul>
+<p><b>2009MAY18</b></p>
+<ul>
+
+<li>3.1.4 decision node, ‘default’ element, ‘name’
attribute changed to ‘to’</li>
+<li>3.1.5 fork node, ‘transition’ element changed to
‘start’, ‘to’ attribute change to
‘path’</li>
+<li>3.1.5 join node, ‘transition’ element remove, added
‘to’ attribute to ‘join’ element</li>
+<li>3.2.1.4 Rewording on action recovery section</li>
+<li>3.2.2 map-reduce action, added ‘job-tracker’ and
‘name-node’ elements, and ‘file’ and
‘archive’ elements</li>
+<li>3.2.2.1 map-reduce action, removed the ‘file’ and
‘archive’ elements from the ‘streaming’ element</li>
+<li>3.2.2.2 map-reduce action, reorganized streaming section</li>
+<li>3.2.3 pig action, removed information about implementation (SSH), changed
elements names</li>
+<li>3.2.4 fs action, removed ‘fs-uri’ and
‘user-name’ elements, file system URI is now specified in path,
user is propagated</li>
+<li>3.2.6 sub-workflow action, renamed elements ‘oozie-url’ to
‘oozie’ and ‘workflow-app’ to
‘app-path’</li>
+<li>4 Properties that are valid Java identifiers can be used as ${NAME}</li>
+<li>4.1 Renamed default properties file from ‘configuration.xml’
to ‘default-configuration.xml’</li>
+<li>4.2 Changes in EL Constants and Functions</li>
+<li>5 Updated notification behavior and tokens</li>
+<li>6 Changed user propagation behavior</li>
+<li>7 Changed application packaging from ZIP to HDFS directory</li>
+<li>Removed application lifecycle and self containment model sections</li>
+<li>10 Changed workflow job recovery, simplified recovery behavior</li>
+<li>11 Detailed Web Services API</li>
+<li>12 Updated Client API section</li>
+<li>15 Updated Action Executor API section</li>
+<li>Appendix A XML namespace updated to
‘uri:oozie:workflow:0.1’</li>
+<li>Appendix A Updated XML schema to changes in map-reduce/pig/fs/ssh
actions</li>
+<li>Appendix B Updated workflow example to schema changes</li>
+</ul>
+<p><b>2009MAR25</b></p>
+<ul>
+
+<li>Changing all references of HWS to Oozie (project name)</li>
+<li>Typos, XML Formatting</li>
+<li>XML Schema URI correction</li>
+</ul>
+<p><b>2009MAR09</b></p>
+<ul>
+
+<li>Changed <tt>CREATED</tt> job state to <tt>PREP</tt> to have same states as
Hadoop</li>
+<li>Renamed ‘hadoop-workflow’ element to
‘workflow-app’</li>
+<li>Decision syntax changed to be ‘switch/case’ with no
transition indirection</li>
+<li>Action nodes common root element ‘action’, with the action
type as sub-element (using a single built-in XML schema)</li>
+<li>Action nodes have 2 explicit transitions ‘ok to’ and
‘error to’ enforced by XML schema</li>
+<li>Renamed ‘fail’ action element to ‘kill’</li>
+<li>Renamed ‘hadoop’ action element to
‘map-reduce’</li>
+<li>Renamed ‘hdfs’ action element to ‘fs’</li>
+<li>Updated all XML snippets and examples</li>
+<li>Made user propagation simpler and consistent</li>
+<li>Added Oozie XML schema to Appendix A</li>
+<li>Added workflow example to Appendix B</li>
+</ul>
+<p><b>2009FEB22</b></p>
+<ul>
+
+<li>Opened <a class="externalLink"
href="https://issues.apache.org/jira/browse/HADOOP-5303">JIRA
HADOOP-5303</a></li>
+</ul>
+<p><b>2012DEC27</b></p>
+<ul>
+
+<li>Added information on dropping hcatalog table partitions in prepare
block</li>
+<li>Added hcatalog EL functions section</li>
+</ul></div>
+<div class="section">
+<h2><a name="a0_Definitions"></a>0 Definitions</h2>
+<p><b>Action:</b> An execution/computation task (a Map-Reduce job, a Pig job, a
shell command). It can also be referred to as a task or an ‘action
node’.</p>
+<p><b>Workflow:</b> A collection of actions arranged in a control dependency
DAG (Directed Acyclic Graph). “control dependency” from one
action to another means that the second action can’t run until the first
action has completed.</p>
+<p><b>Workflow Definition:</b> A programmatic description of a workflow that
can be executed.</p>
+<p><b>Workflow Definition Language:</b> The language used to define a Workflow
Definition.</p>
+<p><b>Workflow Job:</b> An executable instance of a workflow definition.</p>
+<p><b>Workflow Engine:</b> A system that executes workflow jobs. It can also
be referred to as a DAG engine.</p></div>
+<div class="section">
+<h2><a name="a1_Specification_Highlights"></a>1 Specification Highlights</h2>
+<p>A Workflow application is a DAG that coordinates the following types of
actions: Hadoop, Pig, and sub-workflows.</p>
+<p>Flow control operations within the workflow applications can be done using
decision, fork and join nodes. Cycles in workflows are not supported.</p>
+<p>Actions and decisions can be parameterized with job properties, action
output (i.e. Hadoop counters) and file information (file exists, file size,
etc.). Formal parameters are expressed in the workflow definition as
<tt>${VAR}</tt> variables.</p>
+<p>A Workflow application is a ZIP file that contains the workflow definition
(an XML file) and all the necessary files to run its actions: JAR files for
Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Pig
scripts, and other resource files.</p>
+<p>Before running a workflow job, the corresponding workflow application must
be deployed in Oozie.</p>
+<p>Deploying workflow applications and running workflow jobs can be done via
command line tools, a WS API and a Java API.</p>
+<p>Monitoring the system and workflow jobs can be done via a web console,
command line tools, a WS API and a Java API.</p>
+<p>When submitting a workflow job, a set of properties resolving all the
formal parameters in the workflow definitions must be provided. This set of
properties is a Hadoop configuration.</p>
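+<p>As a sketch, if a workflow definition uses the formal parameters
+<tt>${inputDir}</tt> and <tt>${outputDir}</tt> (made-up names for
+illustration), the properties provided at submission time must resolve
+both:</p>
+
+<div>
+<div>
+<pre class="source">inputDir=hdfs://bar:8020/usr/joe/input-data
+outputDir=hdfs://bar:8020/usr/joe/output-data
+</pre></div></div>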
+<p>Possible states for a workflow job are: <tt>PREP</tt>, <tt>RUNNING</tt>,
<tt>SUSPENDED</tt>, <tt>SUCCEEDED</tt>, <tt>KILLED</tt> and <tt>FAILED</tt>.</p>
+<p>In the case of an action start failure in a workflow job, depending on the
type of failure, Oozie will attempt an automatic retry, request a manual retry,
or fail the workflow job.</p>
+<p>Oozie can make HTTP callback notifications on action start/end/failure
events and workflow end/failure events.</p>
+<p>In the case of workflow job failure, the workflow job can be resubmitted
skipping previously completed actions. Before doing a resubmission, the
workflow application can be updated with a patch to fix a problem in the
workflow application code.</p>
+<p><a name="WorkflowDefinition"></a></p></div>
+<div class="section">
+<h2><a name="a2_Workflow_Definition"></a>2 Workflow Definition</h2>
+<p>A workflow definition is a DAG with control flow nodes (start, end,
decision, fork, join, kill) and action nodes (map-reduce, pig, etc.); nodes are
connected by transition arrows.</p>
+<p>The workflow definition language is XML based and it is called hPDL (Hadoop
Process Definition Language).</p>
+<p>Refer to Appendix A for the <a
href="WorkflowFunctionalSpec.html#OozieWFSchema">Oozie Workflow Definition XML
Schema</a>. Appendix B has <a
href="WorkflowFunctionalSpec.html#OozieWFExamples">Workflow Definition
Examples</a>.</p>
+<div class="section">
+<h3><a name="a2.1_Cycles_in_Workflow_Definitions"></a>2.1 Cycles in Workflow
Definitions</h3>
+<p>Oozie does not support cycles in workflow definitions; a workflow
definition must be a strict DAG.</p>
+<p>At workflow application deployment time, if Oozie detects a cycle in the
workflow definition it must fail the deployment.</p></div></div>
+<div class="section">
+<h2><a name="a3_Workflow_Nodes"></a>3 Workflow Nodes</h2>
+<p>Workflow nodes are classified into control flow nodes and action nodes:</p>
+<ul>
+
+<li><b>Control flow nodes:</b> nodes that control the start and end of the
workflow and workflow job execution path.</li>
+<li><b>Action nodes:</b> nodes that trigger the execution of a
computation/processing task.</li>
+</ul>
+<p>Node names and transitions must conform to the following pattern
<tt>[a-zA-Z][\-_a-zA-Z0-9]*</tt> and may be up to 20 characters long.</p>
+<div class="section">
+<h3><a name="a3.1_Control_Flow_Nodes"></a>3.1 Control Flow Nodes</h3>
+<p>Control flow nodes define the beginning and the end of a workflow (the
<tt>start</tt>, <tt>end</tt> and <tt>kill</tt> nodes) and provide a mechanism
to control the workflow execution path (the <tt>decision</tt>, <tt>fork</tt>
and <tt>join</tt> nodes).</p>
+<p><a name="StartNode"></a></p>
+<div class="section">
+<h4><a name="a3.1.1_Start_Control_Node"></a>3.1.1 Start Control Node</h4>
+<p>The <tt>start</tt> node is the entry point for a workflow job; it indicates
the first workflow node the workflow job must transition to.</p>
+<p>When a workflow is started, it automatically transitions to the node
specified in the <tt>start</tt> node.</p>
+<p>A workflow definition must have one <tt>start</tt> node.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <start to="[NODE-NAME]"/>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>to</tt> attribute is the name of the first workflow node to execute.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="foo-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <start to="firstHadoopJob"/>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="EndNode"></a></p></div>
+<div class="section">
+<h4><a name="a3.1.2_End_Control_Node"></a>3.1.2 End Control Node</h4>
+<p>The <tt>end</tt> node is the end of a workflow job; it indicates that the
workflow job has completed successfully.</p>
+<p>When a workflow job reaches the <tt>end</tt> node it finishes successfully
(SUCCEEDED).</p>
+<p>If one or more actions started by the workflow job are executing when the
<tt>end</tt> node is reached, the actions will be killed. In this scenario the
workflow job is still considered to have run successfully.</p>
+<p>A workflow definition must have one <tt>end</tt> node.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <end name="[NODE-NAME]"/>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>name</tt> attribute is the name of the transition to take to end
the workflow job.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="foo-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <end name="end"/>
+</workflow-app>
+</pre></div></div>
+
+<p><a name="KillNode"></a></p></div>
+<div class="section">
+<h4><a name="a3.1.3_Kill_Control_Node"></a>3.1.3 Kill Control Node</h4>
+<p>The <tt>kill</tt> node allows a workflow job to kill itself.</p>
+<p>When a workflow job reaches the <tt>kill</tt> node it finishes in error
(KILLED).</p>
+<p>If one or more actions started by the workflow job are executing when the
<tt>kill</tt> node is reached, the actions will be killed.</p>
+<p>A workflow definition may have zero or more <tt>kill</tt> nodes.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <kill name="[NODE-NAME]">
+ <message>[MESSAGE-TO-LOG]</message>
+ </kill>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>name</tt> attribute in the <tt>kill</tt> node is the name of the
kill node.</p>
+<p>The content of the <tt>message</tt> element will be logged as the kill
reason for the workflow job.</p>
+<p>A <tt>kill</tt> node does not have transition elements because it ends the
workflow job, as <tt>KILLED</tt>.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="foo-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <kill name="killBecauseNoInput">
+ <message>Input unavailable</message>
+ </kill>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="DecisionNode"></a></p></div>
+<div class="section">
+<h4><a name="a3.1.4_Decision_Control_Node"></a>3.1.4 Decision Control Node</h4>
+<p>A <tt>decision</tt> node enables a workflow to make a selection on the
execution path to follow.</p>
+<p>The behavior of a <tt>decision</tt> node can be seen as a switch-case
statement.</p>
+<p>A <tt>decision</tt> node consists of a list of predicate-transition pairs
plus a default transition. Predicates are evaluated in order of appearance
until one of them evaluates to <tt>true</tt>, and the corresponding transition
is taken. If none of the predicates evaluates to <tt>true</tt>, the
<tt>default</tt> transition is taken.</p>
+<p>Predicates are JSP Expression Language (EL) expressions (refer to section
4.2 of this document) that resolve into a boolean value, <tt>true</tt> or
<tt>false</tt>. For example:</p>
+
+<div>
+<div>
+<pre class="source"> ${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}
+</pre></div></div>
+
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <decision name="[NODE-NAME]">
+ <switch>
+ <case to="[NODE_NAME]">[PREDICATE]</case>
+ ...
+ <case to="[NODE_NAME]">[PREDICATE]</case>
+ <default to="[NODE_NAME]"/>
+ </switch>
+ </decision>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>name</tt> attribute in the <tt>decision</tt> node is the name of
the decision node.</p>
+<p>Each <tt>case</tt> element contains a predicate and a transition name. The
predicate ELs are evaluated in order until one returns <tt>true</tt> and the
corresponding transition is taken.</p>
+<p>The <tt>default</tt> element indicates the transition to take if none of
the predicates evaluates to <tt>true</tt>.</p>
+<p>All decision nodes must have a <tt>default</tt> element to avoid bringing
the workflow into an error state if none of the predicates evaluates to
true.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="foo-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <decision name="mydecision">
+ <switch>
+ <case to="reconsolidatejob">
+ ${fs:fileSize(secondjobOutputDir) gt 10 * GB}
+ </case>
+ <case to="rexpandjob">
+ ${fs:fileSize(secondjobOutputDir) lt 100 * MB}
+ </case>
+ <case to="recomputejob">
+ ${ hadoop:counters('secondjob')[RECORDS][REDUCE_OUT] lt 1000000 }
+ </case>
+ <default to="end"/>
+ </switch>
+ </decision>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="ForkJoinNodes"></a></p></div>
+<div class="section">
+<h4><a name="a3.1.5_Fork_and_Join_Control_Nodes"></a>3.1.5 Fork and Join
Control Nodes</h4>
+<p>A <tt>fork</tt> node splits one path of execution into multiple concurrent
paths of execution.</p>
+<p>A <tt>join</tt> node waits until every concurrent execution path of a
previous <tt>fork</tt> node arrives at it.</p>
+<p>The <tt>fork</tt> and <tt>join</tt> nodes must be used in pairs. The
<tt>join</tt> node assumes concurrent execution paths are children of the same
<tt>fork</tt> node.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <fork name="[FORK-NODE-NAME]">
+ <path start="[NODE-NAME]" />
+ ...
+ <path start="[NODE-NAME]" />
+ </fork>
+ ...
+ <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]" />
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>name</tt> attribute in the <tt>fork</tt> node is the name of the
workflow fork node. The <tt>start</tt> attribute in each <tt>path</tt> element
in the <tt>fork</tt> node indicates the name of a workflow node that will be
part of the concurrent execution paths.</p>
+<p>The <tt>name</tt> attribute in the <tt>join</tt> node is the name of the
workflow join node. The <tt>to</tt> attribute in the <tt>join</tt> node
indicates the name of the workflow node that will be executed after all
concurrent execution paths of the corresponding fork arrive at the join
node.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <fork name="forking">
+ <path start="firstparalleljob"/>
+ <path start="secondparalleljob"/>
+ </fork>
+ <action name="firstparalleljob">
+ <map-reduce>
+ <resource-manager>foo:8032</resource-manager>
+ <name-node>bar:8020</name-node>
+ <job-xml>job1.xml</job-xml>
+ </map-reduce>
+ <ok to="joining"/>
+ <error to="kill"/>
+ </action>
+ <action name="secondparalleljob">
+ <map-reduce>
+ <resource-manager>foo:8032</resource-manager>
+ <name-node>bar:8020</name-node>
+ <job-xml>job2.xml</job-xml>
+ </map-reduce>
+ <ok to="joining"/>
+ <error to="kill"/>
+ </action>
+ <join name="joining" to="nextaction"/>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>By default, Oozie performs some validation that any forking in a workflow
is valid and won’t lead to any incorrect behavior or instability.
However, if Oozie is preventing a workflow from being submitted and you are
very certain that it should work, you can disable forkjoin validation so that
Oozie will accept the workflow. To disable this validation just for a specific
workflow, set <tt>oozie.wf.validate.ForkJoin</tt> to <tt>false</tt> in the
job.properties file. To disable this validation for all workflows, set
<tt>oozie.validate.ForkJoin</tt> to <tt>false</tt> in the oozie-site.xml file.
Whether the validation runs is the AND of these two properties: it runs only if
both are <tt>true</tt> (or not specified), and is skipped if either or both are
set to <tt>false</tt>.</p>
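+<p>For example, to skip the validation for a single workflow submission, the
+property can be set in that job's <tt>job.properties</tt> file (the other
+properties shown are ordinary submission properties and are illustrative):</p>
+
+<div>
+<div>
+<pre class="source">nameNode=hdfs://bar:8020
+resourceManager=foo:8032
+oozie.wf.application.path=${nameNode}/user/${user.name}/my-wf
+# accept this workflow even if fork/join validation would reject it
+oozie.wf.validate.ForkJoin=false
+</pre></div></div>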
+<p><a name="ActionNodes"></a></p></div></div>
+<div class="section">
+<h3><a name="a3.2_Workflow_Action_Nodes"></a>3.2 Workflow Action Nodes</h3>
+<p>Action nodes are the mechanism by which a workflow triggers the execution
of a computation/processing task.</p>
+<div class="section">
+<h4><a name="a3.2.1_Action_Basis"></a>3.2.1 Action Basis</h4>
+<p>The following sub-sections define common behavior and capabilities for all
action types.</p>
+<div class="section">
+<h5><a
name="a3.2.1.1_Action_ComputationProcessing_Is_Always_Remote"></a>3.2.1.1
Action Computation/Processing Is Always Remote</h5>
+<p>All computation/processing tasks triggered by an action node are remote to
Oozie. No workflow application specific computation/processing task is executed
within Oozie.</p></div>
+<div class="section">
+<h5><a name="a3.2.1.2_Actions_Are_Asynchronous"></a>3.2.1.2 Actions Are
Asynchronous</h5>
+<p>All computation/processing tasks triggered by an action node are executed
asynchronously by Oozie. For most types of computation/processing tasks
triggered by a workflow action, the workflow job has to wait until the
computation/processing task completes before transitioning to the following
node in the workflow.</p>
+<p>The exception is the <tt>fs</tt> action, which is handled as a synchronous
action.</p>
+<p>Oozie can detect completion of computation/processing tasks by two
different means, callbacks and polling.</p>
+<p>When a computation/processing task is started by Oozie, Oozie provides a
unique callback URL to the task; the task should invoke the given URL to notify
its completion.</p>
+<p>For cases where the task fails to invoke the callback URL for any reason
(e.g. a transient network failure), or when the type of task cannot invoke the
callback URL upon completion, Oozie has a mechanism to poll
computation/processing tasks for completion.</p></div>
+<div class="section">
+<h5><a name="a3.2.1.3_Actions_Have_2_Transitions_ok_and_error"></a>3.2.1.3
Actions Have 2 Transitions, <tt>ok</tt> and <tt>error</tt></h5>
+<p>If a computation/processing task -triggered by a workflow- completes
successfully, it transitions to <tt>ok</tt>.</p>
+<p>If a computation/processing task -triggered by a workflow- fails to
complete successfully, it transitions to <tt>error</tt>.</p>
+<p>If a computation/processing task exits in error, the task must provide
<tt>error-code</tt> and <tt>error-message</tt> information to Oozie. This
information can be used from <tt>decision</tt> nodes to implement fine-grained
error handling at the workflow application level.</p>
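+<p>For instance, a workflow can combine the <tt>error</tt> transition with a
+<tt>decision</tt> node to route on the error code of a failed action. The
+sketch below is illustrative: the node names and the error code
+<tt>JA009</tt> are made-up placeholders, while <tt>wf:errorCode()</tt> is the
+workflow EL function that returns the error code of the given action node:</p>
+
+<div>
+<div>
+<pre class="source">    <action name="myjob">
+        ...
+        <ok to="nextjob"/>
+        <error to="check-error"/>
+    </action>
+    <decision name="check-error">
+        <switch>
+            <case to="notify-admin">${wf:errorCode("myjob") eq "JA009"}</case>
+            <default to="fail"/>
+        </switch>
+    </decision>
+</pre></div></div>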
+<p>Each action type must clearly define all the error codes it can
produce.</p></div>
+<div class="section">
+<h5><a name="a3.2.1.4_Action_Recovery"></a>3.2.1.4 Action Recovery</h5>
+<p>Oozie provides recovery capabilities when starting or ending actions.</p>
+<p>Once an action starts successfully, Oozie will not retry starting the
action if the action fails during its execution. The assumption is that the
external system (e.g. Hadoop) executing the action has enough resilience to
recover jobs once they have started (e.g. Hadoop task retries).</p>
+<p>Java actions are a special case with regard to retries. Although Oozie
itself does not retry Java actions should they fail after they have
successfully started, Hadoop itself can cause the action to be restarted due to
a map task retry on the map task running the Java application. See the Java
Action section below for more detail.</p>
+<p>For failures that occur prior to the start of the job, Oozie will have
different recovery strategies depending on the nature of the failure.</p>
+<p>If the failure is of a transient nature, Oozie will perform retries after a
pre-defined time interval. The number of retries and the time interval for a
type of action must be pre-configured at the Oozie level. Workflow jobs can
override such configuration.</p>
+<p>Examples of transient failures are network problems or a remote system
being temporarily unavailable.</p>
+<p>If the failure is of a non-transient nature, Oozie will suspend the
workflow job until a manual or programmatic intervention resumes the workflow
job and the action start or end is retried. It is the responsibility of an
administrator or an external managing system to perform any necessary cleanup
before resuming the workflow job.</p>
+<p>If the failure is an error and a retry will not resolve the problem, Oozie
will perform the error transition for the action.</p>
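+<p>As a sketch of the per-job override mentioned above, the user-retry
+attributes <tt>retry-max</tt> and <tt>retry-interval</tt> can be set on an
+<tt>action</tt> element; the values below are illustrative, and attribute
+support depends on the workflow schema version in use:</p>
+
+<div>
+<div>
+<pre class="source">    <action name="myjob" retry-max="3" retry-interval="1">
+        ...
+    </action>
+</pre></div></div>
+
+<p>With this configuration, a transient failure of <tt>myjob</tt> would be
+retried up to 3 times, waiting 1 minute between attempts.</p>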
+<p><a name="MapReduceAction"></a></p></div></div>
+<div class="section">
+<h4><a name="a3.2.2_Map-Reduce_Action"></a>3.2.2 Map-Reduce Action</h4>
+<p>The <tt>map-reduce</tt> action starts a Hadoop map/reduce job from a
workflow. Hadoop jobs can be Java Map/Reduce jobs or streaming jobs.</p>
+<p>A <tt>map-reduce</tt> action can be configured to perform file system
cleanup and directory creation before starting the map reduce job. This
capability enables Oozie to retry a Hadoop job in the situation of a transient
failure (Hadoop checks the non-existence of the job output directory and then
creates it when the Hadoop job is starting, thus a retry without cleanup of the
job output directory would fail).</p>
+<p>The workflow job will wait until the Hadoop map/reduce job completes before
continuing to the next action in the workflow execution path.</p>
+<p>The counters of the Hadoop job and the job exit status (<tt>FAILED</tt>,
<tt>KILLED</tt> or <tt>SUCCEEDED</tt>) must be available to the workflow job
after the Hadoop job ends. This information can be used from within decision
nodes and other action configurations.</p>
+<p>The <tt>map-reduce</tt> action has to be configured with all the necessary
Hadoop JobConf properties to run the Hadoop map/reduce job.</p>
+<p>Hadoop JobConf properties can be specified as part of</p>
+<ul>
+
+<li>the <tt>config-default.xml</tt> or</li>
+<li>JobConf XML file bundled with the workflow application or</li>
+<li><global> tag in workflow definition or</li>
+<li>Inline <tt>map-reduce</tt> action configuration or</li>
+<li>An implementation of OozieActionConfigurator specified by the
<config-class> tag in workflow definition.</li>
+</ul>
+<p>The configuration properties are loaded in the above order, i.e.
<tt>streaming</tt>, <tt>job-xml</tt>, <tt>configuration</tt>, and
<tt>config-class</tt>; later values override earlier values.</p>
+<p>Streaming and inline property values can be parameterized (templatized)
using EL expressions.</p>
+<p>The Hadoop <tt>mapred.job.tracker</tt> and <tt>fs.default.name</tt>
properties must not be present in the job-xml and inline configuration.</p>
+<p><a name="FilesArchives"></a></p>
+<div class="section">
+<h5><a name="a3.2.2.1_Adding_Files_and_Archives_for_the_Job"></a>3.2.2.1
Adding Files and Archives for the Job</h5>
+<p>The <tt>file</tt> and <tt>archive</tt> elements make files and archives
available to map-reduce jobs. If the specified path is relative, the file or
archive is assumed to be within the application directory, in the corresponding
sub-path. If the path is absolute, the file or archive is expected at the given
absolute path.</p>
+<p>Files specified with the <tt>file</tt> element, will be symbolic links in
the home directory of the task.</p>
+<p>If a file is a native library (an ‘.so’ or a
‘.so.#’ file), it will be symlinked as an ‘.so’
file in the task running directory, thus available to the task JVM.</p>
+<p>To force a symlink for a file on the task running directory, use a
‘#’ followed by the symlink name. For example
‘mycat.sh#cat’.</p>
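+<p>A minimal sketch of both cases (the paths below are illustrative): a
+relative <tt>file</tt> path resolved against the application directory, with a
+forced symlink name, and an absolute <tt>archive</tt> path:</p>
+
+<div>
+<div>
+<pre class="source">    <map-reduce>
+        ...
+        <!-- relative to the application directory; symlinked as 'cat' -->
+        <file>scripts/mycat.sh#cat</file>
+        <!-- absolute path on HDFS -->
+        <archive>/user/tucu/mylib.jar</archive>
+    </map-reduce>
+</pre></div></div>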
+<p>Refer to the Hadoop distributed cache documentation for more details on
files and archives.</p></div>
+<div class="section">
+<h5><a
name="a3.2.2.2_Configuring_the_MapReduce_action_with_Java_code"></a>3.2.2.2
Configuring the MapReduce action with Java code</h5>
+<p>Java code can be used to further configure the MapReduce action. This can
be useful if you already have “driver” code for your MapReduce
action, if you’re more familiar with MapReduce’s Java API, if
there’s some configuration that requires logic, or some configuration
that’s difficult to do in straight XML (e.g. Avro).</p>
+<p>Create a class that implements the
org.apache.oozie.action.hadoop.OozieActionConfigurator interface from the
“oozie-sharelib-oozie” artifact. It contains a single method
that receives a <tt>JobConf</tt> as an argument. Any configuration properties
set on this <tt>JobConf</tt> will be used by the MapReduce action.</p>
+<p>The OozieActionConfigurator has this signature:</p>
+
+<div>
+<div>
+<pre class="source">public interface OozieActionConfigurator {
+ public void configure(JobConf actionConf) throws
OozieActionConfiguratorException;
+}
+</pre></div></div>
+
+<p>where <tt>actionConf</tt> is the <tt>JobConf</tt> you can update. If you
need to throw an Exception, you can wrap it in an
<tt>OozieActionConfiguratorException</tt>, also in the
“oozie-sharelib-oozie” artifact.</p>
+<p>For example:</p>
+
+<div>
+<div>
+<pre class="source">package com.example;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.oozie.action.hadoop.OozieActionConfigurator;
+import org.apache.oozie.action.hadoop.OozieActionConfiguratorException;
+import org.apache.oozie.example.SampleMapper;
+import org.apache.oozie.example.SampleReducer;
+
+public class MyConfigClass implements OozieActionConfigurator {
+
+ @Override
+ public void configure(JobConf actionConf) throws
OozieActionConfiguratorException {
+ if (actionConf.getUser() == null) {
+ throw new OozieActionConfiguratorException("No user
set");
+ }
+ actionConf.setMapperClass(SampleMapper.class);
+ actionConf.setReducerClass(SampleReducer.class);
+ FileInputFormat.setInputPaths(actionConf, new Path("/user/"
+ actionConf.getUser() + "/input-data"));
+ FileOutputFormat.setOutputPath(actionConf, new Path("/user/"
+ actionConf.getUser() + "/output"));
+ ...
+ }
+}
+</pre></div></div>
+
+<p>To use your config class in your MapReduce action, simply compile it into a
jar, make the jar available to your action, and specify the class name in the
<tt>config-class</tt> element (this requires at least schema 0.5):</p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <map-reduce>
+ ...
+ <job-xml>[JOB-XML-FILE]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <config-class>com.example.MyConfigClass</config-class>
+ ...
+ </map-reduce>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>Another example of this can be found in the “map-reduce”
example that comes with Oozie.</p>
+<p>A useful tip: The initial <tt>JobConf</tt> passed to the <tt>configure</tt>
method includes all of the properties listed in the <tt>configuration</tt>
section of the MR action in a workflow. If you need to pass any information to
your OozieActionConfigurator, you can simply put it there.</p>
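+<p>For example (the property name and value here are illustrative), a value
+placed in the action’s <tt>configuration</tt> section can be read back
+inside the <tt>configure</tt> method with
+<tt>actionConf.get("com.example.myParam")</tt>:</p>
+
+<div>
+<div>
+<pre class="source">    <configuration>
+        <property>
+            <name>com.example.myParam</name>
+            <value>someValue</value>
+        </property>
+    </configuration>
+    <config-class>com.example.MyConfigClass</config-class>
+</pre></div></div>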
+<p><a name="StreamingMapReduceAction"></a></p></div>
+<div class="section">
+<h5><a name="a3.2.2.3_Streaming"></a>3.2.2.3 Streaming</h5>
+<p>Streaming information can be specified in the <tt>streaming</tt>
element.</p>
+<p>The <tt>mapper</tt> and <tt>reducer</tt> elements are used to specify the
executable/script to be used as mapper and reducer.</p>
+<p>User defined scripts must be bundled with the workflow application and they
must be declared in the <tt>files</tt> element of the streaming configuration.
If they are not declared in the <tt>files</tt> element of the configuration, it
is assumed they will be available (and in the command PATH) on the Hadoop slave
machines.</p>
+<p>Some streaming jobs require files found on HDFS to be available to the
mapper/reducer scripts. This is done using the <tt>file</tt> and
<tt>archive</tt> elements described in the previous section.</p>
+<p>The Mapper/Reducer can be overridden by the <tt>mapred.mapper.class</tt> or
<tt>mapred.reducer.class</tt> properties in the <tt>job-xml</tt> file or
<tt>configuration</tt> elements.</p>
+<p><a name="PipesMapReduceAction"></a></p></div>
+<div class="section">
+<h5><a name="a3.2.2.4_Pipes"></a>3.2.2.4 Pipes</h5>
+<p>Pipes information can be specified in the <tt>pipes</tt> element.</p>
+<p>A subset of the command line options available with the Hadoop Pipes
Submitter can be specified via the elements <tt>map</tt>,
<tt>reduce</tt>, <tt>inputformat</tt>, <tt>partitioner</tt>, <tt>writer</tt>
and <tt>program</tt>.</p>
+<p>The <tt>program</tt> element is used to specify the executable/script to be
used.</p>
+<p>The user defined program must be bundled with the workflow application.</p>
+<p>Some pipes jobs require files found on HDFS to be available to the
mapper/reducer scripts. This is done using the <tt>file</tt> and
<tt>archive</tt> elements described in the previous section.</p>
+<p>Pipe properties can be overridden by specifying them in the
<tt>job-xml</tt> file or <tt>configuration</tt> element.</p></div>
+<div class="section">
+<h5><a name="a3.2.2.5_Syntax"></a>3.2.2.5 Syntax</h5>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <map-reduce>
+ <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+ <name-node>[NAME-NODE]</name-node>
+ <prepare>
+ <delete path="[PATH]"/>
+ ...
+ <mkdir path="[PATH]"/>
+ ...
+ </prepare>
+ <streaming>
+ <mapper>[MAPPER-PROCESS]</mapper>
+ <reducer>[REDUCER-PROCESS]</reducer>
+
<record-reader>[RECORD-READER-CLASS]</record-reader>
+
<record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
+ ...
+ <env>[NAME=VALUE]</env>
+ ...
+ </streaming>
+ <!-- Either streaming or pipes can be specified for
an action, not both -->
+ <pipes>
+ <map>[MAPPER]</map>
+                    <reduce>[REDUCER]</reduce>
+ <inputformat>[INPUTFORMAT]</inputformat>
+ <partitioner>[PARTITIONER]</partitioner>
+ <writer>[OUTPUTFORMAT]</writer>
+ <program>[EXECUTABLE]</program>
+ </pipes>
+ <job-xml>[JOB-XML-FILE]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <config-class>com.example.MyConfigClass</config-class>
+ <file>[FILE-PATH]</file>
+ ...
+ <archive>[FILE-PATH]</archive>
+ ...
+ </map-reduce>
+
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to
delete before starting the job. This should be used exclusively for directory
cleanup or dropping of hcatalog tables or table partitions for the job to be
executed. The delete operation will be performed in the
<tt>fs.default.name</tt> filesystem for hdfs URIs. The format for specifying a
hcatalog table URI is <tt>hcat://[metastore server]:[port]/[database
name]/[table name]</tt> and the format for specifying a hcatalog table
partition URI is <tt>hcat://[metastore server]:[port]/[database name]/[table
name]/[partkey1]=[value];[partkey2]=[value]</tt>. In case of a hcatalog URI,
the hive-site.xml needs to be shipped using the <tt>file</tt> tag and the
hcatalog and hive jars need to be placed in the workflow lib directory or
specified using the <tt>archive</tt> tag.</p>
+<p>The <tt>job-xml</tt> element, if present, must refer to a Hadoop JobConf
<tt>job.xml</tt> file bundled in the workflow application. By default the
<tt>job.xml</tt> file is taken from the workflow application namenode,
regardless of the namenode specified for the action. To specify a
<tt>job.xml</tt> on another namenode, use a fully qualified file path. The
<tt>job-xml</tt> element is optional and, as of schema 0.4, multiple
<tt>job-xml</tt> elements are allowed in order to specify multiple Hadoop
JobConf <tt>job.xml</tt> files.</p>
+<p>The <tt>configuration</tt> element, if present, contains JobConf properties
for the Hadoop job.</p>
+<p>Properties specified in the <tt>configuration</tt> element override
properties specified in the file specified in the <tt>job-xml</tt> element.</p>
+<p>As of schema 0.5, the <tt>config-class</tt> element, if present, contains a
class that implements OozieActionConfigurator that can be used to further
configure the MapReduce job.</p>
+<p>Properties specified in the <tt>config-class</tt> class override properties
specified in <tt>configuration</tt> element.</p>
+<p>External Stats can be turned on/off by specifying the property
<i>oozie.action.external.stats.write</i> as <i>true</i> or <i>false</i> in the
configuration element of workflow.xml. The default value for this property is
<i>false</i>.</p>
+<p>The <tt>file</tt> element, if present, must specify the target symbolic
link for binaries by separating the original file and target with a #
(file#target-sym-link). This is not required for libraries.</p>
+<p>The <tt>mapper</tt> and <tt>reducer</tt> processes for streaming jobs
should specify the executable command with URL encoding, e.g. ‘%’
should be replaced by ‘%25’.</p>
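+<p>For instance, a streaming mapper command that contains a literal
+‘%’ character would be written with ‘%25’ in its
+place (the command itself is illustrative):</p>
+
+<div>
+<div>
+<pre class="source">    <streaming>
+        <!-- the command actually executed is: /bin/sed s/%/pct/g -->
+        <mapper>/bin/sed s/%25/pct/g</mapper>
+    </streaming>
+</pre></div></div>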
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="foo-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="myfirstHadoopJob">
+ <map-reduce>
+ <resource-manager>foo:8032</resource-manager>
+ <name-node>bar:8020</name-node>
+ <prepare>
+ <delete
path="hdfs://foo:8020/usr/tucu/output-data"/>
+ </prepare>
+ <job-xml>/myfirstjob.xml</job-xml>
+ <configuration>
+ <property>
+ <name>mapred.input.dir</name>
+ <value>/usr/tucu/input-data</value>
+ </property>
+ <property>
+ <name>mapred.output.dir</name>
+                        <value>/usr/tucu/output-data</value>
+ </property>
+ <property>
+ <name>mapred.reduce.tasks</name>
+ <value>${firstJobReducers}</value>
+ </property>
+ <property>
+ <name>oozie.action.external.stats.write</name>
+ <value>true</value>
+ </property>
+ </configuration>
+ </map-reduce>
+ <ok to="myNextAction"/>
+ <error to="errorCleanup"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>In the above example, the number of Reducers to be used by the Map/Reduce
job has to be specified as a parameter of the workflow job configuration when
creating the workflow job.</p>
+<p><b>Streaming Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="firstjob">
+ <map-reduce>
+ <resource-manager>foo:8032</resource-manager>
+ <name-node>bar:8020</name-node>
+ <prepare>
+ <delete path="${output}"/>
+ </prepare>
+ <streaming>
+ <mapper>/bin/bash testarchive/bin/mapper.sh
testfile</mapper>
+ <reducer>/bin/bash
testarchive/bin/reducer.sh</reducer>
+ </streaming>
+ <configuration>
+ <property>
+ <name>mapred.input.dir</name>
+ <value>${input}</value>
+ </property>
+ <property>
+ <name>mapred.output.dir</name>
+ <value>${output}</value>
+ </property>
+ <property>
+ <name>stream.num.map.output.key.fields</name>
+ <value>3</value>
+ </property>
+ </configuration>
+ <file>/users/blabla/testfile.sh#testfile</file>
+
<archive>/users/blabla/testarchive.jar#testarchive</archive>
+ </map-reduce>
+ <ok to="end"/>
+ <error to="kill"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><b>Pipes Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="firstjob">
+ <map-reduce>
+ <resource-manager>foo:8032</resource-manager>
+ <name-node>bar:8020</name-node>
+ <prepare>
+ <delete path="${output}"/>
+ </prepare>
+ <pipes>
+
<program>bin/wordcount-simple#wordcount-simple</program>
+ </pipes>
+ <configuration>
+ <property>
+ <name>mapred.input.dir</name>
+ <value>${input}</value>
+ </property>
+ <property>
+ <name>mapred.output.dir</name>
+ <value>${output}</value>
+ </property>
+ </configuration>
+
<archive>/users/blabla/testarchive.jar#testarchive</archive>
+ </map-reduce>
+ <ok to="end"/>
+ <error to="kill"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="PigAction"></a></p></div></div>
+<div class="section">
+<h4><a name="a3.2.3_Pig_Action"></a>3.2.3 Pig Action</h4>
+<p>The <tt>pig</tt> action starts a Pig job.</p>
+<p>The workflow job will wait until the pig job completes before continuing to
the next action.</p>
+<p>The <tt>pig</tt> action has to be configured with the resource-manager,
name-node, pig script and the necessary parameters and configuration to run the
Pig job.</p>
+<p>A <tt>pig</tt> action can be configured to perform HDFS files/directories
cleanup or HCatalog partitions cleanup before starting the Pig job. This
capability enables Oozie to retry a Pig job in the situation of a transient
failure (Pig creates temporary directories for intermediate data, thus a retry
without cleanup would fail).</p>
+<p>Hadoop JobConf properties can be specified as part of</p>
+<ul>
+
+<li>the <tt>config-default.xml</tt> or</li>
+<li>JobConf XML file bundled with the workflow application or</li>
+<li><global> tag in workflow definition or</li>
+<li>Inline <tt>pig</tt> action configuration.</li>
+</ul>
+<p>The configuration properties are loaded in the above order, i.e.
<tt>job-xml</tt> and <tt>configuration</tt>; later values override earlier
values.</p>
+<p>Inline property values can be parameterized (templatized) using EL
expressions.</p>
+<p>The YARN <tt>yarn.resourcemanager.address</tt> and HDFS
<tt>fs.default.name</tt> properties must not be present in the job-xml and
inline configuration.</p>
+<p>As with Hadoop map-reduce jobs, it is possible to add files and archives
to be available to the Pig job; refer to the <a
href="#FilesArchives">Adding Files and Archives for the Job</a> section.</p>
+<p><b>Syntax for Pig actions in Oozie schema 1.0:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <pig>
+ <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+ <name-node>[NAME-NODE]</name-node>
+ <prepare>
+ <delete path="[PATH]"/>
+ ...
+ <mkdir path="[PATH]"/>
+ ...
+ </prepare>
+ <job-xml>[JOB-XML-FILE]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <script>[PIG-SCRIPT]</script>
+ <param>[PARAM-VALUE]</param>
+ ...
+ <param>[PARAM-VALUE]</param>
+ <argument>[ARGUMENT-VALUE]</argument>
+ ...
+ <argument>[ARGUMENT-VALUE]</argument>
+ <file>[FILE-PATH]</file>
+ ...
+ <archive>[FILE-PATH]</archive>
+ ...
+ </pig>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><b>Syntax for Pig actions in Oozie schema 0.2:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:0.2">
+ ...
+ <action name="[NODE-NAME]">
+ <pig>
+ <job-tracker>[JOB-TRACKER]</job-tracker>
+ <name-node>[NAME-NODE]</name-node>
+ <prepare>
+ <delete path="[PATH]"/>
+ ...
+ <mkdir path="[PATH]"/>
+ ...
+ </prepare>
+ <job-xml>[JOB-XML-FILE]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <script>[PIG-SCRIPT]</script>
+ <param>[PARAM-VALUE]</param>
+ ...
+ <param>[PARAM-VALUE]</param>
+ <argument>[ARGUMENT-VALUE]</argument>
+ ...
+ <argument>[ARGUMENT-VALUE]</argument>
+ <file>[FILE-PATH]</file>
+ ...
+ <archive>[FILE-PATH]</archive>
+ ...
+ </pig>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><b>Syntax for Pig actions in Oozie schema 0.1:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:0.1">
+ ...
+ <action name="[NODE-NAME]">
+ <pig>
+ <job-tracker>[JOB-TRACKER]</job-tracker>
+ <name-node>[NAME-NODE]</name-node>
+ <prepare>
+ <delete path="[PATH]"/>
+ ...
+ <mkdir path="[PATH]"/>
+ ...
+ </prepare>
+ <job-xml>[JOB-XML-FILE]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <script>[PIG-SCRIPT]</script>
+ <param>[PARAM-VALUE]</param>
+ ...
+ <param>[PARAM-VALUE]</param>
+ <file>[FILE-PATH]</file>
+ ...
+ <archive>[FILE-PATH]</archive>
+ ...
+ </pig>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to
delete before starting the job. This should be used exclusively for directory
cleanup or dropping of hcatalog tables or table partitions for the job to be
executed. The delete operation will be performed in the
<tt>fs.default.name</tt> filesystem for hdfs URIs. The format for specifying a
hcatalog table URI is <tt>hcat://[metastore server]:[port]/[database
name]/[table name]</tt> and the format for specifying a hcatalog table
partition URI is <tt>hcat://[metastore server]:[port]/[database name]/[table
name]/[partkey1]=[value];[partkey2]=[value]</tt>. In case of a hcatalog URI,
the hive-site.xml needs to be shipped using the <tt>file</tt> tag and the
hcatalog and hive jars need to be placed in the workflow lib directory or
specified using the <tt>archive</tt> tag.</p>
+<p>The <tt>job-xml</tt> element, if present, must refer to a Hadoop JobConf
<tt>job.xml</tt> file bundled in the workflow application. The <tt>job-xml</tt>
element is optional and as of schema 0.4, multiple <tt>job-xml</tt> elements
are allowed in order to specify multiple Hadoop JobConf <tt>job.xml</tt>
files.</p>
+<p>The <tt>configuration</tt> element, if present, contains JobConf properties
for the underlying Hadoop jobs.</p>
+<p>Properties specified in the <tt>configuration</tt> element override
properties specified in the file specified in the <tt>job-xml</tt> element.</p>
+<p>External Stats can be turned on/off by specifying the property
<i>oozie.action.external.stats.write</i> as <i>true</i> or <i>false</i> in the
configuration element of workflow.xml. The default value for this property is
<i>false</i>.</p>
+<p>The inline and job-xml configuration properties are passed to the Hadoop
jobs submitted by the Pig runtime.</p>
+<p>The <tt>script</tt> element contains the pig script to execute. The pig
script can be templatized with variables of the form <tt>${VARIABLE}</tt>. The
values of these variables can then be specified using the <tt>params</tt>
element.</p>
+<p>NOTE: Oozie will perform the parameter substitution before firing the pig
job. This is different from the <a class="externalLink"
href="http://wiki.apache.org/pig/ParameterSubstitution">parameter substitution
mechanism provided by Pig</a>, which has a few limitations.</p>
+<p>The <tt>params</tt> element, if present, contains parameters to be passed
to the pig script.</p>
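+<p>For example, a pig script containing a templatized line such as the
+following (the script contents and paths are illustrative):</p>
+
+<div>
+<div>
+<pre class="source">A = LOAD '${INPUT}' USING PigStorage(',');
+</pre></div></div>
+
+<p>would have <tt>${INPUT}</tt> substituted by Oozie before firing the pig
+job, with the value supplied through a <tt>param</tt> element such as
+<tt><param>INPUT=/user/tucu/input-data</param></tt>.</p>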
+<p><b>In Oozie schema 0.2:</b> The <tt>arguments</tt> element, if present,
contains arguments to be passed to the pig script.</p>
+<p>All the above elements can be parameterized (templatized) using EL
expressions.</p>
+<p><b>Example for Oozie schema 0.2:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.2">
+ ...
+ <action name="myfirstpigjob">
+ <pig>
+ <job-tracker>foo:8021</job-tracker>
+ <name-node>bar:8020</name-node>
+ <prepare>
+ <delete path="${jobOutput}"/>
+ </prepare>
+ <configuration>
+ <property>
+ <name>mapred.compress.map.output</name>
+ <value>true</value>
+ </property>
+ <property>
+ <name>oozie.action.external.stats.write</name>
+ <value>true</value>
+ </property>
+ </configuration>
+ <script>/mypigscript.pig</script>
+ <argument>-param</argument>
+ <argument>INPUT=${inputDir}</argument>
+ <argument>-param</argument>
+ <argument>OUTPUT=${outputDir}/pig-output3</argument>
+ </pig>
+ <ok to="myotherjob"/>
+ <error to="errorcleanup"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><b>Example for Oozie schema 0.1:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.1">
+ ...
+ <action name="myfirstpigjob">
+ <pig>
+ <job-tracker>foo:8021</job-tracker>
+ <name-node>bar:8020</name-node>
+ <prepare>
+ <delete path="${jobOutput}"/>
+ </prepare>
+ <configuration>
+ <property>
+ <name>mapred.compress.map.output</name>
+ <value>true</value>
+ </property>
+ </configuration>
+ <script>/mypigscript.pig</script>
+ <param>InputDir=/home/tucu/input-data</param>
+ <param>OutputDir=${jobOutput}</param>
+ </pig>
+ <ok to="myotherjob"/>
+ <error to="errorcleanup"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="FsAction"></a></p></div>
+<div class="section">
+<h4><a name="a3.2.4_Fs_HDFS_action"></a>3.2.4 Fs (HDFS) action</h4>
+<p>The <tt>fs</tt> action allows manipulation of files and directories in
HDFS from a workflow application. The supported commands are <tt>move</tt>,
<tt>delete</tt>, <tt>mkdir</tt>, <tt>chmod</tt>, <tt>touchz</tt>,
<tt>setrep</tt> and <tt>chgrp</tt>.</p>
+<p>The FS commands are executed synchronously from within the FS action; the
workflow job will wait until the specified file commands are completed before
continuing to the next action.</p>
+<p>Path names specified in the <tt>fs</tt> action can be parameterized
(templatized) using EL expressions. Path names should be specified as absolute
paths. In the case of the <tt>move</tt>, <tt>delete</tt>, <tt>chmod</tt> and
<tt>chgrp</tt> commands, a glob pattern can also be specified instead of an
absolute path. For <tt>move</tt>, a glob pattern can only be specified for the
source path, not the target.</p>
+<p>Each file path must specify the file system URI; for move operations, the
target must not specify the system URI.</p>
+<p><b>IMPORTANT:</b> For copying files within a cluster, it is recommended to
use the <a href="DG_DistCpActionExtension.html"><tt>distcp</tt></a> action
instead.</p>
+<p><b>IMPORTANT:</b> The commands within an <tt>fs</tt> action do not execute
atomically; if an <tt>fs</tt> action fails halfway through the commands being
executed, the successfully executed commands are not rolled back. Before
executing any command, the <tt>fs</tt> action must check that source paths
exist and target paths don’t exist (the constraint regarding targets is
relaxed for the <tt>move</tt> action; see below for details), thus failing
before executing any command. The validity of all paths specified in one
<tt>fs</tt> action is therefore evaluated before any of the file operations
are executed, so there is less chance of an error occurring while the
<tt>fs</tt> action executes.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <fs>
+ <delete path='[PATH]' skip-trash='[true/false]'/>
+ ...
+ <mkdir path='[PATH]'/>
+ ...
+ <move source='[SOURCE-PATH]' target='[TARGET-PATH]'/>
+ ...
+ <chmod path='[PATH]' permissions='[PERMISSIONS]'
dir-files='false' />
+ ...
+ <touchz path='[PATH]' />
+ ...
+ <chgrp path='[PATH]' group='[GROUP]' dir-files='false' />
+ ...
+ <setrep path='[PATH]' replication-factor='2'/>
+ </fs>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>delete</tt> command deletes the specified path; if it is a
directory, it recursively deletes all its content and then deletes the
directory. By default it skips the trash; the content can be moved to the
trash instead by setting the value of <tt>skip-trash</tt> to
‘false’. It can also be used to drop hcat tables/partitions; this
is the only FS command which supports HCatalog URIs as well. For example:</p>
+
+<div>
+<div>
+<pre class="source"><delete path='hcat://[metastore
server]:[port]/[database name]/[table name]'/>
+OR
+<delete path='hcat://[metastore server]:[port]/[database name]/[table
name]/[partkey1]=[value];[partkey2]=[value];...'/>
+</pre></div></div>
+
+<p>The <tt>mkdir</tt> command creates the specified directory; it creates all
missing directories in the path. If the directory already exists, it is a
no-op.</p>
+<p>In the <tt>move</tt> command the <tt>source</tt> path must exist. The
following scenarios are addressed for a <tt>move</tt>:</p>
+<ul>
+
+<li>The file system URI (e.g. <tt>hdfs://{nameNode}</tt>) can be skipped in
the <tt>target</tt> path. It is understood to be the same as that of the
source. But if the target path does contain the system URI, it cannot be
different from that of the source.</li>
+<li>The parent directory of the <tt>target</tt> path must exist</li>
+<li>For the <tt>target</tt> path, if it is a file, then it must not already
exist.</li>
+<li>However, if the <tt>target</tt> path is an already existing directory, the
<tt>move</tt> action will place your <tt>source</tt> as a child of the
<tt>target</tt> directory.</li>
+</ul>
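+<p>A minimal sketch of a <tt>move</tt> into an existing directory (the paths
+are illustrative): the target omits the file system URI, so it is resolved
+against the filesystem of the source, and since the target is an existing
+directory the source is placed as a child of it:</p>
+
+<div>
+<div>
+<pre class="source">    <fs>
+        <move source='hdfs://foo:8020/user/tucu/data/file1'
+              target='/user/tucu/archive'/>
+    </fs>
+</pre></div></div>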
+<p>The <tt>chmod</tt> command changes the permissions for the specified path.
Permissions can be specified using the Unix Symbolic representation (e.g.
-rwxrw-rw-) or an octal representation (755). When doing a <tt>chmod</tt>
command on a directory, by default the command is applied to the directory and
the files one level within the directory. To apply the <tt>chmod</tt> command
to the directory, without affecting the files within it, the <tt>dir-files</tt>
attribute must be set to <tt>false</tt>. To apply the <tt>chmod</tt> command
recursively to all levels within a directory, put a <tt>recursive</tt> element
inside the <chmod> element.</p>
+<p>The <tt>touchz</tt> command creates a zero length file in the specified
path if none exists. If one already exists, then touchz will perform a touch
operation. Touchz works only for absolute paths.</p>
+<p>The <tt>chgrp</tt> command changes the group for the specified path. When running a <tt>chgrp</tt> command on a directory, by default the command is applied to the directory and the files one level within it. To apply the <tt>chgrp</tt> command to the directory only, without affecting the files within it, the <tt>dir-files</tt> attribute must be set to <tt>false</tt>. To apply the <tt>chgrp</tt> command recursively to all levels within a directory, put a <tt>recursive</tt> element inside the <tt>chgrp</tt> element.</p>
+<p>The <tt>setrep</tt> command changes the replication factor of HDFS files. Changing the replication factor of directories or symlinks is not supported; this command requires a replication factor argument.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="hdfscommands">
+ <fs>
+ <delete path='hdfs://foo:8020/usr/tucu/temp-data'/>
+ <mkdir path='archives/${wf:id()}'/>
+ <move source='${jobInput}'
target='archives/${wf:id()}/processed-input'/>
+ <chmod path='${jobOutput}' permissions='-rwxrw-rw-'
dir-files='true'><recursive/></chmod>
+ <chgrp path='${jobOutput}' group='testgroup'
dir-files='true'><recursive/></chgrp>
+ <setrep path='archives/${wf:id()/filename(s)}'
replication-factor='2'/>
+ </fs>
+ <ok to="myotherjob"/>
+ <error to="errorcleanup"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>In the above example, a directory named after the workflow job ID is created, and the input of the job, passed as a workflow configuration parameter, is archived under the newly created directory.</p>
+<p>As of schema 0.4, if a <tt>name-node</tt> element is specified, then it is not necessary for any of the paths to start with the file system URI, as it is taken from the <tt>name-node</tt> element. This is also true if the name-node is specified in the global section (see <a href="WorkflowFunctionalSpec.html#GlobalConfigurations">Global Configurations</a>).</p>
+<p>As of schema 0.4, zero or more <tt>job-xml</tt> elements can be specified;
these must refer to Hadoop JobConf <tt>job.xml</tt> formatted files bundled in
the workflow application. They can be used to set additional properties for the
FileSystem instance.</p>
+<p>As of schema 0.4, if a <tt>configuration</tt> element is specified, then it
will also be used to set additional JobConf properties for the FileSystem
instance. Properties specified in the <tt>configuration</tt> element override
properties specified in the files specified by any <tt>job-xml</tt>
elements.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.4">
+ ...
+ <action name="hdfscommands">
+ <fs>
+ <name-node>hdfs://foo:8020</name-node>
+ <job-xml>fs-info.xml</job-xml>
+ <configuration>
+ <property>
+ <name>some.property</name>
+ <value>some.value</value>
+ </property>
+ </configuration>
+ <delete path='/usr/tucu/temp-data'/>
+ </fs>
+ <ok to="myotherjob"/>
+ <error to="errorcleanup"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p><a name="SubWorkflowAction"></a></p></div>
+<div class="section">
+<h4><a name="a3.2.5_Sub-workflow_Action"></a>3.2.5 Sub-workflow Action</h4>
+<p>The <tt>sub-workflow</tt> action runs a child workflow job.</p>
+<p>The parent workflow job will wait until the child workflow job has
completed.</p>
+<p>There can be several sub-workflows defined within a single workflow, each
under its own action element.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <sub-workflow>
+ <app-path>[WF-APPLICATION-PATH]</app-path>
+ <propagate-configuration/>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ </sub-workflow>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The child workflow job runs in the same Oozie system instance where the
parent workflow job is running.</p>
+<p>The <tt>app-path</tt> element specifies the path to the workflow
application of the child workflow job.</p>
+<p>The <tt>propagate-configuration</tt> flag, if present, indicates that the
workflow job configuration should be propagated to the child workflow.</p>
+<p>The <tt>configuration</tt> section can be used to specify the job
properties that are required to run the child workflow job.</p>
+<p>The configuration of the <tt>sub-workflow</tt> action can be parameterized
(templatized) using EL expressions.</p>
+<p><b>Example:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="a">
+ <sub-workflow>
+ <app-path>child-wf</app-path>
+ <configuration>
+ <property>
+ <name>input.dir</name>
+ <value>${wf:id()}/second-mr-output</value>
+ </property>
+ </configuration>
+ </sub-workflow>
+ <ok to="end"/>
+ <error to="kill"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>In the above example, the workflow definition with the name <tt>child-wf</tt> will be run on the same Oozie instance (e.g. at <tt>http://myhost:11000/oozie</tt>). The specified workflow application must already be deployed on the target Oozie instance.</p>
+<p>A configuration parameter <tt>input.dir</tt> is being passed as job
property to the child workflow job.</p>
+<p>The subworkflow can inherit the lib jars from the parent workflow by
setting <tt>oozie.subworkflow.classpath.inheritance</tt> to true in
oozie-site.xml or on a per-job basis by setting
<tt>oozie.wf.subworkflow.classpath.inheritance</tt> to true in a job.properties
file. If both are specified,
<tt>oozie.wf.subworkflow.classpath.inheritance</tt> has priority. If the subworkflow and the parent have conflicting jars, the subworkflow's jar has priority. By default, <tt>oozie.wf.subworkflow.classpath.inheritance</tt> is set to false.</p>
+<p>To prevent errant workflows from starting infinitely recursive
subworkflows, <tt>oozie.action.subworkflow.max.depth</tt> can be specified in
oozie-site.xml to set the maximum depth of subworkflow calls. For example, if
set to 3, then a workflow can start subwf1, which can start subwf2, which can
start subwf3; but if subwf3 tries to start subwf4, then the action will fail.
The default is 50.</p>
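+<p>As an illustrative sketch (the values shown are assumptions for this example, not defaults), the two site-level settings described above might appear in <tt>oozie-site.xml</tt> as:</p>

```xml
<!-- oozie-site.xml fragment (illustrative values) -->
<property>
    <!-- let all subworkflows inherit the parent workflow's lib jars -->
    <name>oozie.subworkflow.classpath.inheritance</name>
    <value>true</value>
</property>
<property>
    <!-- allow at most 3 levels of nested subworkflow calls -->
    <name>oozie.action.subworkflow.max.depth</name>
    <value>3</value>
</property>
```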
+<p><a name="JavaAction"></a></p></div>
+<div class="section">
+<h4><a name="a3.2.6_Java_Action"></a>3.2.6 Java Action</h4>
+<p>The <tt>java</tt> action will execute the <tt>public static void
main(String[] args)</tt> method of the specified main Java class.</p>
+<p>Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task.</p>
+<p>The workflow job will wait until the Java application completes its execution before continuing to the next action.</p>
+<p>The <tt>java</tt> action has to be configured with the resource-manager,
name-node, main Java class, JVM options and arguments.</p>
+<p>To indicate an <tt>ok</tt> action transition, the main Java class must complete the <tt>main</tt> method invocation gracefully.</p>
+<p>To indicate an <tt>error</tt> action transition, the main Java class must
throw an exception.</p>
+<p>The main Java class can call <tt>System.exit(int n)</tt>. Exit code zero is
regarded as OK, while non-zero exit codes will cause the <tt>java</tt> action
to do an <tt>error</tt> transition and exit.</p>
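+<p>A minimal, hypothetical main class illustrating these transition rules (the class name and the argument check are assumptions for illustration only):</p>

```java
// Hypothetical main class for a java action, illustrating transitions.
public class MyMain {
    public static void main(String[] args) {
        if (args.length == 0) {
            // An uncaught exception makes the action take the error transition.
            throw new IllegalArgumentException("expected at least one argument");
        }
        // Completing main normally (or calling System.exit(0)) yields the
        // ok transition; System.exit with a non-zero code yields error.
        System.out.println("processing " + args[0]);
    }
}
```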
+<p>A <tt>java</tt> action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before starting the Java application. This capability enables Oozie to retry a Java application after a transient or non-transient failure; the cleanup can remove any temporary data the Java application may have created before failing.</p>
+<p>A <tt>java</tt> action can create a Hadoop configuration for interacting with a cluster (e.g. launching a map-reduce job). Oozie prepares a Hadoop configuration file which includes the environment's site configuration files (e.g. hdfs-site.xml, mapred-site.xml, etc.) plus the properties added to the <tt>configuration</tt> section of the <tt>java</tt> action. The Hadoop configuration file is made available as a local file to the Java application in its running directory. It can be added to the <tt>java</tt> action's Hadoop configuration by referencing the system property <tt>oozie.action.conf.xml</tt>. For example:</p>
+
+<div>
+<div>
+<pre class="source">// loading action conf prepared by Oozie
+Configuration actionConf = new Configuration(false);
+actionConf.addResource(new Path("file:///",
System.getProperty("oozie.action.conf.xml")));
+</pre></div></div>
+
+<p>If <tt>oozie.action.conf.xml</tt> is not added, the job will pick up the mapred-default properties, which may result in unexpected behaviour. For repeated configuration properties, later values override earlier ones.</p>
+<p>Inline property values can be parameterized (templatized) using EL
expressions.</p>
+<p>The YARN <tt>yarn.resourcemanager.address</tt> (<tt>resource-manager</tt>) and HDFS <tt>fs.default.name</tt> (<tt>name-node</tt>) properties must not be present either in the <tt>job-xml</tt> file or in the inline configuration.</p>
+<p>As with <tt>map-reduce</tt> and <tt>pig</tt> actions, it is possible to add files and archives to be available to the Java application. Refer to section <a href="WorkflowFunctionalSpec.html#FilesArchives">Adding Files and Archives for the Job</a>.</p>
+<p>The <tt>capture-output</tt> element can be used to propagate values back into the Oozie context, where they can then be accessed via EL functions. The values must be written out as a Java properties file; the filename is obtained via a system property specified by the constant <tt>oozie.action.output.properties</tt>.</p>
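+<p>As a sketch, a Java application might write its captured output like this. The class name, the <tt>processedRecords</tt> key, and the temp-file fallback are assumptions for illustration; inside a real action the <tt>oozie.action.output.properties</tt> system property is always set by Oozie:</p>

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch: propagate values back to Oozie via the capture-output mechanism.
public class CaptureOutputExample {

    // Writes the given key/value pairs to the properties file whose path
    // Oozie passes in the oozie.action.output.properties system property.
    // The temp-file fallback exists only so this sketch runs outside Oozie.
    public static File writeCapturedOutput(Properties values) throws IOException {
        String path = System.getProperty("oozie.action.output.properties");
        File out = (path != null)
                ? new File(path)
                : File.createTempFile("oozie-action-output", ".properties");
        try (FileOutputStream os = new FileOutputStream(out)) {
            values.store(os, "values propagated back to the Oozie context");
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Properties values = new Properties();
        values.setProperty("processedRecords", "12345"); // illustrative key
        File out = writeCapturedOutput(values);

        // Read the file back to confirm the round trip.
        Properties readBack = new Properties();
        try (FileInputStream is = new FileInputStream(out)) {
            readBack.load(is);
        }
        System.out.println(readBack.getProperty("processedRecords"));
    }
}
```

Values written this way can then be read from a later workflow node via EL, e.g. <tt>${wf:actionData('java-node')['processedRecords']}</tt> (node name assumed for illustration).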
+<p><b>IMPORTANT:</b> In order for a Java action to succeed on a secure
cluster, it must propagate the Hadoop delegation token like in the following
code snippet (this is benign on non-secure clusters):</p>
+
+<div>
+<div>
+<pre class="source">// propagate delegation related props from launcher job to
MR job
+if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
+ jobConf.set("mapreduce.job.credentials.binary",
System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
+}
+</pre></div></div>
+
+<p><b>IMPORTANT:</b> Because the Java application is run from within a Map-Reduce job, from Hadoop 0.20 onwards a queue must be assigned to it. The queue name must be specified as a configuration property.</p>
+<p><b>IMPORTANT:</b> The Java application from a Java action is executed in a
single map task. If the task is abnormally terminated, such as due to a
TaskTracker restart (e.g. during cluster maintenance), the task will be retried
via the normal Hadoop task retry mechanism. To avoid workflow failure, the
application should be written in a fashion that is resilient to such retries,
for example by detecting and deleting incomplete outputs or picking back up
from complete outputs. Furthermore, if a Java action spawns asynchronous
activity outside the JVM of the action itself (such as by launching additional
MapReduce jobs), the application must consider the possibility of collisions
with activity spawned by the new instance.</p>
+<p><b>Syntax:</b></p>
+
+<div>
+<div>
+<pre class="source"><workflow-app name="[WF-DEF-NAME]"
xmlns="uri:oozie:workflow:1.0">
+ ...
+ <action name="[NODE-NAME]">
+ <java>
+ <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+ <name-node>[NAME-NODE]</name-node>
+ <prepare>
+ <delete path="[PATH]"/>
+ ...
+ <mkdir path="[PATH]"/>
+ ...
+ </prepare>
+ <job-xml>[JOB-XML]</job-xml>
+ <configuration>
+ <property>
+ <name>[PROPERTY-NAME]</name>
+ <value>[PROPERTY-VALUE]</value>
+ </property>
+ ...
+ </configuration>
+ <main-class>[MAIN-CLASS]</main-class>
+ <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
+ <arg>ARGUMENT</arg>
+ ...
+ <file>[FILE-PATH]</file>
+ ...
+ <archive>[FILE-PATH]</archive>
+ ...
+ <capture-output />
+ </java>
+ <ok to="[NODE-NAME]"/>
+ <error to="[NODE-NAME]"/>
+ </action>
+ ...
+</workflow-app>
+</pre></div></div>
+
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to delete before starting the Java application. It should be used exclusively for directory cleanup or for dropping HCatalog tables or table partitions needed by the Java application to be executed. In the case of <tt>delete</tt>, a glob pattern can be used to specify the path. The format for specifying an HCatalog table URI is <tt>hcat://[metastore server]:[port]/[database name]/[table name]</tt>, and the format for specifying an HCatalog table partition URI is <tt>hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value]</tt>. In the case of an HCatalog URI, hive-site.xml needs to be shipped using the <tt>file</tt> tag, and the HCatalog and Hive jars need to be placed in the workflow lib directory or specified using the <tt>archive</tt> tag.</p>
[... 3526 lines stripped ...]
Added: websites/staging/oozie/trunk/content/docs/5.2.0/configuration.xsl
==============================================================================
Binary file - no diff available.
Propchange: websites/staging/oozie/trunk/content/docs/5.2.0/configuration.xsl
------------------------------------------------------------------------------
svn:mime-type = application/xml