[
https://issues.apache.org/jira/browse/HADOOP-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675861#action_12675861
]
Alejandro Abdelnur commented on HADOOP-5303:
--------------------------------------------
Cascading and HWS are different beasts.
Cascading is a different way of doing what Pig does. Programming in Cascading
means programming against a higher-level abstraction that resolves into a series
of Map/Reduce jobs.
HWS is a (server) workflow system specialized in running Hadoop/Pig jobs wired
together via a PDL descriptor.
Following are a few quick highlights of how Cascading and HWS differ:
h4. Cascading uses a topological search model to resolve the execution path.
HWS uses a 'DAG of processes' workflow model that allows parallelism and
alternate execution paths (decisions) to be expressed explicitly.
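For illustration, the sketch below shows how explicit parallelism (fork/join) and an alternate path (decision) could look in such a DAG; the element and attribute names are hypothetical and are not taken from the attached specification.
{code:xml}
<!-- Hypothetical workflow sketch; element and attribute names are illustrative
     only, they are not taken from the attached HWS specification. -->
<workflow-app name="sample-wf">
  <start to="parallel-load"/>
  <!-- fork: the two Hadoop jobs below run in parallel -->
  <fork name="parallel-load">
    <path start="load-logs"/>
    <path start="load-dimensions"/>
  </fork>
  <action name="load-logs" type="hadoop"> ... <ok to="sync"/> </action>
  <action name="load-dimensions" type="hadoop"> ... <ok to="sync"/> </action>
  <join name="sync" to="check-input"/>
  <!-- decision: alternate execution paths -->
  <decision name="check-input">
    <case to="aggregate">${inputAvailable}</case>
    <default to="end"/>
  </decision>
  <action name="aggregate" type="pig"> ... <ok to="end"/> </action>
  <end name="end"/>
</workflow-app>
{code}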
h4. Cascading runs as a client from the command line.
HWS is a server system (like the Hadoop JobTracker) to which you submit workflow
jobs and later check their status.
In HWS no resources are held on the client once the workflow job has been
submitted; the workflow job runs in the server.
This allows several thousand workflow jobs to run concurrently against a single
HWS instance, which supports system failover.
In HWS, monitoring and status tracking of jobs is done via CLIs and a web
console that gathers data from HWS (as you do with Hadoop).
h4. Cascading's primary programming model is similar to Pig's, but with a Java API.
In Cascading you can still use your existing Hadoop jobs in a flow, as a way to
integrate with existing Map/Reduce apps, but the real benefit of Cascading comes
from using its API programming model.
HWS's primary programming model is Hadoop/Pig jobs connected via a PDL-like XML
workflow definition file.
h4. In Cascading you need to write Java code to wire your Hadoop jobs.
In HWS you don't wire your Hadoop/Pig jobs in Java; you wire them in a workflow
XML file, in a more declarative way.
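As a rough sketch of that declarative style (again with hypothetical element names, not the ones defined in the attached spec), wiring a Map/Reduce job into a Pig job could look like:
{code:xml}
<!-- Hypothetical wiring of a Hadoop job followed by a Pig job; names are
     illustrative only. -->
<action name="clean-data" type="hadoop">
  <job-jar>clean-data.jar</job-jar>
  <ok to="summarize"/>
  <error to="fail"/>
</action>
<action name="summarize" type="pig">
  <script>summarize.pig</script>
  <ok to="end"/>
  <error to="fail"/>
</action>
{code}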
> Hadoop Workflow System (HWS)
> ----------------------------
>
> Key: HADOOP-5303
> URL: https://issues.apache.org/jira/browse/HADOOP-5303
> Project: Hadoop Core
> Issue Type: New Feature
> Reporter: Alejandro Abdelnur
> Assignee: Alejandro Abdelnur
> Attachments: hws-preso-v1_0_2009FEB22.pdf, hws-v1_0_2009FEB22.pdf
>
>
> This is a proposal for a system specialized in running Hadoop/Pig jobs in a
> control dependency DAG (Directed Acyclic Graph), a Hadoop workflow application.
> Attached there is a complete specification and a high level overview
> presentation.
> ----
> *Highlights*
> A Workflow application is a DAG that coordinates the following types of
> actions: Hadoop, Pig, Ssh, Http, Email and sub-workflows.
> Flow control operations within the workflow applications can be done using
> decision, fork and join nodes. Cycles in workflows are not supported.
> Actions and decisions can be parameterized with job properties, action
> output (e.g. Hadoop counters, Ssh key/value pair output) and file
> information (file exists, file size, etc). Formal parameters are expressed in
> the workflow definition as {{${VAR}}} variables.
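> For example (with purely illustrative element names; only the {{${VAR}}}
> convention is from this proposal), an action could be parameterized with
> variables and a decision could test the action's output:
> {code:xml}
> <action name="import" type="hadoop">
>   <input>${inputDir}</input>
>   <output>${outputDir}/${runDate}</output>
>   <ok to="check-volume"/>
>   <error to="fail"/>
> </action>
> <decision name="check-volume">
>   <!-- predicate over the action's output, e.g. a Hadoop counter -->
>   <case to="aggregate">${import.counters.recordsWritten gt 0}</case>
>   <default to="end"/>
> </decision>
> {code}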
> A Workflow application is a ZIP file that contains the workflow definition
> (an XML file) and all the files necessary to run the actions: JAR files for
> Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Pig
> scripts, and other resource files.
> Before running a workflow job, the corresponding workflow application must be
> deployed in HWS.
> Deploying workflow applications and running workflow jobs can be done via
> command line tools, a WS API and a Java API.
> Monitoring the system and workflow jobs can be done via a web console,
> command line tools, a WS API and a Java API.
> When submitting a workflow job, a set of properties resolving all the formal
> parameters in the workflow definition must be provided. This set of
> properties is a Hadoop configuration.
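> For instance, the properties resolving the {{${inputDir}}} and {{${runDate}}}
> variables used in the sketch above could be supplied in the standard Hadoop
> configuration format (the property names are illustrative):
> {code:xml}
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>inputDir</name>
>     <value>hdfs://namenode:9000/data/raw/logs</value>
>   </property>
>   <property>
>     <name>runDate</name>
>     <value>2009-02-22</value>
>   </property>
> </configuration>
> {code}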
> Possible states for a workflow job are: {{CREATED}}, {{RUNNING}},
> {{SUSPENDED}}, {{SUCCEEDED}}, {{KILLED}} and {{FAILED}}.
> In the case of an action failure in a workflow job, depending on the type of
> failure, HWS will attempt automatic retries, request a manual retry, or fail
> the workflow job.
> HWS can make HTTP callback notifications on action start/end/failure events
> and workflow end/failure events.
> In the case of a workflow job failure, the workflow job can be resubmitted,
> skipping the previously completed actions. Before doing a resubmission, the
> workflow application can be updated with a patch to fix a problem in the
> workflow application code.
> ----