[
https://issues.apache.org/jira/browse/HADOOP-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676130#action_12676130
]
Alejandro Abdelnur commented on HADOOP-5303:
--------------------------------------------
h4. Regarding the EL resolution,
Resolution is well defined, both how and when it happens.
{{${something}}} expressions are resolved from workflow job properties and EL
functions at the time that a workflow job starts a workflow action (enters the
node). Values from the HWS configuration (hws-default.xml & hws-site.xml) are
not used to resolve workflow job properties (this is different from Hadoop).
The EL language HWS uses is JSP EL (we use the commons-el implementation). It
supports variables, functions and complex expressions. What inconsistency do
you (Steve) refer to? It is probably a mistake in the spec.
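For illustration, a minimal sketch of resolving a {{${...}}} expression with
commons-el against workflow job properties (the property name and value here
are hypothetical, not from the spec):
{code}
import java.util.HashMap;
import java.util.Map;
import javax.servlet.jsp.el.ExpressionEvaluator;
import javax.servlet.jsp.el.VariableResolver;
import org.apache.commons.el.ExpressionEvaluatorImpl;

public class ElSketch {
    public static void main(String[] args) throws Exception {
        // Workflow job properties, as provided at job submission (hypothetical).
        final Map<String, Object> jobProps = new HashMap<String, Object>();
        jobProps.put("inputDir", "hdfs://namenode:8020/user/joe/input");

        ExpressionEvaluator evaluator = new ExpressionEvaluatorImpl();
        VariableResolver resolver = new VariableResolver() {
            public Object resolveVariable(String name) {
                return jobProps.get(name);
            }
        };
        // Resolved at the time the workflow job enters the action node.
        Object value = evaluator.evaluate("${inputDir}", String.class, resolver, null);
        System.out.println(value); // hdfs://namenode:8020/user/joe/input
    }
}
{code}
EL functions would be exposed the same way through a {{FunctionMapper}}
(passed as {{null}} above).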
h4. Regarding the choice of SQL DB for job store,
If you need HA/failover for the DB you can get it; granted, you need the
hardware, but it takes zero to minimal effort on the HWS side.
We did not use HDFS because we need frequent workflow job context updates
(i.e., every time an action starts or ends).
HWS keeps zero state in memory when a workflow job is not doing a transition
(this allows HWS to scale well); because of this we need indexed access (i.e.,
by job ID, action ID, user ID). A SQL DB gives very good read/write access
times. Finally, transaction support makes it very straightforward to keep
things consistent in case of failure.
HWS uses standard SQL with no vendor-specific extensions, so it can run on any
SQL DB (we use HSQL for unit tests and MySQL when deployed).
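As a minimal sketch of that access pattern (table and column names are
hypothetical, not the actual HWS schema), an indexed lookup plus a
transactional job-context update against an in-memory HSQL instance:
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class JobStoreSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver"); // register the HSQL driver
        // In-memory HSQL DB, as used for unit tests.
        Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:hws", "sa", "");
        Statement ddl = conn.createStatement();
        ddl.execute("CREATE TABLE wf_jobs (job_id VARCHAR(64) PRIMARY KEY, "
                + "user_id VARCHAR(64), status VARCHAR(16))");
        ddl.execute("CREATE INDEX wf_jobs_user ON wf_jobs (user_id)");
        ddl.execute("INSERT INTO wf_jobs VALUES ('wf-00001', 'joe', 'CREATED')");

        conn.setAutoCommit(false);
        try {
            // Persist the job context transition (e.g., an action started).
            PreparedStatement update = conn.prepareStatement(
                    "UPDATE wf_jobs SET status = ? WHERE job_id = ?");
            update.setString(1, "RUNNING");
            update.setString(2, "wf-00001");
            update.executeUpdate();
            conn.commit(); // atomic: a crash before this leaves the old state
        } catch (Exception ex) {
            conn.rollback(); // keeps the job context consistent on failure
            throw ex;
        }

        // Indexed read, e.g. by user ID, with no in-memory state needed.
        PreparedStatement query = conn.prepareStatement(
                "SELECT job_id, status FROM wf_jobs WHERE user_id = ?");
        query.setString(1, "joe");
        ResultSet rs = query.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getString(2));
        }
    }
}
{code}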
h4. Regarding allowing users to plug in new actions without editing the
codebase and schema,
There is no need to modify the HWS codebase for a new action.
Your suggestion of using something like {{<action type="email">}} instead of
{{<email>}} makes sense. This would also remove the need to tweak the
XML-schema. Something like:
{code}
<action name="myPigjob">
  <pig xmlns="hws:pig:1">
    ...
  </pig>
</action>
{code}
By doing this, schema validation of the action type can also be performed.
Plus, you could support multiple versions of an action type if needed.
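As a hypothetical sketch (not the actual HWS API) of how action executors
could be plugged in per XML namespace, with multiple versions of an action
type coexisting side by side:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin contract: one executor per action-type namespace.
interface ActionExecutor {
    void start(String actionXml) throws Exception;
}

class ActionRegistry {
    private final Map<String, ActionExecutor> executors =
            new HashMap<String, ActionExecutor>();

    // Register an executor under its namespace, e.g. "hws:pig:1".
    // A "hws:pig:2" executor could be registered alongside it.
    void register(String xmlns, ActionExecutor executor) {
        executors.put(xmlns, executor);
    }

    // Dispatch by the xmlns found on the action element; the XML-schema for
    // that namespace can validate the element before execution.
    ActionExecutor lookup(String xmlns) {
        return executors.get(xmlns);
    }
}
{code}
A new action type would then ship as an executor plus an XML-schema for its
namespace, with no change to the core codebase.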
h4. Regarding testing, using HSQL
Yes, we already do; the test cases fly.
h4. Regarding why XML and not JSON,
HWS uses JSON for all WS API responses.
HWS uses XML for the workflow job conf (leveraging Hadoop Configuration).
HWS uses XML for the workflow definition (PDL).
h4. Regarding the Cascading comments not being strictly true,
I'm not a Cascading expert, so I may have missed something, but I think I got
things mostly right. Corrections, please...
> Hadoop Workflow System (HWS)
> ----------------------------
>
> Key: HADOOP-5303
> URL: https://issues.apache.org/jira/browse/HADOOP-5303
> Project: Hadoop Core
> Issue Type: New Feature
> Reporter: Alejandro Abdelnur
> Assignee: Alejandro Abdelnur
> Attachments: hws-preso-v1_0_2009FEB22.pdf, hws-v1_0_2009FEB22.pdf
>
>
> This is a proposal for a system specialized in running Hadoop/Pig jobs in a
> control dependency DAG (Directed Acyclic Graph), a Hadoop workflow application.
> Attached there is a complete specification and a high level overview
> presentation.
> ----
> *Highlights*
> A Workflow application is a DAG that coordinates the following types of
> actions: Hadoop, Pig, Ssh, Http, Email and sub-workflows.
> Flow control operations within the workflow applications can be done using
> decision, fork and join nodes. Cycles in workflows are not supported.
> Actions and decisions can be parameterized with job properties, action
> outputs (i.e. Hadoop counters, Ssh key/value pair output) and file
> information (file exists, file size, etc). Formal parameters are expressed in
> the workflow definition as {{${VAR}}} variables.
> A Workflow application is a ZIP file that contains the workflow definition
> (an XML file) and all the necessary files to run its actions: JAR files for
> Map/Reduce jobs, shell scripts for streaming Map/Reduce jobs, native
> libraries, Pig scripts, and other resource files.
> Before running a workflow job, the corresponding workflow application must be
> deployed in HWS.
> Deploying workflow applications and running workflow jobs can be done via
> command line tools, a WS API and a Java API.
> Monitoring the system and workflow jobs can be done via a web console,
> command line tools, a WS API and a Java API.
> When submitting a workflow job, a set of properties resolving all the formal
> parameters in the workflow definition must be provided. This set of
> properties is a Hadoop configuration.
> Possible states for a workflow job are: {{CREATED}}, {{RUNNING}},
> {{SUSPENDED}}, {{SUCCEEDED}}, {{KILLED}} and {{FAILED}}.
> In the case of an action failure in a workflow job, depending on the type of
> failure, HWS will attempt automatic retries, request a manual retry, or fail
> the workflow job.
> HWS can make HTTP callback notifications on action start/end/failure events
> and workflow end/failure events.
> In the case of workflow job failure, the workflow job can be resubmitted,
> skipping previously completed actions. Before resubmission, the workflow
> application can be updated with a patch to fix a problem in the workflow
> application code.
> ----