[jira] [Commented] (PIG-3048) Add mapreduce workflow information to job configuration

Bill Graham (JIRA) Thu, 15 Nov 2012 10:46:15 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498231#comment-13498231
 ]


Bill Graham commented on PIG-3048:
----------------------------------

I could have been misreading your patch, but I thought it provided the 
adjacency list only for the job in question. The DAG would include that, plus 
the adjacency lists of all other jobs in the script, which might not be 
directly connected to the job in question.

We use the logical plan signature to distinguish between different versions of 
the same workflow. Between versions the DAG could change. We've got our own 
custom field in the job conf to represent the workflow name, but it would be 
great to standardize this. So we could have something like this:

- Workflow name: the logical name of the deployed scheduled script 
(e.g., Hourly click analysis)
- Logical plan signature (existing): a hash that represents a version of the 
script, without considering its input/output
- Script start time (existing): used with Workflow name and Logical plan 
signature to correlate multiple jobs into a single run of a workflow
- Job start time (existing): used to show when different jobs start
- Script DAG: used by tools to visualize the current workflow execution given a 
job. This is something we'd like to have for Ambrose 
(https://github.com/twitter/ambrose).
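
As a minimal sketch of the correlation described in the list above, the 
following groups job records into runs by workflow name, logical plan 
signature and script start time, then orders each run by job start time. The 
record layout and field names (job_id, workflow_name, plan_signature, 
script_start, job_start) are hypothetical and are not actual Pig or Hadoop 
job conf keys.

```python
from collections import defaultdict

# Hypothetical job records. The field names are illustrative only and do not
# correspond to real Pig or Hadoop job conf keys.
jobs = [
    {"job_id": "job_002", "workflow_name": "Hourly click analysis",
     "plan_signature": "a1b2", "script_start": 1000, "job_start": 1060},
    {"job_id": "job_001", "workflow_name": "Hourly click analysis",
     "plan_signature": "a1b2", "script_start": 1000, "job_start": 1010},
    {"job_id": "job_003", "workflow_name": "Hourly click analysis",
     "plan_signature": "a1b2", "script_start": 4600, "job_start": 4615},
]


def group_into_runs(job_records):
    """Group jobs by (workflow name, plan signature, script start time)."""
    runs = defaultdict(list)
    for job in job_records:
        key = (job["workflow_name"], job["plan_signature"], job["script_start"])
        runs[key].append(job)
    # Within a run, order jobs by the time each one started.
    for members in runs.values():
        members.sort(key=lambda j: j["job_start"])
    return dict(runs)


runs = group_into_runs(jobs)
```

Two scripts that share a name and signature and start at the same timestamp 
would collapse into one run under this key, which is the case a dedicated 
unique run id would rule out.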

We represent the DAG as an adjacency list keyed by the physical operator key 
(scope-*) [1] and then once a job starts we add the jobId to the node [2].

1 - 
https://github.com/twitter/ambrose/blob/master/pig/src/main/java/com/twitter/ambrose/pig/AmbrosePigProgressNotificationListener.java#L91
2 - 
https://github.com/twitter/ambrose/blob/master/pig/src/main/java/com/twitter/ambrose/pig/AmbrosePigProgressNotificationListener.java#L138
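
A rough sketch of that representation, not the Ambrose code itself: the DAG 
is an adjacency list keyed by scope-* names, a job id is attached to a node 
once its job starts, and the list can be flattened into the 
mapreduce.workflow.adjacency.<source node> properties proposed in the issue. 
The function names, scope keys and job id below are made up for illustration, 
and emitting properties only for nodes that have successors is an assumption.

```python
def build_dag(edges):
    """Build an adjacency list: node name -> {"job_id", "successors"}."""
    dag = {}
    for source, target in edges:
        dag.setdefault(source, {"job_id": None, "successors": []})
        dag.setdefault(target, {"job_id": None, "successors": []})
        dag[source]["successors"].append(target)
    return dag


def mark_started(dag, node, job_id):
    """Record the job id on a node once its job has started."""
    dag[node]["job_id"] = job_id


def to_conf_properties(dag, prefix="mapreduce.workflow.adjacency."):
    """Flatten the DAG into '<prefix><source>' = 'target1,target2' pairs.

    Assumption: nodes without successors get no property.
    """
    return {
        prefix + node: ",".join(info["successors"])
        for node, info in dag.items()
        if info["successors"]
    }


# Hypothetical script DAG. scope-50 -> scope-60 is not connected to the
# other jobs but still belongs to the script-level DAG.
edges = [
    ("scope-10", "scope-20"),
    ("scope-10", "scope-30"),
    ("scope-20", "scope-40"),
    ("scope-30", "scope-40"),
    ("scope-50", "scope-60"),
]
dag = build_dag(edges)
mark_started(dag, "scope-10", "job_201211151046_0001")
props = to_conf_properties(dag)
```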

If we had those things, would we still need a unique id for the run? It would 
certainly be more robust than the combination of start time, signature and 
workflow name.

Do we need node name?


                
> Add mapreduce workflow information to job configuration
> -------------------------------------------------------
>
>                 Key: PIG-3048
>                 URL: https://issues.apache.org/jira/browse/PIG-3048
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Billie Rinaldi
>         Attachments: PIG-3048.patch
>
>
> Adding workflow properties to the job configuration would enable logging and 
> analysis of workflows in addition to individual MapReduce jobs.  Suggested 
> properties include a workflow ID, workflow name, adjacency list connecting 
> nodes in the workflow, and the name of the current node in the workflow.
> mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with 
> the application name
> e.g. pig_<pigScriptId>
> mapreduce.workflow.name - a name for the workflow, to distinguish this 
> workflow from other workflows and to group different runs of the same workflow
> e.g. pig command line
> mapreduce.workflow.adjacency - an adjacency list for the workflow graph, 
> encoded as mapreduce.workflow.adjacency.<source node> = <comma-separated list 
> of target nodes>
> mapreduce.workflow.node.name - the name of the node corresponding to this 
> MapReduce job in the workflow adjacency list

