[ 
https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498370#comment-13498370
 ] 

Billie Rinaldi commented on PIG-3048:
-------------------------------------

bq. I could have been mis-reading your patch but I thought it was providing the 
adjacency list only from the job in question. The DAG would include that, plus 
all adjacency lists of other jobs in the script that might not be directly 
connected to the job in question.

Right, I remember now that operators are removed from the plan as jobs are 
executed.  I didn't seem to have access to the entire DAG at the point where 
the job configuration is populated, so I figured that you could just take the 
full DAG from the first job, or even better, merge the partial DAGs for all the 
jobs (in the event that some application adds onto its DAG after jobs have 
begun executing).  Instead, I could split the new configuration method into two 
parts, one that sets the adjacencies before any jobs are run, and another that 
sets the rest of the information.  I can certainly use OperatorKey.toString for 
the node names.
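
As a rough illustration of the "merge the partial DAGs" idea, the sketch below unions the adjacency lists seen from several jobs into one graph. It assumes each partial DAG is a plain mapping from a source node name to a list of target node names; the function name merge_adjacency and the scope-N node names are hypothetical and are not taken from the patch.

```python
def merge_adjacency(partials):
    """Union several partial adjacency lists, keeping first-seen target order."""
    merged = {}
    for partial in partials:
        for source, targets in partial.items():
            seen = merged.setdefault(source, [])
            for target in targets:
                # Skip edges already recorded by an earlier job.
                if target not in seen:
                    seen.append(target)
    return merged


# Hypothetical partial DAGs as seen from two different jobs.
job1_view = {"scope-1": ["scope-2"], "scope-2": ["scope-3"]}
job2_view = {"scope-2": ["scope-3", "scope-4"]}

merged = merge_adjacency([job1_view, job2_view])
```

Because the merge is a set union per source node, it also covers the case where an application adds edges after some jobs have already started.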

bq. If we had those things, would we still need a unique id for the run?

I'd like to have a workflow ID like pig_<scriptID> because then we don't have 
to know what makes a unique identifier for a particular application.  I've 
opened a similar ticket for Hive, HIVE-3708.  Do you think we should include 
the Pig version in the ID?  I think it makes sense to make the workflow name 
either the script name or the logical plan signature, or perhaps a 
concatenation of the two.  Is the script name what you meant by "the logical 
name of the deployed scheduled script"?

bq. Do we need node name?

If you don't have a node name, how do you know which job in the DAG is running?
                
> Add mapreduce workflow information to job configuration
> -------------------------------------------------------
>
>                 Key: PIG-3048
>                 URL: https://issues.apache.org/jira/browse/PIG-3048
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Billie Rinaldi
>         Attachments: PIG-3048.patch
>
>
> Adding workflow properties to the job configuration would enable logging and 
> analysis of workflows in addition to individual MapReduce jobs.  Suggested 
> properties include a workflow ID, workflow name, adjacency list connecting 
> nodes in the workflow, and the name of the current node in the workflow.
>
> mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with
> the application name
>   e.g. pig_<pigScriptId>
>
> mapreduce.workflow.name - a name for the workflow, to distinguish this
> workflow from other workflows and to group different runs of the same workflow
>   e.g. pig command line
>
> mapreduce.workflow.adjacency - an adjacency list for the workflow graph,
> encoded as mapreduce.workflow.adjacency.<source node> = <comma-separated list
> of target nodes>
>
> mapreduce.workflow.node.name - the name of the node corresponding to this
> MapReduce job in the workflow adjacency list
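
To make the proposed encoding concrete, here is a minimal sketch that flattens a small workflow graph into the four kinds of properties described above. It uses a plain dictionary rather than a real Hadoop job configuration; the function name workflow_properties, the script ID 1234, the command line, and the node names A through D are all hypothetical. Nodes without outgoing edges get no adjacency entry here, which is an assumption, since the description does not say how they should be represented.

```python
def workflow_properties(workflow_id, workflow_name, adjacency, node_name):
    """Build the proposed mapreduce.workflow.* properties as a flat dict."""
    props = {
        "mapreduce.workflow.id": workflow_id,
        "mapreduce.workflow.name": workflow_name,
        "mapreduce.workflow.node.name": node_name,
    }
    # One property per source node; targets are joined with commas.
    for source, targets in adjacency.items():
        props["mapreduce.workflow.adjacency." + source] = ",".join(targets)
    return props


# Hypothetical diamond-shaped workflow: A -> B, A -> C, B -> D, C -> D.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
props = workflow_properties("pig_1234", "pig -f example.pig", dag, "B")

for key in sorted(props):
    print(key, "=", props[key])
```

Each job in the workflow would carry the same ID, name, and adjacency entries, with only mapreduce.workflow.node.name differing from job to job.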
