[ https://issues.apache.org/jira/browse/PIG-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498370#comment-13498370 ]
Billie Rinaldi commented on PIG-3048:
-------------------------------------
bq. I could have been misreading your patch, but I thought it was providing the
adjacency list only from the job in question. The DAG would include that, plus
all adjacency lists of other jobs in the script that might not be directly
connected to the job in question.
Right, I remember now that operators are removed from the plan as jobs are
executed. I didn't seem to have access to the entire DAG at the point where
the job configuration is populated, so I figured that you could just take the
full DAG from the first job, or even better, merge the partial DAGs for all the
jobs (in the event that some application adds onto its DAG after jobs have
begun executing). Instead, I could split the new configuration method into two
parts, one that sets the adjacencies before any jobs are run, and another that
sets the rest of the information. I can certainly use OperatorKey.toString for
the node names.
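
To illustrate the "merge the partial DAGs" option, here is a minimal sketch and not the code in the patch. It assumes each job exposes its partial DAG as a mapping from source node name to a list of target node names; the function name and the "scope-N" node names are hypothetical placeholders.

```python
from collections import defaultdict


def merge_partial_dags(partial_dags):
    """Union several partial adjacency lists into one adjacency list.

    Each partial DAG is assumed (hypothetically) to be a dict mapping a
    source node name to a list of target node names.
    """
    merged = defaultdict(set)
    for partial in partial_dags:
        for source, targets in partial.items():
            # A set union means edges reported by more than one job
            # are only recorded once.
            merged[source].update(targets)
    # Sort the targets so the result is deterministic.
    return {source: sorted(targets) for source, targets in merged.items()}


# Hypothetical partial DAGs as seen from two different jobs.
job_a = {"scope-1": ["scope-2"]}
job_b = {"scope-1": ["scope-3"], "scope-2": ["scope-4"]}
full_dag = merge_partial_dags([job_a, job_b])
```

Because the merge is a plain union, it also covers the case where an application adds edges to its DAG after some jobs have already started.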
bq. If we had those things, would we still need a unique id for the run?
I'd like to have a workflow ID like pig_<scriptID> because then we don't have
to know what makes a unique identifier for a particular application. I've
opened a similar ticket for Hive, HIVE-3708. Do you think we should include
the Pig version in the ID? I think it makes sense to make the workflow name
either the script name or the logical plan signature, or perhaps a
concatenation of the two. Is the script name what you meant by "the logical
name of the deployed scheduled script"?
bq. Do we need node name?
If you don't have a node name, how do you know which job in the DAG is running?
> Add mapreduce workflow information to job configuration
> -------------------------------------------------------
>
> Key: PIG-3048
> URL: https://issues.apache.org/jira/browse/PIG-3048
> Project: Pig
> Issue Type: Improvement
> Reporter: Billie Rinaldi
> Attachments: PIG-3048.patch
>
>
> Adding workflow properties to the job configuration would enable logging and
> analysis of workflows in addition to individual MapReduce jobs. Suggested
> properties include a workflow ID, workflow name, adjacency list connecting
> nodes in the workflow, and the name of the current node in the workflow.
>
> mapreduce.workflow.id - a unique ID for the workflow, ideally prepended with
> the application name
> e.g. pig_<pigScriptId>
>
> mapreduce.workflow.name - a name for the workflow, to distinguish this
> workflow from other workflows and to group different runs of the same workflow
> e.g. pig command line
>
> mapreduce.workflow.adjacency - an adjacency list for the workflow graph,
> encoded as mapreduce.workflow.adjacency.<source node> = <comma-separated list
> of target nodes>
>
> mapreduce.workflow.node.name - the name of the node corresponding to this
> MapReduce job in the workflow adjacency list
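
The property layout described above can be sketched as follows. This is only an illustration under stated assumptions: a plain dict stands in for the job configuration, the function name and sample values are hypothetical, and node names are assumed not to contain commas (no escaping is attempted).

```python
def workflow_properties(script_id, workflow_name, adjacency, node_name):
    """Build the four kinds of workflow properties as a flat dict.

    `adjacency` is assumed to map each source node name to a list of
    target node names. A dict stands in for the real job configuration.
    """
    props = {
        # Workflow ID prepended with the application name, e.g. pig_<id>.
        "mapreduce.workflow.id": "pig_" + script_id,
        "mapreduce.workflow.name": workflow_name,
        "mapreduce.workflow.node.name": node_name,
    }
    for source, targets in adjacency.items():
        # One property per source node; targets are comma-separated.
        props["mapreduce.workflow.adjacency." + source] = ",".join(targets)
    return props


# Hypothetical values for a two-edge workflow.
props = workflow_properties(
    script_id="1234",
    workflow_name="pig myscript.pig",
    adjacency={"scope-1": ["scope-2", "scope-3"]},
    node_name="scope-1",
)
```

With these inputs, the adjacency entry for scope-1 is stored under the key mapreduce.workflow.adjacency.scope-1 with the value "scope-2,scope-3".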