JobID is vague in Tez; you should use dagId instead. However, I don't see a way to get the dagId from within a RecordWriter/OutputCommitter. A possible workaround is to use conf.get("mapreduce.workflow.id") + conf.get("mapreduce.workflow.node.name"). Note that both are Pig-specific configuration properties and are only set when you run with Pig.
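The suggested workaround might look roughly like this. This is a minimal sketch, not the actual code from the thread: it uses a plain Map in place of the Hadoop Configuration object, the property names are the ones quoted above, and the helper name and sample values are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class WorkflowTempDir {
    // Build a unique per-DAG temp directory name from the workflow
    // properties that Pig sets at submit time, as suggested above.
    // Returns null if either property is absent (i.e. the job was
    // not submitted by Pig).
    static String tempDirName(Map<String, String> conf) {
        String workflowId = conf.get("mapreduce.workflow.id");
        String nodeName = conf.get("mapreduce.workflow.node.name");
        if (workflowId == null || nodeName == null) {
            return null;
        }
        return workflowId + "_" + nodeName;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Illustrative values; in a real run Pig populates these.
        conf.put("mapreduce.workflow.id", "pig_1440491467561");
        conf.put("mapreduce.workflow.node.name", "scope-12");
        System.out.println(tempDirName(conf)); // pig_1440491467561_scope-12
    }
}
```

In a real RecordWriter or OutputCommitter the same two `conf.get(...)` calls would run against the job's Configuration, so both the task side and the commit side derive the same directory name.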
Daniel

On 8/25/15, 2:08 PM, "Hitesh Shah" <[email protected]> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question but should give you some
>background info. When Pig uses Tez, it may end up running multiple dags
>within the same YARN application, therefore the "jobId" (in the case of
>MR, the job id maps to the application id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same DAG
>could write to HDFS, hence both dagId and vertexId are required to
>guarantee uniqueness when writing to a common location.
>
>thanks
>— Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <[email protected]> wrote:
>
>> Hi,
>>
>> We are trying to run our existing workflows that contain pig scripts
>> on tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some
>> problems when we run our code with tez.
>>
>> In our code, we write to and read from a temp directory which we
>> create with a name based on the jobID:
>> Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in
>> close() we take the jobID from the TaskAttemptContext context.
>> Meaning, each task writes a file to this directory in the close()
>> method according to the jobID from the context.
>> Part 2 - At the end of the whole job (after all the tasks have
>> completed), our custom outputCommitter (which extends
>> org.apache.hadoop.mapreduce.OutputCommitter) looks, in commitJob(),
>> for that directory of the job and handles all the files under it.
>> The jobID is taken from JobContext context.getJobID().toString().
>>
>> We noticed that when we use tez, this mechanism doesn't work, since
>> the jobID from the tez task (part 1) is the original job id combined
>> with the vertex id, for example 14404914675610 instead of
>> 1440491467561. So the directory name in part 2 is different than in
>> part 1.
>>
>> We looked for a way to retrieve only the vertex id or only the job
>> id, but didn't find one - in the configuration, the property
>> mapreduce.job.id also had the vertex id appended, and no other
>> property value was equal to the original job id.
>>
>> Can you please advise how we can solve this issue? Is there a way to
>> get the original jobID when we're in part 1?
>>
>> Regards,
>> Shiri Marron
>> Amdocs
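The mismatch described above can be reproduced in isolation: under Tez the task-side ID string carries the extra vertex-id suffix, so the directory computed in close() no longer matches the one commitJob() looks for. A self-contained sketch, where the helper names are hypothetical and the ID values are the ones quoted in the mail:

```java
public class JobIdMismatch {
    // Hypothetical stand-ins for "part 1" (RecordWriter.close) and
    // "part 2" (OutputCommitter.commitJob) directory computation.
    static String taskSideDir(String jobIdFromTaskContext) {
        return "/tmp/myapp/" + jobIdFromTaskContext;
    }

    static String commitSideDir(String jobIdFromJobContext) {
        return "/tmp/myapp/" + jobIdFromJobContext;
    }

    public static void main(String[] args) {
        // Under MR both contexts report the same id; under Tez the
        // task-side id has the vertex id appended (here a trailing "0").
        String mrDir = taskSideDir("1440491467561");
        String tezDir = taskSideDir("14404914675610");
        String commitDir = commitSideDir("1440491467561");

        System.out.println(mrDir.equals(commitDir));  // true: MR commit finds the files
        System.out.println(tezDir.equals(commitDir)); // false: Tez commit misses them
    }
}
```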
