+Shiri

-----Original Message-----
From: Daniel Dai [mailto:[email protected]]
Sent: Wednesday, August 26, 2015 1:57 AM
To: [email protected]; [email protected]
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez
JobID is vague in Tez; you should use the dagId instead. However, I don't see a way you can get the dagId within RecordWriter/OutputCommitter. A possible solution is to use conf.get("mapreduce.workflow.id") + conf.get("mapreduce.workflow.node.name"). Note that both are Pig-specific configuration properties and only apply if you run with Pig.

Daniel

On 8/25/15, 2:08 PM, "Hitesh Shah" <[email protected]> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question but should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, so the "jobId" (in the MR case, the
>job id maps to the application id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both dagId and vertexId are required to
>guarantee uniqueness when writing to a common location.
>
>thanks
>Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <[email protected]> wrote:
>
>> Hi,
>>
>> We are trying to run our existing workflows, which contain Pig
>> scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are
>> facing some problems when we run our code with Tez.
>>
>> In our code, we write to and read from a temp directory which we
>> create with a name based on the jobID:
>> Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in
>> close() we take the jobID from the TaskAttemptContext. Each task
>> writes a file to this directory in the close() method according to
>> the jobID from the context.
>> Part 2 - At the end of the whole job (after all the tasks have
>> completed), our custom OutputCommitter (which extends
>> org.apache.hadoop.mapreduce.OutputCommitter) looks for that job's
>> directory in commitJob() and handles all the files under it - here
>> the jobID is taken from the JobContext: context.getJobID().toString()
>>
>> We noticed that when we use Tez, this mechanism doesn't work, since
>> the jobID seen by the Tez task (part 1) is the original job id
>> combined with the vertex id, for example 14404914675610 instead of
>> 1440491467561, so the directory name in part 2 is different from the
>> one in part 1.
>>
>> We looked for a way to retrieve only the vertex id or only the job
>> id, but didn't find one - in the configuration, the property
>> mapreduce.job.id also had the vertex id appended, and no other
>> property value was equal to the original job id.
>>
>> Can you please advise how we can solve this issue? Is there a way to
>> get the original jobID when we're in part 1?
>>
>> Regards,
>> Shiri Marron
>> Amdocs
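A minimal sketch of Daniel's suggestion follows; the class and method names (TempDirNaming, buildTempDirName), the "_" separator, and the fallback to mapreduce.job.id are placeholders for illustration, not anything Pig or Tez provides:

    import org.apache.hadoop.conf.Configuration;

    public final class TempDirNaming {
        private TempDirNaming() {}

        // Builds a directory name that comes out the same in
        // RecordWriter.close() (via context.getConfiguration()) and in
        // OutputCommitter.commitJob() (via jobContext.getConfiguration()),
        // whether the Pig script runs on MR or on Tez. Both properties are
        // set by Pig, so this only works for Pig-launched jobs.
        public static String buildTempDirName(Configuration conf) {
            String workflowId = conf.get("mapreduce.workflow.id");        // Pig-specific
            String nodeName   = conf.get("mapreduce.workflow.node.name");  // Pig-specific
            if (workflowId != null && nodeName != null) {
                return workflowId + "_" + nodeName;
            }
            // Assumed fallback for non-Pig launches: the plain MR job id.
            return conf.get("mapreduce.job.id");
        }
    }

With a helper like this, part 1 would call buildTempDirName(context.getConfiguration()) inside close(), and part 2 would call the same method on jobContext.getConfiguration() inside commitJob(), so both sides resolve the same directory without depending on context.getJobID().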
