JobID is vague in Tez; you should use dagId instead. However, I don't see a way to get the dagId from within a RecordWriter/OutputCommitter. A possible workaround is to use conf.get("mapreduce.workflow.id") + conf.get("mapreduce.workflow.node.name"). Note that both are Pig-specific configuration properties and are only set when you run with Pig.
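The suggested workaround might look roughly like this. This is a minimal sketch, not the actual code from the thread: it uses a plain Map in place of the Hadoop Configuration object, the property names are the ones quoted above, and the helper name and sample values are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class WorkflowTempDir {
    // Build a unique per-DAG temp directory name from the workflow
    // properties that Pig sets at submit time, as suggested above.
    // Returns null if either property is absent (i.e. the job was
    // not submitted by Pig).
    static String tempDirName(Map<String, String> conf) {
        String workflowId = conf.get("mapreduce.workflow.id");
        String nodeName = conf.get("mapreduce.workflow.node.name");
        if (workflowId == null || nodeName == null) {
            return null;
        }
        return workflowId + "_" + nodeName;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Illustrative values; in a real run Pig populates these.
        conf.put("mapreduce.workflow.id", "pig_1440491467561");
        conf.put("mapreduce.workflow.node.name", "scope-12");
        System.out.println(tempDirName(conf)); // pig_1440491467561_scope-12
    }
}
```

In a real RecordWriter or OutputCommitter the same two `conf.get(...)` calls would run against the job's Configuration, so both the task side and the commit side derive the same directory name.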
Daniel

On 8/25/15, 2:08 PM, "Hitesh Shah" <[email protected]> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question but should give you some
>background info. When Pig uses Tez, it may end up running multiple dags
>within the same YARN application, therefore the "jobId" (in the case of
>MR, the job id maps to the application id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same DAG
>could write to HDFS, hence both dagId and vertexId are required to
>guarantee uniqueness when writing to a common location.
>
>thanks
>— Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <[email protected]> wrote:
>
>> Hi,
>>
>> We are trying to run our existing workflows that contain pig scripts
>> on tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some
>> problems when we run our code with tez.
>>
>> In our code, we write to and read from a temp directory which we
>> create with a name based on the jobID:
>> Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in
>> close() we take the jobID from the TaskAttemptContext context.
>> Meaning, each task writes a file to this directory in the close()
>> method according to the jobID from the context.
>> Part 2 - At the end of the whole job (after all the tasks have
>> completed), our custom outputCommitter (which extends
>> org.apache.hadoop.mapreduce.OutputCommitter) looks, in commitJob(),
>> for that directory of the job and handles all the files under it.
>> The jobID is taken from JobContext context.getJobID().toString().
>>
>> We noticed that when we use tez, this mechanism doesn't work, since
>> the jobID from the tez task (part 1) is the original job id combined
>> with the vertex id, for example 14404914675610 instead of
>> 1440491467561. So the directory name in part 2 is different than in
>> part 1.
>>
>> We looked for a way to retrieve only the vertex id or only the job
>> id, but didn't find one - in the configuration, the property
>> mapreduce.job.id also had the vertex id appended, and no other
>> property value was equal to the original job id.
>>
>> Can you please advise how we can solve this issue? Is there a way to
>> get the original jobID when we're in part 1?
>>
>> Regards,
>> Shiri Marron
>> Amdocs
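The mismatch described above can be reproduced in isolation: under Tez the task-side ID string carries the extra vertex-id suffix, so the directory computed in close() no longer matches the one commitJob() looks for. A self-contained sketch, where the helper names are hypothetical and the ID values are the ones quoted in the mail:

```java
public class JobIdMismatch {
    // Hypothetical stand-ins for "part 1" (RecordWriter.close) and
    // "part 2" (OutputCommitter.commitJob) directory computation.
    static String taskSideDir(String jobIdFromTaskContext) {
        return "/tmp/myapp/" + jobIdFromTaskContext;
    }

    static String commitSideDir(String jobIdFromJobContext) {
        return "/tmp/myapp/" + jobIdFromJobContext;
    }

    public static void main(String[] args) {
        // Under MR both contexts report the same id; under Tez the
        // task-side id has the vertex id appended (here a trailing "0").
        String mrDir = taskSideDir("1440491467561");
        String tezDir = taskSideDir("14404914675610");
        String commitDir = commitSideDir("1440491467561");

        System.out.println(mrDir.equals(commitDir));  // true: MR commit finds the files
        System.out.println(tezDir.equals(commitDir)); // false: Tez commit misses them
    }
}
```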
