+Shiri

-----Original Message-----
From: Daniel Dai [mailto:[email protected]]
Sent: Wednesday, August 26, 2015 1:57 AM
To: [email protected]; [email protected]
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

JobID is vague in Tez; you should use dagId instead. However, I don't see a way 
you can get the dagId within RecordWriter/OutputCommitter. A possible workaround 
is to use conf.get("mapreduce.workflow.id") + 
conf.get("mapreduce.workflow.node.name"). Note that both are Pig-specific 
configuration properties and are only applicable if you run with Pig.
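
Something like this rough sketch (untested; the class name and the
fallback parameter are made up for illustration):

    import org.apache.hadoop.conf.Configuration;

    public final class TezSafeName {
        private TezSafeName() {}

        // Builds a name that is unique per Pig DAG node. Both properties
        // are set only by Pig, so fall back to the plain job id when they
        // are absent (e.g. when not running under Pig).
        public static String uniqueName(Configuration conf, String fallbackJobId) {
            String workflowId = conf.get("mapreduce.workflow.id");
            String nodeName = conf.get("mapreduce.workflow.node.name");
            if (workflowId == null || nodeName == null) {
                return fallbackJobId;
            }
            return workflowId + "_" + nodeName;
        }
    }

The underscore separator is arbitrary; any delimiter that cannot appear
in the ids will do.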

Daniel




On 8/25/15, 2:08 PM, "Hitesh Shah" <[email protected]> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>
>This probably won't answer your question but should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, so the "jobId" (in the case of MR,
>the job Id maps to the application Id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both dagId and vertexId are required to
>guarantee uniqueness when writing to a common location.
>
>thanks
>-- Hitesh
>
>
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <[email protected]> wrote:
>
>> Hi,
>>
>> We are trying to run our existing workflows, which contain Pig
>>scripts, on Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are
>>facing some problems when we run our code with Tez.
>>
>> In our code, we write to and read from a temp directory that we
>>create with a name based on the jobID:
>>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and
>>in close() we take the jobID from the TaskAttemptContext. That is,
>>each task writes a file to this directory in the close() method
>>according to the jobID from the context.
>>    Part 2 - At the end of the whole job (after all the tasks have
>>completed), our custom OutputCommitter (which extends
>>org.apache.hadoop.mapreduce.OutputCommitter) looks in commitJob() for
>>that job's directory and handles all the files under it - the jobID is
>>taken from JobContext context.getJobID().toString().
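>>
>> In condensed form, the two parts look roughly like this (a simplified
>>sketch - the base path, class names, and write logic are placeholders,
>>not our actual code):
>>
>>     import java.io.IOException;
>>     import java.nio.charset.StandardCharsets;
>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>     import org.apache.hadoop.fs.FileStatus;
>>     import org.apache.hadoop.fs.FileSystem;
>>     import org.apache.hadoop.fs.Path;
>>     import org.apache.hadoop.io.NullWritable;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapreduce.JobContext;
>>     import org.apache.hadoop.mapreduce.OutputCommitter;
>>     import org.apache.hadoop.mapreduce.RecordWriter;
>>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
>>
>>     // Part 1: each task writes its file under a temp dir named by the
>>     // jobID taken from the TaskAttemptContext in close().
>>     class OurRecordWriter extends RecordWriter<NullWritable, Text> {
>>         private final StringBuilder buffer = new StringBuilder();
>>
>>         @Override
>>         public void write(NullWritable key, Text value) {
>>             buffer.append(value).append('\n');
>>         }
>>
>>         @Override
>>         public void close(TaskAttemptContext context) throws IOException {
>>             // On MR this matches what commitJob() sees; on Tez it carries
>>             // the extra vertex id, which is the mismatch described below.
>>             String jobId = context.getJobID().toString();
>>             Path dir = new Path("/tmp/ourapp/" + jobId); // placeholder base path
>>             Path file = new Path(dir, context.getTaskAttemptID().toString());
>>             FileSystem fs = dir.getFileSystem(context.getConfiguration());
>>             try (FSDataOutputStream out = fs.create(file, true)) {
>>                 out.write(buffer.toString().getBytes(StandardCharsets.UTF_8));
>>             }
>>         }
>>     }
>>
>>     // Part 2: after all tasks complete, commitJob() resolves the same
>>     // directory from the JobContext's jobID and handles the files in it.
>>     class OurOutputCommitter extends OutputCommitter {
>>         @Override
>>         public void commitJob(JobContext context) throws IOException {
>>             String jobId = context.getJobID().toString(); // no vertex id here
>>             Path dir = new Path("/tmp/ourapp/" + jobId);
>>             FileSystem fs = dir.getFileSystem(context.getConfiguration());
>>             for (FileStatus stat : fs.listStatus(dir)) {
>>                 // application-specific handling of each task's file
>>             }
>>         }
>>
>>         // Remaining abstract methods stubbed out for brevity.
>>         @Override public void setupJob(JobContext c) {}
>>         @Override public void setupTask(TaskAttemptContext c) {}
>>         @Override public boolean needsTaskCommit(TaskAttemptContext c) { return false; }
>>         @Override public void commitTask(TaskAttemptContext c) {}
>>         @Override public void abortTask(TaskAttemptContext c) {}
>>     }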
>>
>>
>>
>> We noticed that when we use Tez, this mechanism doesn't work, since
>>the jobID from the Tez task (Part 1) is the original job id combined
>>with the vertex id - for example 14404914675610 instead of
>>1440491467561. So the directory name in Part 2 is different from the
>>one in Part 1.
>>
>>
>> We looked for a way to retrieve only the vertex id or only the job
>>id, but didn't find one - in the configuration, the property
>>mapreduce.job.id also had the vertex id appended, and no other
>>property value was equal to the original job id.
>>
>> Can you please advise how we can solve this issue? Is there a way to
>>get the original jobID when we're in Part 1?
>>
>> Regards,
>> Shiri Marron
>> Amdocs
>>
>
>


