+dev@pig, as this might be a question better answered by the Pig developers. This probably won't answer your question but should give you some background info. When Pig runs on Tez, it may end up running multiple DAGs within the same YARN application, so the "jobId" (in the MR case, the job ID maps to the YARN application ID) may not be unique. Furthermore, there are cases where multiple vertices within the same DAG write to HDFS, so both the dagId and the vertexId are required to guarantee uniqueness when writing to a common location.

thanks,
Hitesh
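To make the uniqueness point above concrete, here is a minimal sketch in plain Java (not Tez API code) of why a path keyed on the application ID alone can collide, while a path keyed on appId + dagId + vertexId stays unique. The directory layout and the helper name `outputDir` are hypothetical, chosen only for illustration.

```java
// Hypothetical sketch: composing per-vertex output paths. Two vertices in
// the same DAG, or two DAGs in the same YARN application, would collide on
// an appId-only path; adding dagId and vertexId keeps them distinct.
public class UniquePaths {
    static String outputDir(String appId, int dagId, int vertexId) {
        return String.format("/tmp/out/%s/dag_%d/vertex_%d", appId, dagId, vertexId);
    }

    public static void main(String[] args) {
        // appId built from the thread's example timestamp, purely illustrative
        String app = "application_1440491467561_0001";
        // Two vertices of the same DAG get distinct directories:
        System.out.println(outputDir(app, 1, 0));
        System.out.println(outputDir(app, 1, 1));
        // A second DAG in the same YARN application also stays separate:
        System.out.println(outputDir(app, 2, 0));
    }
}
```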
On Aug 25, 2015, at 7:29 AM, Shiri Marron <[email protected]> wrote:

> Hi,
>
> We are trying to run our existing workflows, which contain Pig scripts, on
> Tez (version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems
> when we run our code with Tez.
>
> In our code, we write to and read from a temp directory whose name is based
> on the jobID:
>
> Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close()
> we take the jobID from the TaskAttemptContext. That is, each task writes a
> file to this directory in the close() method according to the jobID from
> the context.
>
> Part 2 - At the end of the whole job (after all the tasks have completed),
> our custom OutputCommitter (which extends
> org.apache.hadoop.mapreduce.OutputCommitter) looks for that job's directory
> in commitJob() and handles all the files under it. The jobID here is taken
> from the JobContext: context.getJobID().toString().
>
> We noticed that when we use Tez, this mechanism doesn't work, since the
> jobID seen by the Tez task (part 1) is the original job ID with the vertex
> ID appended, for example 14404914675610 instead of 1440491467561. So the
> directory name in part 2 is different from the one in part 1.
>
> We looked for a way to retrieve only the vertex ID or only the job ID, but
> didn't find one - in the configuration, the property mapreduce.job.id also
> had the vertex ID appended, and no other property value was equal to the
> original job ID.
>
> Can you please advise how we can solve this issue? Is there a way to get
> the original jobID when we're in part 1?
>
> Regards,
> Shiri Marron
> Amdocs
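The mismatch described in the quoted mail can be re-created without Hadoop at all. The sketch below uses plain java.nio to model the two parts: the task side writes under the ID it sees (job ID with vertex ID appended), while the committer side looks up a directory named after the original job ID and finds nothing. The IDs are the ones from the thread; the directory layout is illustrative only.

```java
// Simplified re-creation (no Hadoop) of the directory-name mismatch:
// close() sees "14404914675610" (job ID + vertex ID), commitJob() sees
// "1440491467561" (original job ID), so the committer's lookup misses
// everything the tasks wrote.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class JobIdMismatch {
    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("tez-demo");

        String taskSideJobId = "14404914675610";  // as seen in close() on Tez
        String committerJobId = "1440491467561";  // as seen in commitJob()

        // Part 1: the task writes its output under the ID it sees.
        Path taskDir = base.resolve(taskSideJobId);
        Files.createDirectories(taskDir);
        Files.writeString(taskDir.resolve("part-00000"), "data");

        // Part 2: the committer looks for the directory named after its ID.
        Path commitDir = base.resolve(committerJobId);
        System.out.println("committer dir exists: " + Files.exists(commitDir));
    }
}
```

Running this prints `committer dir exists: false`, which is exactly the failure mode in the thread: the commit step finds an empty (in fact, nonexistent) directory even though every task wrote its file.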
