Hi,

We are trying to run our existing workflows, which contain Pig scripts, on Tez 
(version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are running into a problem when we 
run our code with Tez.

In our code, we write to and read from a temp directory whose name is based on 
the job ID:

    Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in its 
close() method we take the job ID from the TaskAttemptContext. That is, each 
task writes a file to this directory in close(), naming the directory after the 
job ID from the context.
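Roughly, the writer side looks like the sketch below (simplified; the class 
name, TMP_ROOT, and the file contents are placeholders, not our real code):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class TempDirRecordWriter extends RecordWriter<Object, Object> {
        private static final String TMP_ROOT = "/tmp/our-app"; // placeholder root

        @Override
        public void write(Object key, Object value) {
            // ... buffer the record (omitted) ...
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            // The temp directory name is derived from the job ID in the task context.
            String jobId = context.getJobID().toString();
            Path dir = new Path(TMP_ROOT, jobId);
            FileSystem fs = dir.getFileSystem(context.getConfiguration());
            fs.mkdirs(dir);
            // Each task writes its own file under the job's directory.
            Path file = new Path(dir, context.getTaskAttemptID().toString());
            fs.create(file).close(); // real code writes the task's data here
        }
    }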
    Part 2 - At the end of the whole job (after all the tasks have completed), 
our custom OutputCommitter (which extends 
org.apache.hadoop.mapreduce.OutputCommitter) looks for that job's directory in 
commitJob() and handles all the files under it. There, the job ID is taken from 
the JobContext via context.getJobID().toString().



We noticed that when we use Tez, this mechanism doesn't work: the job ID seen 
by the Tez task (Part 1) is the original job ID with the vertex ID appended, 
for example 14404914675610 instead of 1440491467561. As a result, the directory 
name in Part 2 is different from the one used in Part 1.


We looked for a way to retrieve only the vertex ID or only the original job ID, 
but didn't find one: in the configuration, the mapreduce.job.id property also 
has the vertex ID appended, and no other property value was equal to the 
original job ID.
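This is roughly how we checked the configuration (a diagnostic sketch only; the 
class and method names are placeholders, and originalId would be the example 
value 1440491467561):

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    public class JobIdScan {
        // Prints mapreduce.job.id and any property whose value contains the
        // original (pre-Tez) job ID we expected to find.
        public static void scan(Configuration conf, String originalId) {
            System.out.println("mapreduce.job.id = " + conf.get("mapreduce.job.id"));
            for (Map.Entry<String, String> entry : conf) {
                if (entry.getValue() != null && entry.getValue().contains(originalId)) {
                    System.out.println(entry.getKey() + " = " + entry.getValue());
                }
            }
        }
    }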

Can you please advise how we can solve this issue? Is there a way to get the 
original job ID while we are in Part 1?

Regards,
Shiri Marron
Amdocs

