Hi,
We are trying to run our existing workflows, which contain Pig scripts, on Tez
(version 0.5.2.2.2.6.0-2800, HDP 2.2), but we are facing some problems when we
run our code with Tez.
In our code, we write to and read from a temp directory which we create with a
name based on the jobID:
Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and in close() we
take the jobID from the TaskAttemptContext. That is, each task writes a file
to this directory in its close() method, named according to the jobID from
the context.
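
For illustration, here is a simplified sketch of part 1 (the class name, the
key/value types, and the /tmp/ourapp base path are placeholders, not our real
code):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class JobIdDirRecordWriter extends RecordWriter<Text, Text> {

        @Override
        public void write(Text key, Text value) {
            // ... collect the record for this task ...
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            // The temp directory name is derived from the jobID the task
            // sees; under Tez this id comes back with the vertex id appended.
            String jobId = context.getJobID().toString();
            Path taskFile = new Path("/tmp/ourapp/" + jobId + "/"
                    + context.getTaskAttemptID().getTaskID().toString());
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (FSDataOutputStream out = fs.create(taskFile)) {
                // ... write this task's data ...
            }
        }
    }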
Part 2 - At the end of the whole job (after all the tasks have completed), our
custom OutputCommitter (which extends
org.apache.hadoop.mapreduce.OutputCommitter) runs, and in commitJob() it looks
for that job's directory and handles all the files under it. Here the jobID is
taken from the JobContext: context.getJobID().toString().
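
And a simplified sketch of part 2, with the same placeholder names (the
remaining OutputCommitter methods are left as no-ops here just to keep the
sketch short):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class JobIdDirOutputCommitter extends OutputCommitter {

        @Override
        public void commitJob(JobContext context) throws IOException {
            // Rebuild the same directory name from the job-level context;
            // on MR this matches part 1, on Tez it does not.
            String jobId = context.getJobID().toString();
            Path jobDir = new Path("/tmp/ourapp/" + jobId);
            FileSystem fs = FileSystem.get(context.getConfiguration());
            for (FileStatus file : fs.listStatus(jobDir)) {
                // ... handle each task's file under jobDir ...
            }
        }

        @Override public void setupJob(JobContext context) { }
        @Override public void setupTask(TaskAttemptContext context) { }
        @Override public boolean needsTaskCommit(TaskAttemptContext context) { return false; }
        @Override public void commitTask(TaskAttemptContext context) { }
        @Override public void abortTask(TaskAttemptContext context) { }
    }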
We noticed that when we use Tez, this mechanism doesn't work, since the jobID
seen by the Tez task (part 1) is a combination of the original job id and the
vertex id, for example 14404914675610 instead of 1440491467561. So the
directory name in part 2 is different from the one in part 1.
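
With the example IDs above, the two parts end up looking at different paths
(illustrative names, using our placeholder base directory):

    part 1 (task side):      /tmp/ourapp/job_14404914675610_0001/...
    part 2 (committer side): /tmp/ourapp/job_1440491467561_0001/...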
We looked for a way to retrieve only the vertex id or only the job id, but
didn't find one - in the configuration, the property mapreduce.job.id also had
the vertex id appended, and no other property value was equal to the original
job id.
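
For example, a check along these lines on the task side (a sketch of what we
tried) also returned the combined id under Tez:

    // inside close(), with the TaskAttemptContext in scope
    String confJobId = context.getConfiguration().get("mapreduce.job.id");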
Can you please advise how we can solve this issue? Is there a way to get the
original jobID when we're in part 1?
Regards,
Shiri Marron
Amdocs