Re: HCatalog and Crunch named outputs

Josh Wills Wed, 25 Oct 2017 22:00:28 -0700

I'll need to look more closely at the code tomorrow to confirm, but IIRC,
that comment was wrong, and it's safe to proceed with this change.


Josh
On Wed, Oct 25, 2017 at 1:57 PM Stephen Durfey <sjdur...@gmail.com> wrote:

> I've recently taken up the work on CRUNCH-340 [1] to get a functioning
> source and target for HCatalog. One of the issues I've run into is that
> named outputs are added to the JobID, which then makes its way into the
> TaskAttemptID. The stack trace is below.
>
> The issue is that the named output (e.g. 'out0') becomes part of the
> TaskAttemptID. When the HCat output committer tries to map between
> o.a.h.mapreduce.TaskAttemptID and o.a.h.mapred.TaskAttemptID [2], it fails
> because TaskAttemptID.forName expects the id to have exactly 6 parts,
> separated by underscores, and with the named output it has 7. If I stop
> setting the named output on the JobID, then everything works fine [3].
>
> However, I am hesitant to make that change. In the version of the code I am
> working against (0.11.x at the moment) there is a comment stating that
> certain output formats rely on the named output being set on the job id.
> In the latest version of the code in master, that comment has been removed.
> I'm curious whether the comment was removed because it is no longer true,
> in which case it is safe to remove the named output from the job id, or
> whether there is a better/more preferred way to handle the exception below.
>
>
> Error: java.lang.IllegalArgumentException: TaskAttemptId string : attempt_1508401628996_out0_16350_m_000000_0 is not properly formed
>     at org.apache.hadoop.mapreduce.TaskAttemptID.forName(TaskAttemptID.java:201)
>     at org.apache.hadoop.mapred.TaskAttemptID.forName(TaskAttemptID.java:129)
>     at org.apache.hive.hcatalog.mapreduce.HCatMapRedUtil.createTaskAttemptContext(HCatMapRedUtil.java:35)
>     at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.setupTask(FileOutputCommitterContainer.java:172)
>     at org.apache.crunch.io.CrunchOutputs$CompositeOutputCommitter.setupTask(CrunchOutputs.java:334)
>     at org.apache.hadoop.mapred.Task.initialize(Task.java:582)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>
>
>
> [1] https://issues.apache.org/jira/browse/CRUNCH-340
> [2] https://github.com/cloudera/hive/blob/cdh5.13.0-release/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatMapRedUtil.java#L34
> [3] https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/CrunchOutputs.java#L230
>
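
For anyone following along, the part-count mismatch described in the quoted
message can be reproduced without Hadoop on the classpath. The sketch below
is not Hadoop's parser: it only mirrors the "6 underscore-separated parts"
expectation stated above. The class name TaskAttemptIdShape and the helper
stripNamedOutput are hypothetical, made up for illustration, and the helper
is not the change made in CrunchOutputs.

```java
import java.util.regex.Pattern;

// Illustrative only: mimics the field-count expectation described in the
// thread, not the actual logic of org.apache.hadoop.mapreduce.TaskAttemptID.
class TaskAttemptIdShape {
    // attempt_<timestamp>_<jobNumber>_<m|r>_<taskNumber>_<attemptNumber>
    static final int EXPECTED_PARTS = 6;

    // Number of underscore-separated fields in the id string.
    static int countParts(String id) {
        return id.split("_").length;
    }

    // True when the id has the prefix and field count the thread describes.
    static boolean hasExpectedShape(String id) {
        return id.startsWith("attempt_") && countParts(id) == EXPECTED_PARTS;
    }

    // Hypothetical helper: drop the first "_<namedOutput>_" segment from the id.
    static String stripNamedOutput(String id, String namedOutput) {
        return id.replaceFirst("_" + Pattern.quote(namedOutput) + "_", "_");
    }

    public static void main(String[] args) {
        // The id from the stack trace, with the named output 'out0' embedded.
        String failing = "attempt_1508401628996_out0_16350_m_000000_0";
        String stripped = stripNamedOutput(failing, "out0");

        System.out.println(countParts(failing));        // prints 7
        System.out.println(hasExpectedShape(failing));  // prints false
        System.out.println(stripped);                   // prints attempt_1508401628996_16350_m_000000_0
        System.out.println(hasExpectedShape(stripped)); // prints true
    }
}
```

The extra 'out0' field is what pushes the count from 6 to 7, which is
consistent with the "is not properly formed" error in the trace.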
