[jira] [Comment Edited] (CRUNCH-481) Support independent output committers for multiple outputs

Ryan Brush (JIRA) Thu, 05 Feb 2015 13:36:02 -0800

    [ 
https://issues.apache.org/jira/browse/CRUNCH-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308034#comment-14308034
 ]


Ryan Brush edited comment on CRUNCH-481 at 2/5/15 9:35 PM:
-----------------------------------------------------------

The exception I got above occurred because Kite's output committer uses the 
job ID to name its temporary staging area, so multiple outputs from the same 
job collided. (I'm not very familiar with the committer logic, but for some 
reason this wasn't exposed when running against Hadoop 1.)

I've attached a patch that works around this by "decorating" the ID in the Job 
instance that is fabricated for each output with the output name itself. So the 
job IDs seen by the output format would be job_12345_out0, job_12345_out1, 
and so on. This avoids the name collision and works with both Hadoop 1 and 2 
builds. All Crunch tests pass as well.
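
To make the collision and the workaround concrete, here is a minimal sketch in plain Java. It is not the attached patch and does not use the Hadoop, Crunch, or Kite APIs; the class name, the helper methods, and the /tmp/staging path are all hypothetical, chosen only to illustrate how a staging location derived from the job ID alone collides across outputs, and how appending the output name keeps the locations apart.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StagingIdSketch {

    // Hypothetical stand-in for a committer that derives its staging
    // directory from the job ID alone (the path is illustrative only).
    static String stagingDir(String jobId) {
        return "/tmp/staging/" + jobId;
    }

    // Decorate the job ID with the output name, e.g. job_12345_out0.
    static String decorate(String jobId, String outputName) {
        return jobId + "_" + outputName;
    }

    // Count the distinct staging directories used by all outputs of one job.
    // Fewer directories than outputs means at least two outputs collide.
    static int distinctStagingDirs(String jobId, List<String> outputs, boolean decorated) {
        Set<String> dirs = new LinkedHashSet<>();
        for (String out : outputs) {
            String id = decorated ? decorate(jobId, out) : jobId;
            dirs.add(stagingDir(id));
        }
        return dirs.size();
    }

    public static void main(String[] args) {
        List<String> outputs = Arrays.asList("out0", "out1");
        // Undecorated: both outputs share one staging directory.
        System.out.println(distinctStagingDirs("job_12345", outputs, false));
        // Decorated: each output gets its own staging directory.
        System.out.println(distinctStagingDirs("job_12345", outputs, true));
    }
}
```

The sketch assumes output names are unique within a job; if two outputs shared a name, the decorated IDs would collide again.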

Is this a good approach? The alternative would be to change Kite to use 
something besides the job ID for its temporary output location.


was (Author: rbrush):
The exception I got above was caused by the fact that Kite's output committer 
uses the job ID for a temporary staging area, and when using multiple outputs 
with the same name, they collided. (I'm not very familiar with the committer 
logic, but for some reason this wasn't exposed when running against Hadoop 1.)

I've attached a patch that works around this by "decorating" the ID in the Job 
instance that is fabricated for each output with the output name itself. So the 
job names seen by the output format would be job_12345_out0, job_12345_out1, 
and so on. This avoids the name collision and works with both Hadoop 1 and 2 
builds. All Crunch tests pass as well.

Is this a good approach? The alternative would be to change Kite to use 
something besides the job ID for its temporary output location.

> Support independent output committers for multiple outputs
> ----------------------------------------------------------
>
>                 Key: CRUNCH-481
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-481
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Aniket Kulkarni
>            Assignee: Josh Wills
>            Priority: Minor
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-481-hadoop-2-compat.patch, CRUNCH-481.patch, 
> CRUNCH-481.patch, CRUNCH-481.patch, CRUNCH-481c.patch
>
>
> I faced this issue while trying to write to Kite and HDFS in the same 
> pipeline. A similar issue was logged for Kite[1][2]. 
> I was attempting to write a PCollection to Kite and a different PTable to 
> HDFS as a text file. The write to Kite succeeded; however, the write to HDFS 
> only produced a _SUCCESS file with no text file.
> Commenting out the write to Kite seems to solve the issue and I can see the 
> text file being written.
> [1] - https://issues.cloudera.org/browse/CDK-756
> [2] - 
> http://mail-archives.apache.org/mod_mbox/crunch-dev/201401.mbox/%3ccaf-wd4qcue0toh3qewpdnnom3u786pvjlgh7t6go_abctpl...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
