[jira] [Commented] (TEZ-2864) Vertex group output commit overwrites without failing on conflict

Jason Lowe (JIRA) Thu, 08 Oct 2015 13:07:48 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949296#comment-14949296
 ]


Jason Lowe commented on TEZ-2864:
---------------------------------

Is this a Tez issue? I'm not sure how Tez is supposed to know about, and 
somehow fix, the specific internals of the output format and output committer, 
which are arbitrary user code. It seems to me this specific instance is an 
artifact of the way FileOutputCommitter works: it resolves output pathname 
conflicts by choosing one file rather than failing. That behavior of 
FileOutputCommitter has been there since the 0.20 days, so if we decide to 
change it we may need a config option to restore the original behavior.
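
To make the two behaviors concrete, here is a minimal sketch using the local 
filesystem and java.nio.file. It is not FileOutputCommitter's code; the class 
name, the commitFile helper, and the failOnConflict flag are all hypothetical 
and stand in for the kind of config option mentioned above.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CommitConflictDemo {
    // Hypothetical helper: moves a pending file to its final name.
    // failOnConflict is an assumed option, not an existing Hadoop/Tez setting.
    static void commitFile(Path pending, Path dest, boolean failOnConflict)
            throws IOException {
        if (failOnConflict) {
            // Without REPLACE_EXISTING, Files.move throws
            // FileAlreadyExistsException when dest already exists.
            Files.move(pending, dest);
        } else {
            // Overwrite semantics: the later commit wins silently.
            Files.move(pending, dest, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-demo");
        Path dest = dir.resolve("part-00000");

        Path a = Files.write(dir.resolve("pending-a"), "from vertex A".getBytes());
        Path b = Files.write(dir.resolve("pending-b"), "from vertex B".getBytes());
        commitFile(a, dest, false);
        commitFile(b, dest, false); // vertex A's data is silently lost here
        System.out.println(new String(Files.readAllBytes(dest)));

        Path c = Files.write(dir.resolve("pending-c"), "from vertex C".getBytes());
        try {
            commitFile(c, dest, true);
        } catch (FileAlreadyExistsException e) {
            System.out.println("conflict detected for " + e.getFile());
        }
    }
}
```

With the flag off, the second commit replaces the first without any error, 
which is the data loss described in the issue. With it on, the conflict 
surfaces as an exception that the caller could turn into a job failure.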

In general this appears to be a problem with Tez trying to reuse output formats 
and committers that may assume MapReduce semantics. MapReduce output formats 
have been able to assume, safely, that no two tasks will have the same task 
number. Tez invalidates that assumption. The only way I can see that Tez can 
generically support unmodified MapReduce output formats is to guarantee that no 
two tasks have the same task number, even if they belong to different vertices. 
Besides probably being complicated to implement internally, this would also 
have some unfortunate side effects, such as gaps in the numbering of output 
pathnames and more difficulty tracking down which vertex task generated a 
particular output.
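
The collision can be illustrated with a small sketch. The two naming helpers 
below are hypothetical simplifications: plainName depends only on the task 
number, as a MapReduce output format might assume, while qualifiedName adds a 
vertex and output index modeled loosely on the part-v000-o000 prefix from the 
issue description. The exact filename layouts are assumptions, not the real 
Hadoop or Tez formats.

```java
import java.util.HashSet;
import java.util.Set;

public class PartNameCollision {
    // Hypothetical MapReduce-style name: unique only per task number.
    static String plainName(int taskNumber) {
        return String.format("part-%05d", taskNumber);
    }

    // Hypothetical name qualified by vertex and output index.
    static String qualifiedName(int vertexIndex, int outputIndex, int taskNumber) {
        return String.format("part-v%03d-o%03d-%05d",
                vertexIndex, outputIndex, taskNumber);
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>();
        Set<String> qualified = new HashSet<>();
        // Two vertices, each with tasks numbered 0..2, writing to one directory.
        for (int vertex = 0; vertex < 2; vertex++) {
            for (int task = 0; task < 3; task++) {
                plain.add(plainName(task));
                qualified.add(qualifiedName(vertex, 0, task));
            }
        }
        System.out.println("distinct plain names: " + plain.size());         // prints 3
        System.out.println("distinct qualified names: " + qualified.size()); // prints 6
    }
}
```

Six tasks produce only three distinct plain names, so half of the outputs 
would land on a path another task also uses. An output format that hardcodes 
the plain form runs into exactly this when two vertices share a directory.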

> Vertex group output commit overwrites without failing on conflict
> -----------------------------------------------------------------
>
>                 Key: TEZ-2864
>                 URL: https://issues.apache.org/jira/browse/TEZ-2864
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>
> We encountered PIG-4649 with HCatStorer, where it was hardcoding the part 
> file name and not honoring mapreduce.output.basename. If two vertex groups 
> were writing to the same output directory, one overwrote the other because 
> the file names were the same without the part-v000-o000 prefix. Tez should 
> fail the job in that case instead of silently losing data.



