[
https://issues.apache.org/jira/browse/TEZ-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949296#comment-14949296
]
Jason Lowe commented on TEZ-2864:
---------------------------------
Is this a Tez issue? I'm not sure how Tez is supposed to know about, and
somehow fix, the specific internals of the output format and output committer,
which are arbitrary user code. It seems to me this specific instance is an
artifact of the way FileOutputCommitter works: it resolves output pathname
conflicts by choosing one file rather than failing. That behavior of FileOutputCommitter has
been there since the 0.20 days, so we may need a config option to restore the
original behavior if we decide to change it.
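To make the two behaviors concrete, here is a minimal sketch using only the standard java.nio.file API. It is not FileOutputCommitter code: the class name, the helper methods, and the failOnConflict switch are all hypothetical, and it only contrasts "last writer wins" with "fail on conflict" when two writers produce the same file name.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; this is not FileOutputCommitter code.
class CommitConflictSketch {

    // Moves one task output file into the final output directory.
    // failOnConflict is a made-up switch, not an existing Hadoop or Tez setting.
    static void commitFile(Path taskFile, Path outputDir, boolean failOnConflict)
            throws IOException {
        Path target = outputDir.resolve(taskFile.getFileName());
        if (failOnConflict) {
            // Without REPLACE_EXISTING, Files.move throws
            // FileAlreadyExistsException when the target already exists.
            Files.move(taskFile, target);
        } else {
            // Last writer wins: the earlier file is silently replaced.
            Files.move(taskFile, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Writes a file named part-00000 into a fresh temporary directory,
    // standing in for one task's pending output.
    static Path writeTaskOutput(String vertex, String content) throws IOException {
        Path dir = Files.createTempDirectory(vertex);
        return Files.writeString(dir.resolve("part-00000"), content);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("final-output");

        // Two writers with the same file name, conflicts resolved by overwriting.
        commitFile(writeTaskOutput("v0", "rows from vertex 0"), out, false);
        commitFile(writeTaskOutput("v1", "rows from vertex 1"), out, false);
        // Only the second writer's data is left at this point.
        System.out.println(Files.readString(out.resolve("part-00000")));

        // Same situation with the strict mode: the commit is rejected.
        try {
            commitFile(writeTaskOutput("v2", "rows from vertex 2"), out, true);
        } catch (FileAlreadyExistsException e) {
            System.out.println("conflict detected: " + e.getFile());
        }
    }
}
```

A boolean like this is roughly what a config option to keep the old behavior could look like, though where such a check would actually live is an open question.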
In general this appears to be a problem with Tez trying to reuse output formats
and committers that may assume MapReduce semantics. MapReduce output formats
have been able to assume, safely, that no two tasks will have the same task
number. Tez invalidates that assumption. The only way I can see that Tez can
generically support unmodified MapReduce output formats is to guarantee that no
two tasks have the same task number, even if they belong to different vertices.
Besides probably being complicated to implement internally, this would also
have some unfortunate side effects: output pathnames could have gaps in the
numbering, and it would be more difficult to track down which vertex task
generated a particular output.
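As a rough sketch of what unique task numbers across vertices could look like, the following gives each vertex a block of numbers starting where the previous vertex ended. This is a hypothetical scheme for illustration, not how Tez assigns task numbers; the class and method names are invented, and the "part-%05d" pattern is assumed here only to mimic the usual part file naming.

```java
// Hypothetical numbering scheme for illustration; Tez does not work this way.
class GlobalTaskNumbers {
    // offsets[i] is the first global task number reserved for vertex i.
    private final int[] offsets;

    // taskCounts[i] is the number of tasks in vertex i.
    GlobalTaskNumbers(int[] taskCounts) {
        offsets = new int[taskCounts.length];
        int next = 0;
        for (int i = 0; i < taskCounts.length; i++) {
            offsets[i] = next;
            next += taskCounts[i];
        }
    }

    // Maps a vertex-local task index to a number unique across all vertices.
    int globalNumber(int vertex, int localTask) {
        return offsets[vertex] + localTask;
    }

    // Builds a part file name from a task number (assumed naming pattern).
    static String partName(int taskNumber) {
        return String.format("part-%05d", taskNumber);
    }

    public static void main(String[] args) {
        // Vertex 0 has 3 tasks, vertex 1 has 2 tasks.
        GlobalTaskNumbers numbers = new GlobalTaskNumbers(new int[] {3, 2});
        for (int task = 0; task < 2; task++) {
            // Prints part-00003, then part-00004.
            System.out.println(partName(numbers.globalNumber(1, task)));
        }
    }
}
```

If only vertex 1 writes to a given directory, its files start at part-00003 and nothing in the name says which vertex produced them, which is the gap and traceability problem described above.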
> Vertex group output commit overwrites without failing on conflict
> -----------------------------------------------------------------
>
> Key: TEZ-2864
> URL: https://issues.apache.org/jira/browse/TEZ-2864
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
>
> We encountered PIG-4649 with HCatStorer, where it was hardcoding the part
> file name and not honoring mapreduce.output.basename. If two vertex groups
> wrote to the same output directory, one overwrote the other because the file
> names were identical, lacking the part-v000-o000 prefix. Tez should fail the
> job in that case instead of silently losing data.