[ 
https://issues.apache.org/jira/browse/TEZ-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874491#comment-13874491
 ] 

Bikas Saha commented on TEZ-728:
--------------------------------

Will change the javadoc. MROutputCommitter calls the underlying 
OutputCommitter.abortJob(). FileOutputCommitter doesn't really support aborting 
after committing, since that wasn't an expected use case.

I spoke to Hitesh about abortOutput() semantics. We agreed that abort could be 
called multiple times. Only commit is guaranteed to be called once.
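
A minimal sketch of what that contract could look like from the committer 
side. SketchCommitter, GuardedCommitter and CountingCommitter are hypothetical 
names used only for illustration; they are not the Tez or Hadoop API. The idea 
is that repeated abort calls are harmless, while a second commit is rejected.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an output committer; not the real Tez interface.
interface SketchCommitter {
    void commitOutput();
    void abortOutput();
}

// Wrapper that tolerates repeated abort calls and rejects a second commit.
class GuardedCommitter implements SketchCommitter {
    private final SketchCommitter delegate;
    private final AtomicBoolean committed = new AtomicBoolean(false);
    private final AtomicBoolean aborted = new AtomicBoolean(false);

    GuardedCommitter(SketchCommitter delegate) {
        this.delegate = delegate;
    }

    @Override
    public void commitOutput() {
        // Commit is expected exactly once, so a second call is a caller bug.
        if (!committed.compareAndSet(false, true)) {
            throw new IllegalStateException("commit must be called at most once");
        }
        delegate.commitOutput();
    }

    @Override
    public void abortOutput() {
        // Abort may arrive several times; only the first call reaches the delegate.
        if (aborted.compareAndSet(false, true)) {
            delegate.abortOutput();
        }
    }
}

// Test double that counts how often each method reaches the delegate.
class CountingCommitter implements SketchCommitter {
    final AtomicInteger commits = new AtomicInteger();
    final AtomicInteger aborts = new AtomicInteger();

    @Override
    public void commitOutput() { commits.incrementAndGet(); }

    @Override
    public void abortOutput() { aborts.incrementAndGet(); }
}
```

A real committer could equally make abortOutput() itself idempotent instead of 
using a wrapper; the wrapper just keeps the sketch short.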

Initialization happens at the beginning, so that should be fine.

I don't think commit needs to be tied to vertex completion, since it's a final 
output visibility concept. A vertex output may not even have a committer, and 
the vertex would still be successful. For the all-or-none case, output 
visibility is determined by DAG success and not by the success of the tasks 
that produced the output. For the partial output case, the vertex will be 
successful after commit, so anyone monitoring that vertex is also fine. IMO 
we are fine as of now.
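
To make the two cases concrete, here is a rough sketch of the per-vertex 
decision at DAG completion, following the issue description. DagCommitPlanner, 
VertexResult and the allowPartialOutput flag are made-up names for 
illustration, and the sketch assumes the decision is taken once at the end of 
the DAG; the patch may instead commit at vertex completion in the partial 
output case.

```java
// Illustrative only: names and structure are assumptions, not the Tez code.
class DagCommitPlanner {
    enum Action { COMMIT, ABORT, NONE }

    // Hypothetical summary of one vertex when the DAG finishes.
    static final class VertexResult {
        final String name;
        final boolean succeeded;
        final boolean hasCommitter;

        VertexResult(String name, boolean succeeded, boolean hasCommitter) {
            this.name = name;
            this.succeeded = succeeded;
            this.hasCommitter = hasCommitter;
        }
    }

    static Action decide(VertexResult vertex, boolean dagSucceeded,
                         boolean allowPartialOutput) {
        // A vertex output may not have a committer at all.
        if (!vertex.hasCommitter) {
            return Action.NONE;
        }
        // DAG success makes every output visible.
        if (dagSucceeded) {
            return Action.COMMIT;
        }
        // DAG failure with partial visibility: keep output of successful vertices.
        if (allowPartialOutput && vertex.succeeded) {
            return Action.COMMIT;
        }
        // DAG failure otherwise: no output is visible.
        return Action.ABORT;
    }
}
```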

It's checking whether successful vertex outputs need to be committed. If 
successfulOutputsAlreadyCommitted is true, then they have already been 
committed (the commit on vertex commit case), so there is nothing to do. If 
committedOutputs.contains(committer), then this was committed in the previous 
if() block, so there is nothing to do. Actually, this entire block can be 
removed as it is dead code.
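
For reference, the intent of those two checks can be sketched as below. This 
is not the code from the patch: CommitOnce and commitAll are hypothetical, and 
Runnable stands in for a committer. Only the names 
successfulOutputsAlreadyCommitted and committedOutputs come from the code 
under review. Tracking committers by identity also covers the case where 
several vertices share one output, so that commit is called once on it.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Illustrative only: Runnable stands in for an output committer.
class CommitOnce {
    // Returns the number of commit calls actually made.
    static int commitAll(List<Runnable> committers,
                         boolean successfulOutputsAlreadyCommitted) {
        // Outputs were already committed when their vertices succeeded.
        if (successfulOutputsAlreadyCommitted) {
            return 0;
        }
        // Identity-based set: a committer shared by several vertices is
        // committed only the first time it is seen.
        Set<Runnable> committedOutputs =
            Collections.newSetFromMap(new IdentityHashMap<Runnable, Boolean>());
        int calls = 0;
        for (Runnable committer : committers) {
            if (committedOutputs.add(committer)) {
                committer.run();
                calls++;
            }
        }
        return calls;
    }
}
```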

For partial commits the threading will be needed. At the end of a DAG it's 
probably not needed, but it's easy to do since the body of the for loop needs 
to be in a thread.
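
Roughly, moving the loop body into a thread could look like the sketch below. 
AsyncCommit and commitInBackground are hypothetical names, Runnable again 
stands in for a committer, and the use of a single-threaded ExecutorService is 
an assumption rather than what the patch does. The point is only that each 
commit runs off the calling thread and the caller can wait for completion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: runs each commit on a background thread.
class AsyncCommit {
    // Returns the number of commits that completed.
    static int commitInBackground(List<Runnable> committers) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Runnable committer : committers) {
                // The former loop body now runs on the pool thread.
                pending.add(pool.submit(committer));
            }
            int done = 0;
            for (Future<?> result : pending) {
                result.get(); // wait; rethrows a commit failure as ExecutionException
                done++;
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while committing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("commit failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```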

> Semantics of output commit
> --------------------------
>
>                 Key: TEZ-728
>                 URL: https://issues.apache.org/jira/browse/TEZ-728
>             Project: Apache Tez
>          Issue Type: Task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-728.1.patch
>
>
> Currently, vertices commit outputs when they succeed. However, if the job 
> fails then these outputs are not aborted.
> After speaking to Pig and Hive folks, both allow optional partial visibility 
> semantics. So if there are 2 vertices writing output and one of them (A) 
> passes and the other fails. Based on a user flag, Pig and Hive allow the 
> partial output of vertex A to be visible or not. So we need to support 
> 1) DAG fails - no output is visible
> 2) DAG fails - partial output is visible
> In order to support this, we could move output commit to DAG completion. If 
> the DAG succeeds, commit will be called on all output committers. If the DAG 
> fails, then abort will be called on all output committers. Optionally, if the 
> DAG fails then commit will be called on all successful vertices and abort 
> will be called on all failed vertices.
> This will also help the case when multiple vertices are writing to the same 
> output (union store). The DAG can call commit once on that output and ensure 
> correct commit semantics according to the commit API.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
