[
https://issues.apache.org/jira/browse/HADOOP-15107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596784#comment-16596784
]
Aaron Fabbri commented on HADOOP-15107:
---------------------------------------
+1 on the v4 patch. Code all looks good.
{noformat}
+    } else {
+      LOG.warn("Using standard FileOutputCommitter to commit work."
+          + " This is slow and potentially unsafe.");
+      return createFileOutputCommitter(outputPath, context);{noformat}
Good call, I like it.
On the docs changes, just some random questions:
{noformat}
```python
def recoverTask(tac):
  oldAttemptId = appAttemptId - 1
{noformat}
Interesting. Does a new job attempt always get the previous attempt id + 1?
(I don't know how those are allocated.)
The mergePathsV1 pseudocode seems pretty straightforward; not sure why the
actual code is so complicated. Your representation is fairly intuitive:
overwrite anything that already exists in the destination, but do it
recursively, so you don't just nuke directories that already exist there,
instead descending and removing destination conflicts (files) as they arise.
Special case if src is a file but dest is a dir (nuke dest).
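For my own notes, that recursive merge can be sketched as a toy model on the local filesystem (standing in for Hadoop's FileSystem/Path API, so `os`/`shutil` calls here are my substitution, not the real code):

```python
import os
import shutil

def merge_paths_v1(src, dest):
    """Toy model of the v1 mergePaths algorithm on local paths."""
    if os.path.isfile(src):
        # Source is a file: any destination entry (file or dir) is a
        # conflict, so remove it and rename the file into place.
        if os.path.isdir(dest):
            shutil.rmtree(dest)
        elif os.path.exists(dest):
            os.remove(dest)
        shutil.move(src, dest)
    else:
        if os.path.isfile(dest):
            # Source is a dir but dest is a file: remove the file.
            os.remove(dest)
        if not os.path.exists(dest):
            # No conflict: move the whole directory in one rename.
            shutil.move(src, dest)
        else:
            # Both are directories: descend and merge child by child,
            # so existing destination content survives.
            for child in os.listdir(src):
                merge_paths_v1(os.path.join(src, child),
                               os.path.join(dest, child))
```

The key property is the last branch: an existing destination directory is never deleted wholesale, only its conflicting files.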
{noformat}
### v2 Job Recovery Before `commitJob()`
Because the data has been renamed into the destination directory, all tasks
recorded as having been committed need no recovery at all:
```python
def recoverTask(tac):
```
All active and queued tasks are scheduled for execution.
There is a weakness here, the same one on a failure during `commitTask()`:
it is only safe to repeat a task which failed during that commit operation
if the name of all generated files are constant across all task attempts.
If the Job AM fails while a task attempt has been instructed to commit,
and that commit is not recorded as having completed, the state of that
in-progress task is unknown...really it isn't safe to recover the
job at this point.
{noformat}
Interesting. What happens in this case? Is it detected? Do we get duplicate
data in the final job (re-attempt) output?
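To make the filename-constancy condition above concrete, here's a toy model of a v2-style task commit (all names invented; local files stand in for the destination store). With a constant name, re-running the possibly-committed task just overwrites; with an attempt-qualified name, both copies survive, i.e. duplicate data:

```python
import os
import tempfile

def commit_task_v2(dest, task_id, attempt_id, constant_names=True):
    """Toy v2 task commit: write output straight into the destination."""
    if constant_names:
        name = "part-%05d" % task_id                    # same every attempt
    else:
        name = "part-%05d-%d" % (task_id, attempt_id)   # varies per attempt
    with open(os.path.join(dest, name), "w") as f:
        f.write("attempt %d" % attempt_id)

# Attempt 0 may or may not have committed before the AM failed;
# the recovered job simply re-runs the task as attempt 1.
safe = tempfile.mkdtemp()
commit_task_v2(safe, 0, 0)
commit_task_v2(safe, 0, 1)       # overwrites: one output file

unsafe = tempfile.mkdtemp()
commit_task_v2(unsafe, 0, 0, constant_names=False)
commit_task_v2(unsafe, 0, 1, constant_names=False)   # two files remain
```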
> Stabilize/tune S3A committers; review correctness & docs
> --------------------------------------------------------
>
> Key: HADOOP-15107
> URL: https://issues.apache.org/jira/browse/HADOOP-15107
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.1.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Blocker
> Attachments: HADOOP-15107-001.patch, HADOOP-15107-002.patch,
> HADOOP-15107-003.patch, HADOOP-15107-004.patch
>
>
> I'm writing a paper on the committers, one which, being a proper
> paper, requires me to show that the committers work.
> # define the requirements of a "Correct" committed job (this applies to the
> FileOutputCommitter too)
> # show that the Staging committer meets these requirements (most of this is
> implicit in that it uses the V1 FileOutputCommitter to marshall .pendingset
> lists from committed tasks to the final destination, where they are read and
> committed).
> # Show the magic committer also works.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]