GitHub user sharkdtu opened a pull request:

    https://github.com/apache/spark/pull/16912

    [SPARK-19576] [Core] Task attempt paths exist in output path after 
saveAsNewAPIHadoopFile completes with speculation enabled

    `writeShard` in `saveAsNewAPIHadoopDataset` always commits its tasks 
unconditionally. When speculation is enabled, this can result in multiple 
task attempts committing their output to the same path, which may leave 
temporary task attempt paths in the output path after 
`saveAsNewAPIHadoopFile` completes.
    
    ```
    -rw-r--r--    3   user group       0   2017-02-11 19:36 hdfs://.../output/_SUCCESS
    drwxr-xr-x    -   user group       0   2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_000044_0
    -rw-r--r--    3   user group    8952   2017-02-11 19:36 hdfs://.../output/part-r-00000
    -rw-r--r--    3   user group    7878   2017-02-11 19:36 hdfs://.../output/part-r-00001
    ```
    Assume two attempts of the same task commit at the same time. Both may 
try to rename their task attempt paths to the task committed path 
concurrently. When one attempt's `rename` completes first, the other attempt's 
`rename` moves its task attempt path under the task committed path, which 
by then already exists.
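
    The race described above can be sketched with a toy in-memory model. 
This is only an illustration, not the Hadoop `FileSystem` API: `ToyFs`, 
`RenameRaceDemo`, and the shortened path names are hypothetical, and the 
`rename` behaviour (a source renamed onto an existing directory lands inside 
it) is an assumption about HDFS-like semantics.

```scala
import scala.collection.mutable

// Toy in-memory file system that only tracks directory paths.
// Hypothetical model for illustration, not the Hadoop FileSystem API.
final class ToyFs {
  private val dirs = mutable.Set[String]()

  def mkdirs(path: String): Unit = dirs += path
  def exists(path: String): Boolean = dirs.contains(path)
  def list: Seq[String] = dirs.toSeq.sorted

  // Assumed HDFS-like semantics: if dst already exists as a directory,
  // src is moved *under* dst instead of replacing it.
  def rename(src: String, dst: String): Unit = {
    val target = if (exists(dst)) s"$dst/${src.split('/').last}" else dst
    dirs -= src
    dirs += target
  }
}

object RenameRaceDemo {
  def run(): Seq[String] = {
    val fs = new ToyFs
    // Hypothetical, shortened path names.
    val committed = "/output/_temporary/0/task_000044"
    val attempt0 = "/output/_temporary/0/_temporary/attempt_000044_0"
    val attempt1 = "/output/_temporary/0/_temporary/attempt_000044_1"
    fs.mkdirs(attempt0)
    fs.mkdirs(attempt1)

    // Both attempts check before either one renames,
    // so both conclude that nothing has been committed yet.
    val seenBy0 = fs.exists(committed)
    val seenBy1 = fs.exists(committed)
    require(!seenBy0 && !seenBy1)

    fs.rename(attempt1, committed) // wins: becomes the committed path
    fs.rename(attempt0, committed) // loses: ends up nested under it
    fs.list
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

    In this model the losing attempt's directory survives as a child of the 
task committed path, which would match the stray `attempt_...` directory in 
the listing above once the job commit moves the committed contents into the 
output path.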
    
    In any case, `writeShard` in `saveAsNewAPIHadoopDataset` should not 
commit its tasks unconditionally. A similar problem, SPARK-4879, triggered by 
calling `saveAsHadoopFile`, has already been fixed, and the latest master has 
fixed this one too. This PR only fixes branch 2.1.
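
    As a rough sketch of what "not committing unconditionally" could look 
like, the example below lets only one attempt per partition commit. It is 
not the code in this PR or Spark's actual mechanism: `ToyCommitCoordinator`, 
`CoordinatedCommitDemo`, and the "first attempt to ask wins" policy are 
assumptions made for illustration.

```scala
import scala.collection.concurrent.TrieMap
import scala.collection.mutable.ArrayBuffer

// Hypothetical coordinator: the first attempt that asks to commit a
// partition is authorized; every other attempt for that partition is denied.
final class ToyCommitCoordinator {
  private val winners = TrieMap.empty[Int, Int] // partition -> winning attempt

  def canCommit(partition: Int, attempt: Int): Boolean =
    // putIfAbsent is atomic: it returns None for the first caller and
    // Some(existingWinner) for every later caller.
    winners.putIfAbsent(partition, attempt).forall(_ == attempt)
}

object CoordinatedCommitDemo {
  // Runs doCommit only if this attempt is authorized; returns whether it ran.
  def commitTask(c: ToyCommitCoordinator, partition: Int, attempt: Int)(
      doCommit: => Unit): Boolean =
    if (c.canCommit(partition, attempt)) { doCommit; true } else false

  def run(): Seq[Int] = {
    val coordinator = new ToyCommitCoordinator
    val committed = ArrayBuffer[Int]()
    // Two speculative attempts (0 and 1) of the same partition 44.
    for (attempt <- Seq(0, 1)) {
      commitTask(coordinator, 44, attempt) { committed += attempt; () }
    }
    committed.toSeq
  }

  def main(args: Array[String]): Unit =
    println(run().mkString("committed attempts: ", ", ", ""))
}
```

    With this guard the second attempt never reaches its `rename`, so the 
nested task attempt path cannot be created by a concurrent commit.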


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sharkdtu/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16912.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16912
    
----
commit 6f41b90583c585414b99fe716377d0576499de8d
Author: sharkdtu <[email protected]>
Date:   2017-02-13T11:46:48Z

    Task attempt paths exist in output path after saveAsNewAPIHadoopFile 
completes with speculation enabled

----



