GitHub user sharkdtu opened a pull request:
https://github.com/apache/spark/pull/16912
[SPARK-19576] [Core] Task attempt paths exist in output path after
saveAsNewAPIHadoopFile completes with speculation enabled
`writeShard` in `saveAsNewAPIHadoopDataset` always commits its tasks
without question. With speculation enabled, this can result in multiple
attempts of the same task committing their output to the same path, which
may leave temporary task attempt paths in the output path after
`saveAsNewAPIHadoopFile` completes.
```
-rw-r--r-- 3 user group    0 2017-02-11 19:36 hdfs://.../output/_SUCCESS
drwxr-xr-x - user group    0 2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_000044_0
-rw-r--r-- 3 user group 8952 2017-02-11 19:36 hdfs://.../output/part-r-00000
-rw-r--r-- 3 user group 7878 2017-02-11 19:36 hdfs://.../output/part-r-00001
```
Assume two attempts of the same task commit at the same time. Both attempts
may rename their task attempt paths to the task committed path concurrently.
When one attempt's `rename` operation completes first, the other attempt's
`rename` operation moves its task attempt path under the task committed path.
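The race above can be illustrated with a toy in-memory model. This is only a sketch: it does not use the Hadoop `FileSystem` API, the paths are made up, and it assumes the rename semantics described here (renaming onto an existing directory moves the source underneath it).

```scala
import scala.collection.mutable

// Toy in-memory model of directory rename; NOT the Hadoop FileSystem API.
object RenameRaceSketch {
  // Directory paths that currently exist in the model.
  private val dirs = mutable.Set[String]()

  // Assumed semantics: if dst does not exist, src becomes dst;
  // if dst is an existing directory, src is moved underneath it.
  def rename(src: String, dst: String): String = {
    val name = src.substring(src.lastIndexOf('/') + 1)
    val target = if (dirs.contains(dst)) s"$dst/$name" else dst
    dirs -= src
    dirs += target
    target
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical paths, chosen only for illustration.
    val committed = "/out/_temporary/0/task_000044"
    val attempt0 = "/out/_temporary/0/_temporary/attempt_000044_0"
    val attempt1 = "/out/_temporary/0/_temporary/attempt_000044_1"
    dirs ++= Seq(attempt0, attempt1)

    // Both attempts start committing before either rename has finished.
    println(rename(attempt1, committed)) // /out/_temporary/0/task_000044
    println(rename(attempt0, committed)) // /out/_temporary/0/task_000044/attempt_000044_0
  }
}
```

In this model the second rename nests the losing attempt's directory inside the committed task path, which is consistent with the stray `attempt_...` directory in the listing.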
In any case, `writeShard` in `saveAsNewAPIHadoopDataset` should not commit
its tasks without question. A similar issue, SPARK-4879, triggered by calling
`saveAsHadoopFile`, has already been fixed, and the latest master has fixed
this one too. This PR only fixes branch-2.1.
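For readers unfamiliar with the idea, the following is a minimal sketch of committing only when authorized, rather than unconditionally. It is not the code in this PR or in Spark: `CommitCoordinatorSketch`, `canCommit` and `finishTask` are hypothetical names, and the coordinator here is a plain in-process object.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a commit coordinator; names are illustrative only.
class CommitCoordinatorSketch {
  // partition -> attempt number that was authorized to commit
  private val authorized = mutable.Map[Int, Int]()

  // The first attempt to ask for a partition wins; later attempts are denied.
  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    authorized.getOrElseUpdate(partition, attemptNumber) == attemptNumber
  }
}

object CommitCoordinatorDemo {
  // Commit only when authorized; otherwise the attempt should abort
  // and clean up its own temporary output.
  def finishTask(c: CommitCoordinatorSketch, partition: Int, attempt: Int): String =
    if (c.canCommit(partition, attempt)) s"attempt $attempt committed"
    else s"attempt $attempt aborted"

  def main(args: Array[String]): Unit = {
    val c = new CommitCoordinatorSketch
    println(finishTask(c, 44, 1)) // attempt 1 committed
    println(finishTask(c, 44, 0)) // attempt 0 aborted
  }
}
```

With a gate like this, at most one attempt per partition reaches the rename step, so the concurrent rename described above cannot occur for that partition.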
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sharkdtu/spark branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16912.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16912
----
commit 6f41b90583c585414b99fe716377d0576499de8d
Author: sharkdtu <[email protected]>
Date: 2017-02-13T11:46:48Z
Task attempt paths exist in output path after saveAsNewAPIHadoopFile
completes with speculation enabled
----