GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/4066
[SPARK-4879] [WIP] Use driver to coordinate Hadoop output committing
(This is a WIP commit so that Jenkins tests my code; I still need to add
tests and think through a few corner cases.)
I believe that Spark's SparkHadoopWriter is misusing Hadoop's
OutputCommitter: OutputCommitter.commitTask seems to assume that the AM /
driver has already coordinated which copy of a task may commit. Spark
currently performs no such coordination, which can lead to subtle bugs where
task output goes missing: redundant copies of a task are allowed to attempt to
commit their output after the job has completed and, due to some odd Hadoop
behaviors, this can lead to a completed job's output being deleted.
The fix here is to add some centralized coordination in the driver for
deciding which copy of a task is allowed to commit its task output to HDFS.
The architecture here is a little hacky, since it involves a new RPC from
SparkOutputCommitter directly to the DAGScheduler. The reason that we send the
message to DAGScheduler, as opposed to some other actor, is to ensure proper
ordering / interleaving with other events.
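As a rough illustration of the coordination described above, here is a minimal, language-neutral sketch written in Python. It is not the code in this pull request: the class name OutputCommitCoordinator, the methods can_commit and attempt_failed, and the helper commit_if_authorized are all hypothetical, and the "first attempt to ask wins" policy is an assumption made for the example. The real change would send the request to the driver over an RPC; here the call is a plain in-process method.

```python
import threading


class OutputCommitCoordinator:
    """Hypothetical driver-side arbiter (not Spark's actual API).

    Assumed policy: the first attempt that asks to commit a given
    (stage, partition) is authorized; every other attempt is denied.
    """

    def __init__(self):
        self._lock = threading.Lock()
        # (stage_id, partition_id) -> attempt_id that is allowed to commit
        self._authorized = {}

    def can_commit(self, stage_id, partition_id, attempt_id):
        """Return True only for the attempt that holds the authorization."""
        key = (stage_id, partition_id)
        with self._lock:
            # setdefault records this attempt as the winner if nobody has
            # asked yet, otherwise it returns the existing winner.
            winner = self._authorized.setdefault(key, attempt_id)
            return winner == attempt_id

    def attempt_failed(self, stage_id, partition_id, attempt_id):
        """Release the authorization if the failed attempt was holding it."""
        key = (stage_id, partition_id)
        with self._lock:
            if self._authorized.get(key) == attempt_id:
                del self._authorized[key]


def commit_if_authorized(coordinator, stage_id, partition_id, attempt_id,
                         commit, abort):
    """Task-side flow: ask the coordinator first, then commit or abort.

    `commit` and `abort` are caller-supplied callables standing in for the
    output committer's task-commit and task-abort steps.
    """
    if coordinator.can_commit(stage_id, partition_id, attempt_id):
        commit()
        return True
    abort()
    return False
```

With this shape, a redundant copy of a task that asks for permission after another copy has already been authorized gets a denial and aborts its own output instead of committing.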
See https://issues.apache.org/jira/browse/SPARK-4879 for full context.
I'll write a real commit message / description later (the problem is a little
subtle and it will take some work to come up with a nice, concise summary).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-4879-sparkhadoopwriter-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4066.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4066
----
commit dbfed0f81001ac8866f32e1d9edd20a449a8b7e9
Author: Josh Rosen <[email protected]>
Date: 2015-01-16T01:52:21Z
WIP commit towards fixing SPARK-4879
commit c25c9972d9878b91ddcbc9c9a32d5453f781191a
Author: Josh Rosen <[email protected]>
Date: 2015-01-16T01:52:52Z
Fix scalastyle issue
commit beba16e8bcba493b8de26b065794014b64d23f82
Author: Josh Rosen <[email protected]>
Date: 2015-01-16T07:30:01Z
Fix NPE for non-result tasks
commit 8c64d12d2e4f5b7b377cce0f49c941870958cdef
Author: Josh Rosen <[email protected]>
Date: 2015-01-16T07:33:25Z
Fix NPE for tasks that complete after stage
commit 63a7707cad01f4dcc2c74c4a6bffded9c887f9d4
Author: Josh Rosen <[email protected]>
Date: 2015-01-16T21:13:11Z
Fix DAGScheduler actor path; use more SparkConf retry settings.
----