[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

JoshRosen Mon, 02 Feb 2015 23:18:07 -0800

GitHub user JoshRosen reopened a pull request:

    https://github.com/apache/spark/pull/4066


    [SPARK-4879] [WIP] Use driver to coordinate Hadoop output committing

    (This is a WIP commit so that Jenkins tests my code; I still need to add
    tests and think through a few corner cases.)
    
    I believe that Spark's SparkHadoopWriter is misusing Hadoop's
    OutputCommitter: OutputCommitter.commitTask seems to assume that
    coordination has been performed via the AM / Driver; our current lack of
    coordination can lead to subtle bugs where task output is missing because
    redundant copies of tasks are allowed to attempt to commit their output
    after a job has completed (due to some odd Hadoop behaviors, this can lead
    to a completed job's output being deleted).
    
    The fix here is to add some centralized coordination in the driver for
    deciding which copy of a task is allowed to commit its task output to HDFS.
    The architecture here is a little hacky, since it involves a new RPC from
    SparkOutputCommitter directly to the DAGScheduler.  The reason that we send
    the message to DAGScheduler, as opposed to some other actor, is to ensure
    proper ordering / interleaving with other events.
    
    See https://issues.apache.org/jira/browse/SPARK-4879 for full context.
    I'll write a real commit message / description later (the problem is a
    little subtle and it will take some work to come up with a nice, concise
    summary).
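
As a rough illustration of the driver-side coordination described in the pull request above, the sketch below models a "first attempt to ask wins" arbiter. It is not the code from this pull request: the class `CommitCoordinator`, the helper `try_commit`, and the identifiers `stage_id`, `partition_id`, and `attempt_id` are all hypothetical names chosen for the example, the RPC to the driver is replaced by a plain method call, and the rule that a failed winner releases its permission is an assumption rather than something the description states.

```python
class CommitCoordinator:
    """Hypothetical driver-side arbiter (not Spark's actual class).

    At most one attempt per (stage, partition) is allowed to commit.
    A real driver would have to serialize these calls, for example by
    handling them on a single event loop; this sketch is single-threaded.
    """

    def __init__(self):
        # Maps (stage_id, partition_id) -> attempt_id holding the permission.
        self._authorized = {}

    def can_commit(self, stage_id, partition_id, attempt_id):
        key = (stage_id, partition_id)
        # The first attempt to ask is recorded as the winner; later askers
        # get back the existing winner and are therefore denied.
        winner = self._authorized.setdefault(key, attempt_id)
        return winner == attempt_id

    def attempt_failed(self, stage_id, partition_id, attempt_id):
        # Assumption: if the authorized attempt fails before committing,
        # its permission is released so that another copy can commit.
        key = (stage_id, partition_id)
        if self._authorized.get(key) == attempt_id:
            del self._authorized[key]


def try_commit(coordinator, stage_id, partition_id, attempt_id, committed_output):
    """Stand-in for the task-side commit step.

    committed_output is a dict standing in for the final output location;
    it records which attempt's output was committed for each partition.
    """
    if coordinator.can_commit(stage_id, partition_id, attempt_id):
        committed_output[(stage_id, partition_id)] = attempt_id
        return True
    # Denied: the attempt should discard its temporary output, not commit it.
    return False
```

With this model, two speculative copies of the same partition can both finish, but only the first one to ask the coordinator has its output recorded; the second is told not to commit and leaves the committed output untouched.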

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-4879-sparkhadoopwriter-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4066
    
----
commit dbfed0f81001ac8866f32e1d9edd20a449a8b7e9
Author: Josh Rosen <[email protected]>
Date:   2015-01-16T01:52:21Z

    WIP commit towards fixing SPARK-4879

commit c25c9972d9878b91ddcbc9c9a32d5453f781191a
Author: Josh Rosen <[email protected]>
Date:   2015-01-16T01:52:52Z

    Fix scalastyle issue

commit beba16e8bcba493b8de26b065794014b64d23f82
Author: Josh Rosen <[email protected]>
Date:   2015-01-16T07:30:01Z

    Fix NPE for non-result tasks

commit 8c64d12d2e4f5b7b377cce0f49c941870958cdef
Author: Josh Rosen <[email protected]>
Date:   2015-01-16T07:33:25Z

    Fix NPE for tasks that complete after stage

commit 63a7707cad01f4dcc2c74c4a6bffded9c887f9d4
Author: Josh Rosen <[email protected]>
Date:   2015-01-16T21:13:11Z

    Fix DAGScheduler actor path; use more SparkConf retry settings.

----


