GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/6648

    [SPARK-8029][core][wip] shuffleoutput per attempt

    https://issues.apache.org/jira/browse/SPARK-8029
    
    This is a proof of concept that addresses the issue by making each `ShuffleMapTask` attempt write to a different location.  `ShuffleBlockId` is extended to include the stage attempt id, so the fetch side knows which files to read.  `MapStatus` also includes the stage attempt, so there is now one `MapStatus` per `(executor, attempt)` rather than one per `executor`.  This makes little difference when a stage has only one attempt.  In a pathological case you would end up with one `MapStatus` per partition, which would be **much** worse, but that is very unlikely.
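
As a rough illustration of the block-id change described above, here is a minimal sketch, not the code in this patch. The class name `AttemptShuffleBlockId`, its field order, and the `shuffle_<shuffle>_<map>_<reduce>_<attempt>` name format are all assumptions made for the example. The point is only that once the stage attempt id is part of the block name, two attempts of the same map task can never resolve to the same file.

```scala
// Hypothetical sketch: a shuffle block id that carries the stage attempt id.
// The names and the string format are illustrative assumptions, not Spark's API.
final case class AttemptShuffleBlockId(
    shuffleId: Int,
    mapId: Int,
    reduceId: Int,
    stageAttemptId: Int) {

  // The attempt id is part of the name, so each attempt maps to its own file.
  def name: String =
    s"shuffle_${shuffleId}_${mapId}_${reduceId}_${stageAttemptId}"
}

object AttemptShuffleBlockId {
  // Scala's Regex extractor only matches when the whole string fits the pattern.
  private val Pattern = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)_([0-9]+)".r

  // The fetch side can recover the attempt id from the block name.
  def parse(name: String): Option[AttemptShuffleBlockId] = name match {
    case Pattern(s, m, r, a) =>
      Some(AttemptShuffleBlockId(s.toInt, m.toInt, r.toInt, a.toInt))
    case _ => None
  }
}
```

With this shape, `AttemptShuffleBlockId(1, 2, 3, 0)` and `AttemptShuffleBlockId(1, 2, 3, 1)` produce different names, which is the property the fetch side relies on to pick the right attempt's output.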
    
    I still need to add more unit tests (I've put in some placeholder TODOs), but I think this is ready for design / architecture level review.  It does pass my "integration" test.  (And when I include my changes to the DAGScheduler, it passes with the stricter criteria.)
    
    cc @JoshRosen 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK_8029_shuffleoutput_per_attempt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6648.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6648
    
----
commit d08c20cd1fbb22bb5db191db3d4616e5ed8b6f52
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T00:49:27Z

    tasks know which stageAttempt they belong to

commit 89e8428db2441258597e3962905da6317912cc12
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T03:54:57Z

    reproduce the failure

commit 70a787be6e55605365d84490e0d2072d4c7f5143
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T04:13:23Z

    ignore fetch failure from attempts that are already failed.  only a partial fix, still have some concurrent attempts

commit 7fbcefbdb466daca0f492966492e4d7710247810
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T04:15:11Z

    ignore the test for now just to avoid swamping jenkins

commit 2eebbf214bb534c5bd7ebaa9de2be7b3471a497b
Author: Imran Rashid <[email protected]>
Date:   2015-05-07T04:50:44Z

    style

commit 7142242547074604c1a7ef9dd701473dd40d4693
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T15:52:30Z

    more rigorous test case

commit ccaa159a0d1a449acf271daa028a53228a879a4c
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T15:53:09Z

    index file needs to handle cases when data file already exist, and the actual data is in the middle of it

commit 3585b968f40817feb15c7056e70a4f83a6891012
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T16:14:08Z

    pare down the unit test

commit de235303c4e708323bbdcf2b6bbc003ad230fbe6
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T16:27:49Z

    SparkIllegalStateException if we ever have multiple concurrent attempts for the same stage

commit c91ee10e166f07828bed66146a2bbdd28633fb34
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T16:32:39Z

    better unit test

commit 05c72fda875a08ee3077ad232796cdf3d54d815f
Author: Imran Rashid <[email protected]>
Date:   2015-05-12T22:51:46Z

    handle more cases from bad ordering of task attempt completion

commit 5dc5436b1071e3b01229549a16ba830f05b278b4
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T00:47:49Z

    Merge branch 'master' into SPARK-7308_fix
    
    Conflicts:
        core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
        core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala

commit 37eece86b0ef00fe6408963ffe5a8f40f7ec6990
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T00:48:26Z

    cleanup imports

commit 31c21fae51e1aa304e4e53c7ef36d1e1689abdee
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T01:26:50Z

    style

commit a894be11d2d22841017d82676a66088f0d053dcf
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T13:42:05Z

    include all missing mapIds in error msg

commit 93592b1554ed8091c93a3f58cfd2951cb4cac088
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T13:44:38Z

    update existing test since we now do more resubmitting than before

commit ea2d9720b26e687ddde777a83c6a6b11db663af2
Author: Imran Rashid <[email protected]>
Date:   2015-05-13T17:54:52Z

    style

commit de0a596911bfec15fe1d13070791a1d69979ec0f
Author: Imran Rashid <[email protected]>
Date:   2015-05-21T19:52:21Z

    Merge branch 'master' into SPARK-7308_fix
    
    Conflicts:
        core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala

commit 6654c538cf83126d9ae0de7e39cc19829bfa6fba
Author: Imran Rashid <[email protected]>
Date:   2015-05-21T20:02:52Z

    fixes from merge

commit dd2839d1aceef3ca038c0af5cb2b69955d10cc8f
Author: Imran Rashid <[email protected]>
Date:   2015-05-21T21:54:31Z

    better fix from merge

commit e68492812d3384baefe29efa0a32fcca1fab12e6
Author: Imran Rashid <[email protected]>
Date:   2015-05-29T18:13:45Z

    shuffle map output writes to a different file per attempt (main compiles, tests do not)

commit 25234317978445b8bd10b6d6230000bfd7814267
Author: Imran Rashid <[email protected]>
Date:   2015-05-29T18:27:25Z

    tests compile

commit 4d976f4fdf64b1a97a9675e8bf3f5eb4a67da74c
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T18:18:15Z

    avoid NPE in finally block

commit 2b723fd002fe45be0d6f9e6d87f91fad9a9c0961
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T18:19:45Z

    use case class for result of mapOutputTracker.getServerStatus

commit fd40a93de16be76639cb2c4a89e60a0103e58677
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T18:19:59Z

    fix tests

commit b5d8ec5e00ef946580c92b2e43574caa72ffa0b5
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T18:30:55Z

    Merge branch 'master' into SPARK_8029_shuffleoutput_per_attempt
    
    Conflicts:
        core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala

commit 9f01d7ea510f9ed4e42ec5eafb847dfef5f24aed
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T19:13:23Z

    style

commit fae9c0c604dd55ee3d0fefdadb40b632554e4b85
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T19:22:07Z

    style

commit 06daceb2e027d64b975b29c2dac7385942fa8101
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T21:56:55Z

    make ContextCleanerSuite pass ... though maybe the test is pointless

commit cd16ee869e6374931b6dfbb90f33044f4579b7a7
Author: Imran Rashid <[email protected]>
Date:   2015-06-02T22:06:48Z

    fix tests

----


