GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/6648
[SPARK-8029][core][wip] shuffleoutput per attempt
https://issues.apache.org/jira/browse/SPARK-8029
This is a proof of concept for solving the issue by making each
`ShuffleMapTask` attempt write to a different location. `ShuffleBlockId` is
extended to include the stage attempt id, so the fetch side knows which files
to read from. `MapStatus` also includes the stage attempt, so now there is one
`MapStatus` per `(executor, attempt)` as opposed to one per `executor`. This
won't really matter when there is just one attempt per stage. In a
pathological case, you'd end up with one `MapStatus` per partition, which would
be **much** worse, but that is very unlikely.
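
To illustrate the idea of putting the stage attempt into the block id, here is a minimal sketch. It is not the code from this PR: `AttemptShuffleBlockId`, its field names, and the exact name format are hypothetical, chosen only to show how including the attempt id could give each attempt its own output name.

```scala
// Hypothetical sketch; not the PR's actual code and not Spark's BlockId API.
final case class AttemptShuffleBlockId(
    shuffleId: Int,
    mapId: Int,
    reduceId: Int,
    stageAttemptId: Int) {
  // The attempt id is part of the name, so two attempts of the same map task
  // never resolve to the same name (and hence not to the same file).
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}_${stageAttemptId}"
}

object AttemptShuffleBlockIdDemo {
  def main(args: Array[String]): Unit = {
    val first = AttemptShuffleBlockId(shuffleId = 0, mapId = 3, reduceId = 1, stageAttemptId = 0)
    val retry = first.copy(stageAttemptId = 1)
    println(first.name) // shuffle_0_3_1_0
    println(retry.name) // shuffle_0_3_1_1
  }
}
```

Under this assumption, the fetch side would need to learn the attempt id from the map output metadata before it can build the name to request.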
I still need to add more unit tests (I've put in some placeholder
TODOs), but I think this is ready for design/architecture-level review. It
does pass my "integration" test. (And when I include my changes to the
DAGScheduler, it passes with the stricter criteria.)
cc @JoshRosen
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark SPARK_8029_shuffleoutput_per_attempt
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6648.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6648
----
commit d08c20cd1fbb22bb5db191db3d4616e5ed8b6f52
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T00:49:27Z
tasks know which stageAttempt they belong to
commit 89e8428db2441258597e3962905da6317912cc12
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T03:54:57Z
reproduce the failure
commit 70a787be6e55605365d84490e0d2072d4c7f5143
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T04:13:23Z
ignore fetch failure from attempts that are already failed. only a partial
fix, still have some concurrent attempts
commit 7fbcefbdb466daca0f492966492e4d7710247810
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T04:15:11Z
ignore the test for now just to avoid swamping jenkins
commit 2eebbf214bb534c5bd7ebaa9de2be7b3471a497b
Author: Imran Rashid <[email protected]>
Date: 2015-05-07T04:50:44Z
style
commit 7142242547074604c1a7ef9dd701473dd40d4693
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T15:52:30Z
more rigorous test case
commit ccaa159a0d1a449acf271daa028a53228a879a4c
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T15:53:09Z
index file needs to handle cases when data file already exist, and the
actual data is in the middle of it
commit 3585b968f40817feb15c7056e70a4f83a6891012
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T16:14:08Z
pare down the unit test
commit de235303c4e708323bbdcf2b6bbc003ad230fbe6
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T16:27:49Z
SparkIllegalStateException if we ever have multiple concurrent attempts for
the same stage
commit c91ee10e166f07828bed66146a2bbdd28633fb34
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T16:32:39Z
better unit test
commit 05c72fda875a08ee3077ad232796cdf3d54d815f
Author: Imran Rashid <[email protected]>
Date: 2015-05-12T22:51:46Z
handle more cases from bad ordering of task attempt completion
commit 5dc5436b1071e3b01229549a16ba830f05b278b4
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T00:47:49Z
Merge branch 'master' into SPARK-7308_fix
Conflicts:
core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
commit 37eece86b0ef00fe6408963ffe5a8f40f7ec6990
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T00:48:26Z
cleanup imports
commit 31c21fae51e1aa304e4e53c7ef36d1e1689abdee
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T01:26:50Z
style
commit a894be11d2d22841017d82676a66088f0d053dcf
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T13:42:05Z
include all missing mapIds in error msg
commit 93592b1554ed8091c93a3f58cfd2951cb4cac088
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T13:44:38Z
update existing test since we now do more resubmitting than before
commit ea2d9720b26e687ddde777a83c6a6b11db663af2
Author: Imran Rashid <[email protected]>
Date: 2015-05-13T17:54:52Z
style
commit de0a596911bfec15fe1d13070791a1d69979ec0f
Author: Imran Rashid <[email protected]>
Date: 2015-05-21T19:52:21Z
Merge branch 'master' into SPARK-7308_fix
Conflicts:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
commit 6654c538cf83126d9ae0de7e39cc19829bfa6fba
Author: Imran Rashid <[email protected]>
Date: 2015-05-21T20:02:52Z
fixes from merge
commit dd2839d1aceef3ca038c0af5cb2b69955d10cc8f
Author: Imran Rashid <[email protected]>
Date: 2015-05-21T21:54:31Z
better fix from merge
commit e68492812d3384baefe29efa0a32fcca1fab12e6
Author: Imran Rashid <[email protected]>
Date: 2015-05-29T18:13:45Z
shuffle map output writes to a different file per attempt (main compiles,
tests do not)
commit 25234317978445b8bd10b6d6230000bfd7814267
Author: Imran Rashid <[email protected]>
Date: 2015-05-29T18:27:25Z
tests compile
commit 4d976f4fdf64b1a97a9675e8bf3f5eb4a67da74c
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T18:18:15Z
avoid NPE in finally block
commit 2b723fd002fe45be0d6f9e6d87f91fad9a9c0961
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T18:19:45Z
use case class for result of mapOutputTracker.getServerStatus
commit fd40a93de16be76639cb2c4a89e60a0103e58677
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T18:19:59Z
fix tests
commit b5d8ec5e00ef946580c92b2e43574caa72ffa0b5
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T18:30:55Z
Merge branch 'master' into SPARK_8029_shuffleoutput_per_attempt
Conflicts:
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala
commit 9f01d7ea510f9ed4e42ec5eafb847dfef5f24aed
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T19:13:23Z
style
commit fae9c0c604dd55ee3d0fefdadb40b632554e4b85
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T19:22:07Z
style
commit 06daceb2e027d64b975b29c2dac7385942fa8101
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T21:56:55Z
make ContextCleanerSuite pass ... though maybe the test is pointless
commit cd16ee869e6374931b6dfbb90f33044f4579b7a7
Author: Imran Rashid <[email protected]>
Date: 2015-06-02T22:06:48Z
fix tests
----