GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/13677
[SPARK 15926] Improve readability of DAGScheduler stage creation methods
## What changes were proposed in this pull request?
This pull request refactors parts of the DAGScheduler to improve
readability, focusing on the code around stage creation. One goal of this
change it to make it clearer which functions may create new stages (as opposed
to looking up stages that already exist). There are no functionality changes
in this pull request. In more detail:
* shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the
existing code I have sometimes struggled to remember what the key is -- is it a
stage? A stage id? This change is intended to avoid that confusion)
* Cleaned up the code to create shuffle map stages. Previously, creating a
shuffle map stage involved 3 different functions (newOrUsedShuffleStage,
newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the
purpose of each function was. With the new code, a single function
(getOrCreateShuffleMapStage) is responsible for getting a stage (if it already
exists) or creating new shuffle map stages and any missing ancestor stages, and
it delegates to createShuffleMapStage when new stages need to be created.
There's some remaining confusion here because the getOrCreateParentStages call
in createShuffleMapStage may recursively create ancestor stages; this is an
issue I plan to fix in a future pull request, because it's trickier to fix and
involves a slight functionality change.
* newResultStage was renamed to createResultStage, for consistency with
naming around shuffle map stages.
* getParentStages has been renamed to getOrCreateParentStages, to make it
clear that this function will sometimes create missing ancestor stages.
* The only *slight* functionality change is that on line 478,
updateJobIdStageIdMaps now uses a stage's parents instance variable rather than
re-calculating them (I couldn't see any reason why they'd need to be
re-calculated, and suspect this is just leftover from older code).
* getAncestorShuffleDependencies was renamed to
getMissingAncestorShuffleDependencies, to make it clear that this only returns
dependencies that have not yet been run.
cc @squito @markhamstra @JoshRosen (who requested more DAG scheduler
commenting long ago -- an issue this pull request tries, in part, to address)
FYI @rxin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 SPARK-15926
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13677.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13677
----
commit 4dca8001cd987016fd683988632337ec64b67444
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-15T00:34:50Z
Renamed shuffleToMapStaqe in DAGScheduler.
This commit renames shuffleToMapStage to shuffleIdToMapStage to make
it clear that the shuffle id (and not the shuffle dependency, or the
shuffle map stage, or the shuffle map stage id) is the key in the
hash map.
commit 1468b91cc3ca493bc18e16525fe39444616f2bb2
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-10T21:17:10Z
Cleaned up the code to create new shuffle map stages
commit a1a51ac8be0e5c22917464e2e9389fe33351f0b0
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-10T23:10:12Z
Various comment and naming cleanups
commit 671d5de33c21dc279e6dda83dfa8366b8adf6f9c
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-11T00:22:59Z
More cleanup of shuffle map stage creation
commit 43841a7350af38551e9c349ab2f22c7e7e0a444d
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-13T19:38:07Z
Some more naming and commenting improvements
commit 49156f245baab7282575fe422f8e10906b571958
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-15T00:42:04Z
Moved createShuffleMapStage to be a non-nested function.
I think having it nested made the code harder to read.
commit 1b7a3386fb631f1a41ea73d6bfdc9961ed8e4234
Author: Kay Ousterhout <[email protected]>
Date: 2016-06-15T00:48:37Z
Renamed newResultStage to createResultStage and added commenting
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]