[GitHub] spark pull request #13677: [SPARK 15926] Improve readability of DAGScheduler...

kayousterhout Tue, 14 Jun 2016 17:58:32 -0700

GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/13677


    [SPARK 15926] Improve readability of DAGScheduler stage creation methods

    ## What changes were proposed in this pull request?
    
    This pull request refactors parts of the DAGScheduler to improve 
readability, focusing on the code around stage creation.  One goal of this 
change it to make it clearer which functions may create new stages (as opposed 
to looking up stages that already exist).  There are no functionality changes 
in this pull request.  In more detail:
    
    * shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the 
existing code I have sometimes struggled to remember what the key is -- is it a 
stage? A stage id? This change is intended to avoid that confusion)
    * Cleaned up the code to create shuffle map stages.  Previously, creating a 
shuffle map stage involved 3 different functions (newOrUsedShuffleStage, 
newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the 
purpose of each function was.  With the new code, a single function 
(getOrCreateShuffleMapStage) is responsible for getting a stage (if it already 
exists) or creating new shuffle map stages and any missing ancestor stages, and 
it delegates to createShuffleMapStage when new stages need to be created.  
There's some remaining confusion here because the getOrCreateParentStages call 
in createShuffleMapStage may recursively create ancestor stages; this is an 
issue I plan to fix in a future pull request, because it's trickier to fix and 
involves a slight functionality change.
    * newResultStage was renamed to createResultStage, for consistency with 
naming around shuffle map stages.
    * getParentStages has been renamed to getOrCreateParentStages, to make it 
clear that this function will sometimes create missing ancestor stages.
    * The only *slight* functionality change is that on line 478, 
updateJobIdStageIdMaps now uses a stage's parents instance variable rather than 
re-calculating them (I couldn't see any reason why they'd need to be 
re-calculated, and suspect this is just leftover from older code).
    * getAncestorShuffleDependencies was renamed to 
getMissingAncestorShuffleDependencies, to make it clear that this only returns 
dependencies that have not yet been run.
    
    cc @squito @markhamstra @JoshRosen (who requested more DAG scheduler 
commenting long ago -- an issue this pull request tries, in part, to address)
    
    FYI @rxin
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-15926

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13677.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13677
    
----
commit 4dca8001cd987016fd683988632337ec64b67444
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-15T00:34:50Z

    Renamed shuffleToMapStaqe in DAGScheduler.
    
    This commit renames shuffleToMapStage to shuffleIdToMapStage to make
    it clear that the shuffle id (and not the shuffle dependency, or the
    shuffle map stage, or the shuffle map stage id) is the key in the
    hash map.

commit 1468b91cc3ca493bc18e16525fe39444616f2bb2
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-10T21:17:10Z

    Cleaned up the code to create new shuffle map stages

commit a1a51ac8be0e5c22917464e2e9389fe33351f0b0
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-10T23:10:12Z

    Various comment and naming cleanups

commit 671d5de33c21dc279e6dda83dfa8366b8adf6f9c
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-11T00:22:59Z

    More cleanup of shuffle map stage creation

commit 43841a7350af38551e9c349ab2f22c7e7e0a444d
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-13T19:38:07Z

    Some more naming and commenting improvements

commit 49156f245baab7282575fe422f8e10906b571958
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-15T00:42:04Z

    Moved createShuffleMapStage to be a non-nested function.
    
    I think having it nested made the code harder to read.

commit 1b7a3386fb631f1a41ea73d6bfdc9961ed8e4234
Author: Kay Ousterhout <[email protected]>
Date:   2016-06-15T00:48:37Z

    Renamed newResultStage to createResultStage and added commenting

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #13677: [SPARK 15926] Improve readability of DAGScheduler...

Reply via email to