GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/469

    [Spark-1538] Fix SparkUI incorrectly hiding persisted RDDs

    *Bug*: After running 
`sc.parallelize(1 to 1000, 4).persist.map(_ + 1).count()`, the SparkUI does 
not show the persisted RDD.
    
    *Cause*: The command creates two RDDs in one stage, a 
`ParallelCollectionRDD` and a `MappedRDD`. However, the existing `StageInfo` 
only keeps the `RDDInfo` of the last RDD associated with the stage 
(`MappedRDD`), so all information about the first RDD 
(`ParallelCollectionRDD`) is discarded. In this case, we persist the first RDD, 
but the `StorageTab` doesn't know about it because it is not recorded in 
the `StageInfo`.
    
    *Fix*: Record information for all RDDs in `StageInfo`, instead of just the 
last RDD (i.e. `stage.rdd`). Since stage boundaries are marked by shuffle 
dependencies, the solution is to traverse the last RDD's dependency tree 
backward, visiting only ancestor RDDs reachable through a sequence of narrow 
dependencies.
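
    The traversal described in the fix can be sketched roughly as follows. 
This is a toy model, not the code in this PR: `Node`, `Dep`, `Narrow`, 
`Shuffle`, and `StageRdds.narrowAncestors` are hypothetical stand-ins for 
Spark's RDD and dependency classes, chosen only to illustrate stopping at 
shuffle boundaries.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's classes.
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep
final case class Shuffle(parent: Node) extends Dep

final case class Node(id: Int, name: String, deps: List[Dep] = Nil)

object StageRdds {
  /** Collect `last` and every ancestor reachable through narrow dependencies only. */
  def narrowAncestors(last: Node): List[Node] = {
    @annotation.tailrec
    def loop(pending: List[Node], seen: Set[Int], acc: List[Node]): List[Node] =
      pending match {
        case Nil => acc.reverse
        // Skip nodes already visited (an ancestor may be reachable by two paths).
        case n :: rest if seen(n.id) => loop(rest, seen, acc)
        case n :: rest =>
          // Follow narrow dependencies only; a shuffle dependency marks a
          // stage boundary, so its parent belongs to an earlier stage.
          val parents = n.deps.collect { case Narrow(p) => p }
          loop(parents ::: rest, seen + n.id, n :: acc)
      }
    loop(List(last), Set.empty, Nil)
  }
}
```

    Under this model, with `val parallel = Node(0, "ParallelCollectionRDD")` 
and `val mapped = Node(1, "MappedRDD", List(Narrow(parallel)))`, calling 
`StageRdds.narrowAncestors(mapped)` returns both nodes, whereas keeping only 
the last node would miss the persisted one.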
    
    (This PR also moves `RDDInfo` to its own file.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark storage-ui-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/469.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #469
    
----
commit bfe83f09e9d6f6c3dbce3d9d3caa0abc1dacb981
Author: Andrew Or <[email protected]>
Date:   2014-04-21T21:56:33Z

    Backtrace RDD dependency tree to find all RDDs that belong to a Stage
    
    The Stage boundary is marked by shuffle dependencies. When multiple RDDs
    are related by narrow dependencies, they should all be associated with the
    same Stage. Following narrow dependency pointers backward allows StageInfo
    to hold the information of all relevant RDDs, rather than just the last one
    associated with the Stage.
    
    This commit also moves RDDInfo to its own file.

----


