GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/469
[SPARK-1538] Fix SparkUI incorrectly hiding persisted RDDs
*Bug*: After running
`sc.parallelize(1 to 1000, 4).persist.map(_ + 1).count()`,
the SparkUI does not show the persisted RDD.
*Cause*: The command creates two RDDs in one stage, a
`ParallelCollectionRDD` and a `MappedRDD`. However, the existing `StageInfo`
only keeps the `RDDInfo` of the last RDD associated with the stage
(`MappedRDD`), and so all RDD information regarding the first RDD
(`ParallelCollectionRDD`) is discarded. In this case, we persist the first RDD,
but the `StorageTab` doesn't know about this RDD because it is not encoded in
the `StageInfo`.
*Fix*: Record information about every RDD in the stage in `StageInfo`, instead
of just the last RDD (i.e. `stage.rdd`). Since stage boundaries are marked by
shuffle dependencies, the solution is to traverse the last RDD's dependency tree,
visiting only ancestor RDDs reachable through a sequence of narrow dependencies.
(This PR also moves `RDDInfo` to its own file.)
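The traversal described in the fix can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Node`, `Narrow`, `Shuffle`, and `rddsInStage` are hypothetical stand-ins for Spark's RDD and dependency classes, chosen so the example runs without Spark on the classpath.

```scala
object StageRdds {
  // Hypothetical stand-ins for Spark's RDD and dependency types.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // same stage
  final case class Shuffle(parent: Node) extends Dep  // stage boundary
  final case class Node(id: Int, name: String, deps: List[Dep] = Nil)

  // Returns the last RDD of a stage plus every ancestor reachable through
  // narrow dependencies only. Shuffle dependencies are never followed,
  // because their parents belong to an earlier stage.
  def rddsInStage(last: Node): List[Node] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Int, Node]
    var pending: List[Node] = List(last)
    while (pending.nonEmpty) {
      val current = pending.head
      pending = pending.tail
      if (!seen.contains(current.id)) {
        seen(current.id) = current
        current.deps.foreach {
          case Narrow(p)  => pending = p :: pending
          case Shuffle(_) => () // stop at the stage boundary
        }
      }
    }
    seen.values.toList
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the command from the bug report: a persisted
    // ParallelCollectionRDD followed by a MappedRDD in the same stage.
    val parallel = Node(0, "ParallelCollectionRDD")
    val mapped = Node(1, "MappedRDD", List(Narrow(parallel)))
    println(rddsInStage(mapped).map(_.name))
    // prints List(MappedRDD, ParallelCollectionRDD)
  }
}
```

The sketch uses an explicit work list rather than recursion so that a long lineage does not exhaust the call stack, and it tracks visited ids so that an RDD reached through more than one narrow path is recorded only once.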
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark storage-ui-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/469.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #469
----
commit bfe83f09e9d6f6c3dbce3d9d3caa0abc1dacb981
Author: Andrew Or <[email protected]>
Date: 2014-04-21T21:56:33Z
Backtrace RDD dependency tree to find all RDDs that belong to a Stage
The Stage boundary is marked by shuffle dependencies. When multiple RDDs
are related by narrow dependencies, they should all be associated with the
same Stage. Following backward narrow dependency pointers allows StageInfo
to hold the information of all relevant RDDs, rather than just the last one
associated with the Stage.
This commit also moves RDDInfo to its own file.
----