GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/5729
[SPARK-6943][WIP] Show RDD DAG visualization on stage UI
This patch does not work yet in its current state. It is currently
blocked on a closure cleaner fix (#5685), which it needs before it can
produce the screenshot below.
-----------------------------------------------------------------------------------------------------------------------------------------
This patch adds the ability to display the per-stage RDD DAG on the
SparkUI. At a high level, the information displayed includes:
- The names and IDs of all RDDs involved in a stage
- How these RDDs depend on each other, and
- What scopes these RDDs are defined in
Scope here refers to the user-facing operation that created the RDDs (e.g.
`textFile`, `treeAggregate`).
This blatantly stole a few lines of HTML and JavaScript from #5547 (thanks
@shroffpradyumn!). We will have to deal with the merge conflicts a little later.
For instance, here's what the first stage of word count looks like:
```
sc.textFile("README.md")
.flatMap { _.split(" ") }
.map { (_, 1) }
.reduceByKey { _ + _ }
.sortBy(-_._2)
```

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark viz2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5729.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5729
----
commit 6b3403be587fce495276fcb137d3d8d7afc839a7
Author: Andrew Or <[email protected]>
Date: 2015-04-17T00:33:26Z
Scope all RDD methods
This commit provides a mechanism to set and unset the call scope
around each RDD operation defined in RDD.scala. This is useful
for tagging an RDD with the scope in which it is created. This
will be extended to similar methods in SparkContext.scala and
other relevant files in a future commit.
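
As a rough illustration of the set/unset mechanism described in this
commit message (this is not the code from the patch; `ScopeSketch`,
`withScope` and `currentScope` are hypothetical names, and the use of
a thread-local is an assumption), a scope could be installed around an
operation and restored afterwards like this:

```scala
// Hypothetical sketch, not the implementation in this patch.
object ScopeSketch {
  // Assumption: the scope is tracked per thread.
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  // The scope that a newly created RDD would be tagged with, if any.
  def currentScope: Option[String] = current.get()

  // Set the scope, run the body, then restore the previous scope,
  // even if the body throws.
  def withScope[T](name: String)(body: => T): T = {
    val previous = current.get()
    current.set(Some(name))
    try body finally current.set(previous)
  }
}
```

Anything evaluated inside `ScopeSketch.withScope("textFile") { ... }`
would see `Some("textFile")` as its scope, and the previous value is
back in place once the block returns.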
commit a9ed4f9e563a6b4ba4a351f0170da53b3a4c973f
Author: Andrew Or <[email protected]>
Date: 2015-04-17T00:46:19Z
Add a few missing scopes to certain RDD methods
commit 5143523227d1dc989658f2f8a11e5fa97d8add03
Author: Andrew Or <[email protected]>
Date: 2015-04-17T01:44:08Z
Expose the necessary information in RDDInfo
This includes the scope field that we added in previous commits,
and the parent IDs for tracking the lineage through the listener
API.
commit 21843488193295fea8a08c3cb1556d0b62a809ba
Author: Andrew Or <[email protected]>
Date: 2015-04-17T18:00:31Z
Translate RDD information to dot file
It turns out that the previous scope information is insufficient
for producing a valid dot file. In particular, the scope hierarchy
was missing, yet it is crucial for telling a parent RDD in the same
encompassing scope apart from one in a completely distinct scope.
Unique scope identifiers are also needed to simplify the code
significantly.
This commit further adds the translation logic in a UI listener
that converts RDDInfos to dot files.
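
To make the role of the scope hierarchy and the unique scope IDs more
concrete, here is a minimal sketch of such a translation. It is not
the listener code from this patch: `DotSketch`, `Scope` and `Node` are
hypothetical stand-ins (the real RDDInfo carries more fields), and
scope IDs are assumed to be valid dot identifiers. Each scope becomes
a dot `subgraph cluster_<id>`, nested inside the cluster of its parent
scope, and each parent ID becomes an edge:

```scala
// Hypothetical sketch, not the translation logic in this patch.
object DotSketch {
  final case class Scope(id: String, name: String, parent: Option[Scope])
  final case class Node(id: Int, name: String, parentIds: Seq[Int], scope: Option[Scope])

  def toDot(nodes: Seq[Node]): String = {
    val sb = new StringBuilder("digraph G {\n")

    // Every scope that appears on a node, plus all of its ancestors.
    def ancestry(s: Scope): List[Scope] = s :: s.parent.map(ancestry).getOrElse(Nil)
    val scopes = nodes.flatMap(_.scope).flatMap(ancestry).distinct

    // Emit the nodes that sit directly in `parent`, then recurse into
    // each child scope as a nested cluster.
    def emit(parent: Option[Scope], indent: String): Unit = {
      nodes.filter(_.scope == parent).foreach { n =>
        sb.append(indent + n.id + " [label=\"" + n.name + " [" + n.id + "]\"];\n")
      }
      scopes.filter(_.parent == parent).foreach { s =>
        sb.append(indent + "subgraph cluster_" + s.id + " {\n")
        sb.append(indent + "  label=\"" + s.name + "\";\n")
        emit(Some(s), indent + "  ")
        sb.append(indent + "}\n")
      }
    }
    emit(None, "  ")

    // One edge per (parent RDD, child RDD) pair.
    for (n <- nodes; p <- n.parentIds) sb.append("  " + p + "->" + n.id + ";\n")
    sb.append("}\n")
    sb.toString
  }
}
```

Without the `parent` link on `Scope`, the sketch could not tell whether
two clusters should be nested or drawn side by side, which is the gap
this commit message describes.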
commit f22f3379edbdb301631440d1627fb633d0da143f
Author: Andrew Or <[email protected]>
Date: 2015-04-17T20:52:17Z
First working implementation of visualization with viz.js
commit 9fac6f37e08b74ae19fa268923d10871ffe08aed
Author: Andrew Or <[email protected]>
Date: 2015-04-22T02:23:16Z
Re-implement scopes through annotations instead
The previous "working" implementation frequently ran into
NotSerializableExceptions. Why? ClosureCleaner doesn't like
closures being wrapped in other closures, and these closures
are simply not cleaned (details are intentionally omitted here).
This commit reimplements scoping through annotations. All methods
that should be scoped are now annotated with @RDDScope. Then, on
creation, each RDD derives its scope from the stack trace, similar
to how it derives its call site. This is the cleanest approach
that bypasses NotSerializableExceptions with the least significant
limitations.
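
The idea of deriving a scope from the stack trace can be sketched as
follows. This is a simplification and not the code in this commit: a
fixed set of method names stands in for "methods annotated with
@RDDScope", `StackScopeSketch` and `deriveScope` are hypothetical
names, and picking the nearest enclosing match is an assumption:

```scala
// Hypothetical sketch, not the implementation in this patch.
object StackScopeSketch {
  // Stand-in for the set of methods carrying the annotation.
  val scopedMethods: Set[String] = Set("textFile", "treeAggregate")

  // Walk the current stack trace from the innermost frame outwards and
  // return the first method name that is registered as a scope.
  def deriveScope(): Option[String] =
    Thread.currentThread().getStackTrace()
      .map(_.getMethodName)
      .find(scopedMethods.contains)

  // A scoped and an unscoped method, for illustration only.
  def textFile(): Option[String] = deriveScope()
  def unscoped(): Option[String] = deriveScope()
}
```

Here `StackScopeSketch.textFile()` finds its own frame and reports
`textFile` as the scope, while `unscoped()` finds no registered frame.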
commit 494d5c28b38d3d829f008a1bba406e63d4ec8680
Author: Andrew Or <[email protected]>
Date: 2015-04-22T02:39:14Z
Revert a few unintended style changes
commit 6a7cdcaed6bb6fd856bd7e2e15b0d78cbdb0b2d1
Author: Andrew Or <[email protected]>
Date: 2015-04-22T03:00:30Z
Move RDD scope util methods and logic to its own file
Just a small code re-organization.
commit 5e22946945f683927cabafeb0ede3bc8e275e4a0
Author: Andrew Or <[email protected]>
Date: 2015-04-22T03:01:17Z
Merge branch 'master' of github.com:apache/spark into viz
commit 205f838477de8cabd28aab6301a67fd7d07bc517
Author: Andrew Or <[email protected]>
Date: 2015-04-23T05:33:31Z
Reimplement rendering with dagre-d3 instead of viz.js
Before this commit, this patch relied on a JavaScript version of
GraphViz that was compiled from C. Even the minified version of
this resource was ~2.5M. The main motivation for switching away
from this library, however, is that it is a complete black box
over which we have absolutely no control. It is not at all
extensible, and if something breaks we will have a hard time
understanding why.
The new library, dagre-d3, is not perfect either. It does not
officially support clustering of nodes; for certain large graphs,
the clusters will have a lot of unnecessary whitespace. A few in
the dagre-d3 community are looking into a solution, but until then
we will have to live with this (minor) inconvenience.
commit fe7816fe25c2f68ff2eee931ebe7a95b1cc97cdf
Author: Andrew Or <[email protected]>
Date: 2015-04-27T19:37:41Z
Merge branch 'master' of github.com:apache/spark into viz
commit 8dd5af265ee0c395c4c6d831ca697775d9e28104
Author: Andrew Or <[email protected]>
Date: 2015-04-27T21:50:45Z
Fill in documentation + miscellaneous minor changes
For instance, this adds the ability to throw away old stage graphs.
commit 71281fa15d3bebac583e93ff84c5062f760b753d
Author: Andrew Or <[email protected]>
Date: 2015-04-27T22:40:52Z
Embed the viz in the UI in a toggleable manner
commit 09d361eb53a98d758891f3db39d8c9d4c239ee88
Author: Andrew Or <[email protected]>
Date: 2015-04-27T23:42:19Z
Add ID to node label (minor)
commit c07647c27e656e8538797b2e2581ed99290e4b5c
Author: Andrew Or <[email protected]>
Date: 2015-04-27T23:21:04Z
Re-implement scopes using closures instead of annotations
The problem with annotations is that there is no way to associate
an RDD's scope with another's. This is because the stack trace
simply does not expose enough information for us to associate one
instance of a method invocation with another.
So, we're back to closures. Note that this still suffers from the
same NotSerializableException issue discussed previously, which is
being fixed in the ClosureCleaner separately.
commit 47b28b3aa14901e674cb7465a420c3256e622076
Author: Andrew Or <[email protected]>
Date: 2015-04-27T23:34:31Z
Ensure that HadoopRDD is actually serializable
----