+1 to awesome work. I saw it this morning, and it solves an annoyance I
have had for a while, one I had even thought about contributing something
for myself. I am excited to give it a try.
Pedro

On Mon, Jul 27, 2015 at 2:59 PM, Ryan Williams
<ryan.blake.willi...@gmail.com> wrote:

> Hi dev@spark, I wanted to quickly ping about Spree
> <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>,
> a live-updating web UI for Spark that I released on Friday (along with
> some supporting infrastructure), and mention a couple things that came
> up while I worked on it that are relevant to this list.
>
> This blog post
> <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>
> and github <https://github.com/hammerlab/spree/> have lots of info about
> functionality, implementation details, and installation instructions,
> but the tl;dr is:
>
>   - You register a SparkListener called JsonRelay
>     <https://github.com/hammerlab/spark-json-relay> via the
>     spark.extraListeners conf (thanks @JoshRosen!); see the sketch just
>     below this list.
>   - That listener ships SparkListenerEvents to a server called slim
>     <https://github.com/hammerlab/slim> that stores them in Mongo.
>       - Really what it stores are a bunch of stats similar to those
>         maintained by JobProgressListener.
>   - A Meteor <https://www.meteor.com/> app displays live-updating views
>     of what's in Mongo.
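>
> Concretely, the registration is just a conf setting; roughly like this
> (a sketch, and the same setting works as a --conf flag to spark-submit):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       // Spark instantiates the named listener class (JsonRelay's
>       // constructor takes the SparkConf) and attaches it to the
>       // driver's listener bus.
>       .set("spark.extraListeners", "org.apache.spark.JsonRelay")
>     val sc = new SparkContext(conf)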
>
> Feel free to read about it / try it! But the rest of this email is just
> questions about Spark APIs and plans.
>
> JsonProtocol scoping
>
> The most awkward thing about Spree is that JsonRelay declares itself to
> be in org.apache.spark
> <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L1>
> so that it can use JsonProtocol.
>
> Will JsonProtocol be private[spark] forever, on purpose, or is it just
> not considered stable enough yet, so you want to discourage direct use?
> I'm relatively impartial at this point, since I've done the hacky thing
> and it works for my purposes, but I thought I'd ask in case there are
> interesting perspectives on the ideal scope for it going forward.
>
> @DeveloperApi trait SparkListener
>
> Another set of tea leaves I wasn't sure how to read was the
> @DeveloperApi-ness of SparkListener
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L131-L132>.
> I assumed I was doing something frowny by having JsonRelay implement the
> SparkListener interface. However, I just noticed that I'm actually
> extending SparkFirehoseListener
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/java/org/apache/spark/SparkFirehoseListener.java>,
> which is *not* @DeveloperApi afaict, so maybe I'm OK there after all?
>
> Are there other SparkListener implementations of note in the wild (seems
> like "no")? Is that an API that people can and should use externally
> (seems like "yes" to me)? I saw @vanzin recently imply on this list that
> the answers may be "no" and "no"
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Slight-API-incompatibility-caused-by-SPARK-4072-tp13257.html>.
>
> Augmenting JsonProtocol
>
> JsonRelay does two things that JsonProtocol does not:
>
>   - It adds an appId field to all events; this makes it possible/easy
>     for downstream things (slim, in this case) to handle information
>     about multiple Spark applications.
>   - It JSON-serializes SparkListenerExecutorMetricsUpdate events. This
>     was added to JsonProtocol in SPARK-9036
>     <https://issues.apache.org/jira/browse/SPARK-9036> (though it's
>     currently unused in the Spark repo), but I'll have to leave my
>     version in as long as I want to support Spark <= 1.4.1.
>       - From one perspective, JobProgressListener was sort of "cheating"
>         by using these events, which were previously not accessible via
>         JsonProtocol.
>
> It seems like making an effort to let external tools get the same kinds
> of data as the internal listeners is a good principle to try to
> maintain, which is also relevant to the scoping questions about
> JsonProtocol above.
>
> Should JsonProtocol add appIds to all events itself? Should Spark make
> it easier for downstream things to process events from multiple Spark
> applications? JsonRelay currently pulls the app ID out of the SparkConf
> that it is instantiated with
> <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L16>;
> it works, but it also feels hacky, like maybe I'm doing things I'm not
> supposed to.
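>
> For concreteness, the scoping cheat, the firehose listener, and the
> appId augmentation together amount to roughly the following (a
> simplified sketch, not JsonRelay verbatim; RelaySketch and send are
> stand-in names):
>
>     package org.apache.spark  // the cheat: JsonProtocol is private[spark]
>
>     import org.apache.spark.scheduler.SparkListenerEvent
>     import org.apache.spark.util.JsonProtocol
>     import org.json4s.JsonAST.{JObject, JString}
>     import org.json4s.jackson.JsonMethods.{compact, render}
>
>     class RelaySketch(conf: SparkConf) extends SparkFirehoseListener {
>       private val appId = conf.get("spark.app.id", "")
>
>       override def onEvent(event: SparkListenerEvent): Unit = {
>         // Serialize with Spark's own JsonProtocol, then prepend an
>         // appId field so downstream consumers can tell applications
>         // apart.
>         val json = JsonProtocol.sparkEventToJson(event) match {
>           case JObject(fields) =>
>             JObject(("appId" -> JString(appId)) :: fields)
>           case other => other
>         }
>         send(compact(render(json)))
>       }
>
>       // Stand-in for shipping the JSON to slim over the network.
>       private def send(s: String): Unit = ???
>     }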
>
> Thrift SparkListenerEvent Implementation?
>
> A few months ago I built a first version of this project involving a
> SparkListener called Spear <https://github.com/hammerlab/spear> that
> aggregated stats from SparkListenerEvents *and* wrote those stats to
> Mongo, combining JsonRelay and slim from above.
>
> Spear used a couple of libraries (Rogue
> <https://github.com/foursquare/rogue> and Spindle
> <https://github.com/foursquare/spindle>) to define schemas in thrift,
> generate Scala for those classes, and do all the Mongo querying in a
> nice, type-safe way.
>
> Unfortunately for me, all of the Mongo queries were synchronous in that
> implementation, which led to events being dropped
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40>
> when I tested it on large jobs (thanks a lot to @squito for helping
> debug that). I got spooked and decided all possible work had to be
> removed from the Spark driver, hence JsonRelay doing the most minimal
> possible thing and sending the events to a network address.
>
> I also decided to rewrite my JobProgressListener-equivalent piece in
> Node rather than Scala, because I was finding the type-safety a little
> constraining, and Rogue and Spindle are no longer maintained.
>
> Anyway, a couple of things I want to call out, given this back-story:
>
>   - There is a bunch of boilerplate related to implementing my Mongo
>     schemas sitting abandoned in Spear, in thrift
>     <https://github.com/hammerlab/spear/blob/5f2affe80fae833a120eb63d51154c3c00ee57b0/src/main/thrift/spark.thrift>
>     and case classes
>     <https://github.com/hammerlab/spear/blob/6f9efa9aab0ce00fe229083ded237199aafd9b74/src/main/scala/org/hammerlab/spear/SparkIDL.scala>.
>   - One thing on my roadmap at the time was to replicate the
>     SparkListenerEvents themselves in thrift and/or case classes.
>       - They seem like things you'd want to be able to send in and out
>         of various services/languages easily.
>       - Writing the raw events to Mongo is a related goal for slim,
>         covered by slim#35 <https://github.com/hammerlab/slim/issues/35>.
>   - It's probably not super difficult to take something like Spear and
>     make it use non-blocking Mongo queries.
>       - This might make it quick enough to avoid dropping events.
>       - It would simplify the infrastructure story by folding slim into
>         the listener that is registered with the driver.
>
> Possible Upstreaming
>
> I don't know what the interest level is for getting any of this
> functionality (e.g. the live-updating web UI, or event-stats persistence
> to Mongo) upstreamed, and I don't have time to work on that, so I won't
> go into much detail here. But if anyone's interested, I have some ideas
> about how some of it could happen (e.g. a DDP
> <https://www.meteor.com/ddp> server implementation in java/scala serving
> the web UI; Spree currently involves two JS servers, so some rewriting
> would probably have to happen), why it might be good to do, and why it
> might not be good or not worth it (e.g. Spark should make sure it's
> possible and easy to do sophisticated things like this outside of the
> Spark repo; putting more work in the driver process is a bad idea; etc.).
>
> OK, that's my brain dump. I'd love to hear people's thoughts on any/all
> of this; otherwise, thanks for the APIs, and sorry for having to cheat
> them a bit! :)
>
> -Ryan

--
Pedro Rodriguez
UCBerkeley 2014 | Computer Science
SnowGeek <http://SnowGeek.org>
pedro-rodriguez.com
ski.rodrig...@gmail.com
208-340-1703