Hi dev@spark, I wanted to quickly ping about Spree <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>, a live-updating web UI for Spark that I released on Friday (along with some supporting infrastructure), and mention a couple things that came up while I worked on it that are relevant to this list.
This blog post <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/> and github <https://github.com/hammerlab/spree/> have lots of info about functionality, implementation details, and installation instructions, but the tl;dr is:

- You register a SparkListener called JsonRelay <https://github.com/hammerlab/spark-json-relay> via the spark.extraListeners conf (thanks @JoshRosen!).
- That listener ships SparkListenerEvents to a server called slim <https://github.com/hammerlab/slim> that stores them in Mongo.
  - Really what it stores are a bunch of stats similar to those maintained by JobProgressListener.
- A Meteor <https://www.meteor.com/> app displays live-updating views of what’s in Mongo.

Feel free to read about it / try it! But the rest of this email is just questions about Spark APIs and plans.

JsonProtocol scoping

The most awkward thing about Spree is that JsonRelay declares itself to be in org.apache.spark <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L1> so that it can use JsonProtocol. Will JsonProtocol be private[spark] forever, on purpose, or is it just not considered stable enough yet, so you want to discourage direct use? I’m relatively impartial at this point, since I’ve done the hacky thing and it works for my purposes, but I thought I’d ask in case there are interesting perspectives on the ideal scope for it going forward.

@DeveloperApi trait SparkListener

Another set of tea leaves I wasn’t sure how to read was the @DeveloperApi-ness of SparkListener <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L131-L132>. I assumed I was doing something frowny by having JsonRelay implement the SparkListener interface. However, I just noticed that I’m actually extending SparkFirehoseListener <https://github.com/apache/spark/blob/v1.4.1/core/src/main/java/org/apache/spark/SparkFirehoseListener.java>, which is *not* @DeveloperApi afaict, so maybe I’m OK there after all? Are there other SparkListener implementations of note in the wild (seems like “no”)? Is that an API that people can and should use externally (seems like “yes” to me)? I saw @vanzin recently imply on this list that the answers may be “no” and “no” <http://apache-spark-developers-list.1001551.n3.nabble.com/Slight-API-incompatibility-caused-by-SPARK-4072-tp13257.html>.

Augmenting JsonProtocol

JsonRelay does two things that JsonProtocol does not:

- It adds an appId field to all events; this makes it possible/easy for downstream things (slim, in this case) to handle information about multiple Spark applications.
- It JSON-serializes SparkListenerExecutorMetricsUpdate events. This was added to JsonProtocol in SPARK-9036 <https://issues.apache.org/jira/browse/SPARK-9036> (though it’s unused in the Spark repo currently), but I’ll have to leave my version in as long as I want to support Spark <= 1.4.1.
  - From one perspective, JobProgressListener was sort of “cheating” by using these events, which were previously not accessible via JsonProtocol. Making an effort to let external tools get the same kinds of data as the internal listeners seems like a good principle to maintain, and it’s also relevant to the scoping questions about JsonProtocol above.

Should JsonProtocol add appIds to all events itself? Should Spark make it easier for downstream things to process events from multiple Spark applications?
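For concreteness, here’s roughly what that pattern looks like, condensed into a sketch (names here are illustrative; the real JsonRelay also special-cases ExecutorMetricsUpdate events, which JsonProtocol can’t serialize before SPARK-9036, and writes to a network socket rather than a stub):

    // Has to live in org.apache.spark, because JsonProtocol is private[spark].
    package org.apache.spark

    import org.apache.spark.scheduler.SparkListenerEvent
    import org.apache.spark.util.JsonProtocol
    import org.json4s.JsonAST.{JObject, JString}
    import org.json4s.jackson.JsonMethods.{compact, render}

    // Registered via e.g. --conf spark.extraListeners=org.apache.spark.RelaySketch
    // (Spark prefers a constructor taking a SparkConf when instantiating extraListeners).
    class RelaySketch(conf: SparkConf) extends SparkFirehoseListener {
      private val appId = conf.get("spark.app.id")

      override def onEvent(event: SparkListenerEvent): Unit = {
        // Serialize with the (private[spark]) JsonProtocol, then prepend an appId
        // field so downstream consumers can multiplex many applications' events.
        val json = JsonProtocol.sparkEventToJson(event) match {
          case JObject(fields) => JObject(("appId" -> JString(appId)) :: fields)
          case other           => other
        }
        send(compact(render(json)))
      }

      private def send(s: String): Unit = ??? // stub; real code does a socket write
    }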
JsonRelay currently pulls the app ID out of the SparkConf that it is instantiated with <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L16>; it works, but it also feels hacky, like maybe I’m doing things I’m not supposed to.

Thrift SparkListenerEvent Implementation?

A few months ago I built a first version of this project involving a SparkListener called Spear <https://github.com/hammerlab/spear> that aggregated stats from SparkListenerEvents *and* wrote those stats to Mongo, combining JsonRelay and slim from above. Spear used a couple of libraries (Rogue <https://github.com/foursquare/rogue> and Spindle <https://github.com/foursquare/spindle>) to define schemas in thrift, generate Scala code for them, and do all the Mongo querying in a nice, type-safe way.

Unfortunately for me, all of the Mongo queries were synchronous in that implementation, which led to events being dropped <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40> when I tested it on large jobs (thanks a lot to @squito for helping debug that). I got spooked and decided all possible work had to be removed from the Spark driver, hence JsonRelay doing the most minimal possible thing and sending the events to a network address. I also decided to rewrite my JobProgressListener-equivalent piece in Node rather than Scala, because I was finding the type-safety a little constraining and Rogue and Spindle are no longer maintained.

Anyways, a couple of things I want to call out, given this back-story:

- There is a bunch of boilerplate related to implementing my Mongo schemas sitting abandoned in Spear, in thrift <https://github.com/hammerlab/spear/blob/5f2affe80fae833a120eb63d51154c3c00ee57b0/src/main/thrift/spark.thrift> and case classes <https://github.com/hammerlab/spear/blob/6f9efa9aab0ce00fe229083ded237199aafd9b74/src/main/scala/org/hammerlab/spear/SparkIDL.scala>.
- One thing on my roadmap at the time was to replicate the SparkListenerEvents themselves in thrift and/or case classes.
  - They seem like things you’d want to be able to send in and out of various services/languages easily.
  - Writing the raw events to Mongo is a related goal for slim, covered by slim#35 <https://github.com/hammerlab/slim/issues/35>.
- It’s probably not super difficult to take something like Spear and make it use non-blocking Mongo queries (there’s a rough sketch of this at the end of this email).
  - This might make it quick enough to not have to drop events.
  - It would simplify the infrastructure story by folding slim into the listener that is registered with the driver.

Possible Upstreaming

I don’t know what the interest level is for getting any of this functionality (e.g. live-updating web UI, event-stats persistence to Mongo) upstreamed, and I don’t have time to work on that myself, so I won’t go into much detail here. But if anyone’s interested, I have some ideas about how some of it could happen (e.g. a DDP <https://www.meteor.com/ddp> server implementation in java/scala serving the web UI; Spree currently involves two JS servers, so some rewriting would probably have to happen), why it might be good to do, and why it might not be good or not worth it (e.g. Spark should make sure it’s possible and easy to do sophisticated things like this outside of the Spark repo, putting more work in the driver process is a bad idea, etc.).
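One last thing, re: the non-blocking-Mongo idea in the Spear section above, here’s roughly the shape I have in mind (a sketch only: the stats aggregation and the actual Mongo write are elided, and any async client would do):

    import java.util.concurrent.Executors

    import scala.concurrent.{ExecutionContext, Future}

    import org.apache.spark.SparkFirehoseListener
    import org.apache.spark.scheduler.SparkListenerEvent

    // Do ~no work on the listener-bus thread: hand each event off to a
    // dedicated single-thread pool, so LiveListenerBus's bounded queue never
    // backs up waiting on Mongo. The trade-off is that this executor's own
    // (unbounded) queue grows if the writes can't keep up.
    class AsyncStatsListener extends SparkFirehoseListener {
      private implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

      override def onEvent(event: SparkListenerEvent): Unit =
        Future { writeStats(event) } // returns immediately

      // Hypothetical: aggregate stats and issue the Mongo write here.
      private def writeStats(event: SparkListenerEvent): Unit = ???
    }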
OK, that’s my brain dump. I’d love to hear people’s thoughts on any/all of this; otherwise, thanks for the APIs, and sorry for having to cheat them a bit! :)

-Ryan