Hi Spark Developers,

After some discussion on SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> (and on the PR <https://github.com/apache/spark/pull/14691>), we've decided to bring this back to the developer list. SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> itself comes out of our early work on SPARK-16424 <https://issues.apache.org/jira/browse/SPARK-16424> to enable ML with the new Structured Streaming API.

SPARK-16407 proposes extending the current DataStreamWriter API to let users specify a concrete instance of a StreamSinkProvider <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.sources.StreamSinkProvider>. This makes it easier for users to create sinks that are configured with things besides strings (for example, lambdas). An example of something like this already inside Spark is the ForeachSink.
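To make the shape of the proposal concrete, here is a minimal sketch. These are hypothetical, simplified stand-ins, not the real Spark types: the real StreamSinkProvider.createSink also takes a SQLContext, partition columns, and an OutputMode, and DataStreamWriter has no `sink(...)` method today.

```scala
// Hypothetical, simplified stand-ins -- NOT the real Spark API.
trait Sink {
  def addBatch(batchId: Long, rows: Seq[String]): Unit
}

trait StreamSinkProvider {
  def createSink(options: Map[String, String]): Sink
}

// A provider configured with a lambda -- state that a Map[String, String]
// of options (today's only configuration channel) cannot carry.
class CallbackSinkProvider(callback: Seq[String] => Unit) extends StreamSinkProvider {
  def createSink(options: Map[String, String]): Sink = new Sink {
    def addBatch(batchId: Long, rows: Seq[String]): Unit = callback(rows)
  }
}

// The shape SPARK-16407 proposes: the writer accepts a pre-configured
// provider instance directly, instead of a class name instantiated via
// reflection and configured only through string options.
class DataStreamWriter {
  private var provider: Option[StreamSinkProvider] = None
  def sink(p: StreamSinkProvider): DataStreamWriter = { provider = Some(p); this }
  def start(): Sink = provider.get.createSink(Map.empty)
}
```

With something like this, `writer.sink(new CallbackSinkProvider(rows => ...))` would carry the lambda straight into the sink, which no string-keyed options map can do.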
We have been working on adding support for online learning in Structured Streaming, similar to what Spark Streaming and MLlib provide today; details are in SPARK-16424 <https://issues.apache.org/jira/browse/SPARK-16424>. Along the way, we noticed that there is currently no way for code running on the driver to access the streaming output of a Structured Streaming query (in our case ideally as a Dataset or RDD, but regardless of the underlying data structure). In our specific case, we wanted to update a model on the driver using aggregates computed by a Structured Streaming query, and many other applications will have similar requirements. For example, there is no way (outside of using private Spark internals)* to implement a console sink with a user-supplied formatting function, configure a templated or generic sink at runtime, trigger a custom Python call-back, or even implement the ForeachSink outside of Spark.

For work inside of Spark to enable Structured Streaming with ML we clearly don't need SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407>, since we can access the internals directly (although it would be cleaner not to have to). But if we want to empower people working outside the Spark codebase itself with Structured Streaming, I think we need to provide some mechanism for this, and it would be great to see what options/ideas the community can come up with.

One of the arguments against SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> seems to be mostly that it exposes the Sink API, which is implemented using micro-batching. The counter-argument is that the Sink API is already exposed: instead of passing in an instance, the user passes in a class name which is then instantiated through reflection, with configuration parameters passed in as a map of strings.
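The online-learning use case above can be sketched the same way. Again these are simplified stand-ins (the real addBatch receives a DataFrame, not a Seq of doubles), and RunningMeanModel is a toy model invented for illustration; the point is only that a sink holding a direct reference to driver-side state lets the driver consume streaming output.

```scala
// Simplified stand-in for the Sink interface -- NOT the real Spark API.
trait Sink {
  def addBatch(batchId: Long, aggregates: Seq[Double]): Unit
}

// A toy "model": a running mean updated incrementally as batches arrive.
class RunningMeanModel {
  private var count = 0L
  private var mean = 0.0
  def update(values: Seq[Double]): Unit = values.foreach { v =>
    count += 1
    mean += (v - mean) / count
  }
  def current: Double = mean
}

// Because the sink holds a direct reference to the model object on the
// driver, each micro-batch's aggregates can update it in place -- exactly
// what a reflectively instantiated, string-configured sink cannot express.
class ModelUpdatingSink(model: RunningMeanModel) extends Sink {
  def addBatch(batchId: Long, aggregates: Seq[Double]): Unit =
    model.update(aggregates)
}
```

This is the kind of sink that SPARK-16407's instance-passing API would make possible outside of Spark's internals.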
Personally, I think we should expose a more strongly typed API instead of depending on strings for all configuration. Even if the Sink API itself needs to change if/when Structured Streaming moves away from micro-batching, we would still likely want to let users provide the typed interface, to give Sink creators more flexibility with configuration.

Now, obviously this is based on my understanding of the lay of the land, which could be a little off since the Spark Structured Streaming design docs and JIRAs don't seem to be actively updated - so I'd love to know what assumptions I've made that don't match the current plans for Structured Streaming.

Cheers,

Holden :)

Related Links:
- The JIRA for this proposal: https://issues.apache.org/jira/browse/SPARK-16407
- The Structured Streaming ML JIRA: https://issues.apache.org/jira/browse/SPARK-16424
- https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
- https://github.com/apache/spark/pull/14691
- https://github.com/holdenk/spark-structured-streaming-ml

* Strictly speaking, one _could_ pass in a string of Java code and then compile it inside the Sink with Janino - but that clearly isn't reasonable.

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau