Hi Spark Developers,

After some discussion on SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> (and on the PR <https://github.com/apache/spark/pull/14691>), we've decided to bring this back to the developer list. SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> itself comes out of our early work on SPARK-16424 <https://issues.apache.org/jira/browse/SPARK-16424> to enable ML with the new Structured Streaming API.

SPARK-16407 proposes extending the current DataStreamWriter API to let users specify a concrete instance of a StreamSinkProvider <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.sources.StreamSinkProvider>. This makes it easier for users to create sinks that are configured with things besides strings (for example, lambdas). An example of something like this already inside Spark is the ForeachSink.
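To make the shape of the proposal concrete, here is a minimal sketch. These are hypothetical, simplified stand-ins, not the real Spark types: the real StreamSinkProvider.createSink also takes a SQLContext, partition columns, and an OutputMode, and DataStreamWriter has no `sink(...)` method today.

```scala
// Hypothetical, simplified stand-ins -- NOT the real Spark API.
trait Sink {
  def addBatch(batchId: Long, rows: Seq[String]): Unit
}

trait StreamSinkProvider {
  def createSink(options: Map[String, String]): Sink
}

// A provider configured with a lambda -- state that a Map[String, String]
// of options (today's only configuration channel) cannot carry.
class CallbackSinkProvider(callback: Seq[String] => Unit) extends StreamSinkProvider {
  def createSink(options: Map[String, String]): Sink = new Sink {
    def addBatch(batchId: Long, rows: Seq[String]): Unit = callback(rows)
  }
}

// The shape SPARK-16407 proposes: the writer accepts a pre-configured
// provider instance directly, instead of a class name instantiated via
// reflection and configured only through string options.
class DataStreamWriter {
  private var provider: Option[StreamSinkProvider] = None
  def sink(p: StreamSinkProvider): DataStreamWriter = { provider = Some(p); this }
  def start(): Sink = provider.get.createSink(Map.empty)
}
```

With something like this, `writer.sink(new CallbackSinkProvider(rows => ...))` would carry the lambda straight into the sink, which no string-keyed options map can do.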
We have been working on adding support for online learning in Structured Streaming, similar to what Spark Streaming and MLlib provide today; details are in SPARK-16424 <https://issues.apache.org/jira/browse/SPARK-16424>. Along the way, we noticed that there is currently no way for code running on the driver to access the streaming output of a Structured Streaming query (in our case ideally as a Dataset or RDD, but regardless of the underlying data structure). In our specific case, we wanted to update a model on the driver using aggregates computed by a Structured Streaming query, and many other applications will have similar requirements. For example, there is no way (outside of using private Spark internals)* to implement a console sink with a user-supplied formatting function, configure a templated or generic sink at runtime, trigger a custom Python call-back, or even implement the ForeachSink outside of Spark.

For work inside of Spark to enable Structured Streaming with ML we clearly don't need SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407>, since we can access the internals directly (although it would be cleaner not to have to). But if we want to empower people working outside the Spark codebase itself with Structured Streaming, I think we need to provide some mechanism for this, and it would be great to see what options/ideas the community can come up with.

One of the arguments against SPARK-16407 <https://issues.apache.org/jira/browse/SPARK-16407> seems to be mostly that it exposes the Sink API, which is implemented using micro-batching. The counter-argument is that the Sink API is already exposed: instead of passing in an instance, the user passes in a class name which is then instantiated through reflection, with configuration parameters passed in as a map of strings.
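The online-learning use case above can be sketched the same way. Again these are simplified stand-ins (the real addBatch receives a DataFrame, not a Seq of doubles), and RunningMeanModel is a toy model invented for illustration; the point is only that a sink holding a direct reference to driver-side state lets the driver consume streaming output.

```scala
// Simplified stand-in for the Sink interface -- NOT the real Spark API.
trait Sink {
  def addBatch(batchId: Long, aggregates: Seq[Double]): Unit
}

// A toy "model": a running mean updated incrementally as batches arrive.
class RunningMeanModel {
  private var count = 0L
  private var mean = 0.0
  def update(values: Seq[Double]): Unit = values.foreach { v =>
    count += 1
    mean += (v - mean) / count
  }
  def current: Double = mean
}

// Because the sink holds a direct reference to the model object on the
// driver, each micro-batch's aggregates can update it in place -- exactly
// what a reflectively instantiated, string-configured sink cannot express.
class ModelUpdatingSink(model: RunningMeanModel) extends Sink {
  def addBatch(batchId: Long, aggregates: Seq[Double]): Unit =
    model.update(aggregates)
}
```

This is the kind of sink that SPARK-16407's instance-passing API would make possible outside of Spark's internals.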
Personally, I think we should expose a more strongly typed API instead of depending on strings for all configuration. Even if the Sink API itself needs to change if/when Structured Streaming moves away from micro-batching, we would still likely want to let users provide the typed interface, to give Sink creators more flexibility with configuration.

Now, obviously this is based on my understanding of the lay of the land, which could be a little off since the Spark Structured Streaming design docs and JIRAs don't seem to be actively updated - so I'd love to know what assumptions I've made that don't match the current plans for Structured Streaming.

Cheers,

Holden :)

Related Links:
- The JIRA for this proposal: https://issues.apache.org/jira/browse/SPARK-16407
- The Structured Streaming ML JIRA: https://issues.apache.org/jira/browse/SPARK-16424
- https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
- https://github.com/apache/spark/pull/14691
- https://github.com/holdenk/spark-structured-streaming-ml

* Strictly speaking, one _could_ pass in a string of Java code and then compile it inside the Sink with Janino - but that clearly isn't reasonable.

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau