On Thu, Oct 6, 2016 at 12:37 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> [snip!]
> Relatedly, I'm curious to hear more about the types of questions you are
> getting. I think the dev list is a good place to discuss applications and
> if/how structured streaming can handle them.
>
Details are difficult to share, but I can give the general gist without revealing anything proprietary. I find myself having the same conversation about twice a month. The other party to the conversation is an IBM product group or an IBM client who is using Spark for batch and interactive analytics. Their overall application has or will soon have a real-time component. They want information from the IBM Spark Technology Center on the relative merits of different streaming systems for that part of the application or product.

Usually, the options on the table are Spark Streaming/Structured Streaming and another more "pure" streaming system like Apache Flink or IBM Streams. Right now, the best recommendation I can give is: "Spark Streaming has known shortcomings; here's a list. If you are certain that your application can work within these constraints, then we recommend you give Spark Streaming a try. Otherwise, check back 12-18 months from now, when Structured Streaming will hopefully provide a usable platform for your application."

The specific unmet requirements that are most relevant to these conversations are: latency, throughput, stability under bursty loads, complex event processing support, state management, job graph scheduling, and access to the newer Dataset-based Spark APIs. Again, apologies for not being able to name names, but here's a redacted description of why these requirements are relevant.

*Delivering low latency, high throughput, and stability simultaneously:* Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time. I know of two parties that have abandoned Spark Streaming because "pick any two" is not an acceptable answer to the latency/throughput/stability question for them.
*Complex event processing and state management:* Several groups I've talked to want to run a large number (tens or hundreds of thousands now, millions in the near future) of state machines over low-rate partitions of a high-rate stream. Covering these use cases translates roughly into three sub-requirements: maintaining lots of persistent state efficiently, feeding tuples to each state machine in the right order, and exposing convenient programmer APIs for complex event detection and signal processing tasks.

*Job graph scheduling and access to Dataset APIs:* These requirements come up in the context of groups who want to do streaming ETL. The general application profile that I've seen involves updating a large number of materialized views based on a smaller number of streams, using a mixture of SQL and nontraditional processing. The extended SQL that the Dataset APIs provide is useful in these applications. As for scheduling needs, it's common for multiple output tables to share intermediate computations. Users need an easy way to ensure that this shared computation happens only once, while controlling the peak memory utilization during each batch.

Hope this helps.

Fred
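P.S. For anyone unfamiliar with the "many state machines over a keyed stream" pattern mentioned above, here is a minimal sketch in plain Python, with Spark left out entirely so the control flow is visible. All names here are made up for illustration; in Spark terms this is roughly the shape of state that something like the DStream mapWithState operator has to maintain per key, at much larger scale:

```python
from collections import defaultdict

# One tiny state machine per key. This example machine detects the
# event sequence "login" followed by "purchase" (names are illustrative).
def step(state, event):
    """Advance one state machine; return (new_state, detection_or_None)."""
    if state == "start" and event == "login":
        return "logged_in", None
    if state == "logged_in" and event == "purchase":
        return "start", "login->purchase"
    return state, None  # any other event leaves the machine where it is

def run(stream):
    """Feed (key, event) tuples, in per-key order, to per-key machines."""
    states = defaultdict(lambda: "start")  # persistent state, one slot per key
    detections = []
    for key, event in stream:
        states[key], hit = step(states[key], event)
        if hit:
            detections.append((key, hit))
    return detections

stream = [("u1", "login"), ("u2", "login"), ("u1", "purchase"),
          ("u2", "browse"), ("u2", "purchase")]
print(run(stream))  # [('u1', 'login->purchase'), ('u2', 'login->purchase')]
```

The three sub-requirements fall straight out of this sketch: the `states` table is the persistent per-key state (which a real system must keep efficiently for millions of keys), the loop must deliver each key's events in order, and `step` is the part that wants a richer CEP-style API than hand-written transitions.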