On Thu, Oct 6, 2016 at 12:37 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> [snip!]
> Relatedly, I'm curious to hear more about the types of questions you are
> getting. I think the dev list is a good place to discuss applications and
> if/how structured streaming can handle them.
>
Details are difficult to share, but I can give the general gist without revealing anything proprietary. I find myself having the same conversation about twice a month. The other party to the conversation is an IBM product group or an IBM client who is using Spark for batch and interactive analytics. Their overall application has or will soon have a real-time component. They want information from the IBM Spark Technology Center on the relative merits of different streaming systems for that part of the application or product.

Usually, the options on the table are Spark Streaming/Structured Streaming and another more "pure" streaming system like Apache Flink or IBM Streams. Right now, the best recommendation I can give is: "Spark Streaming has known shortcomings; here's a list. If you are certain that your application can work within these constraints, then we recommend you give Spark Streaming a try. Otherwise, check back 12-18 months from now, when Structured Streaming will hopefully provide a usable platform for your application."

The specific unmet requirements that are most relevant to these conversations are: latency, throughput, stability under bursty loads, complex event processing support, state management, job graph scheduling, and access to the newer Dataset-based Spark APIs. Again, apologies for not being able to name names, but here's a redacted description of why these requirements are relevant.

*Delivering low latency, high throughput, and stability simultaneously:* Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time. I know of two parties that have abandoned Spark Streaming because "pick any two" is not an acceptable answer to the latency/throughput/stability question for them.
*Complex event processing and state management:* Several groups I've talked to want to run a large number (tens or hundreds of thousands now, millions in the near future) of state machines over low-rate partitions of a high-rate stream. Covering these use cases translates roughly into three sub-requirements: maintaining lots of persistent state efficiently, feeding tuples to each state machine in the right order, and exposing convenient programmer APIs for complex event detection and signal processing tasks.

*Job graph scheduling and access to Dataset APIs:* These requirements come up in the context of groups who want to do streaming ETL. The general application profile that I've seen involves updating a large number of materialized views based on a smaller number of streams, using a mixture of SQL and nontraditional processing. The extended SQL that the Dataset APIs provide is useful in these applications. As for scheduling needs, it's common for multiple output tables to share intermediate computations. Users need an easy way to ensure that this shared computation happens only once, while controlling the peak memory utilization during each batch.

Hope this helps.

Fred
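P.S. For anyone unfamiliar with the "many state machines over a keyed stream" pattern mentioned above, here is a minimal sketch in plain Python, with Spark left out entirely so the control flow is visible. All names here are made up for illustration; in Spark terms this is roughly the shape of state that something like the DStream mapWithState operator has to maintain per key, at much larger scale:

```python
from collections import defaultdict

# One tiny state machine per key. This example machine detects the
# event sequence "login" followed by "purchase" (names are illustrative).
def step(state, event):
    """Advance one state machine; return (new_state, detection_or_None)."""
    if state == "start" and event == "login":
        return "logged_in", None
    if state == "logged_in" and event == "purchase":
        return "start", "login->purchase"
    return state, None  # any other event leaves the machine where it is

def run(stream):
    """Feed (key, event) tuples, in per-key order, to per-key machines."""
    states = defaultdict(lambda: "start")  # persistent state, one slot per key
    detections = []
    for key, event in stream:
        states[key], hit = step(states[key], event)
        if hit:
            detections.append((key, hit))
    return detections

stream = [("u1", "login"), ("u2", "login"), ("u1", "purchase"),
          ("u2", "browse"), ("u2", "purchase")]
print(run(stream))  # [('u1', 'login->purchase'), ('u2', 'login->purchase')]
```

The three sub-requirements fall straight out of this sketch: the `states` table is the persistent per-key state (which a real system must keep efficiently for millions of keys), the loop must deliver each key's events in order, and `step` is the part that wants a richer CEP-style API than hand-written transitions.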