This is super helpful, thanks for writing it up!
> *Delivering low latency, high throughput, and stability simultaneously:* Right now, our own tests indicate you can get at most two of these characteristics out of Spark Streaming at the same time. I know of two parties that have abandoned Spark Streaming because "pick any two" is not an acceptable answer to the latency/throughput/stability question for them.

Agree, this should be the major focus.

> *Complex event processing and state management:* Several groups I've talked to want to run a large number (tens or hundreds of thousands now, millions in the near future) of state machines over low-rate partitions of a high-rate stream. Covering these use cases translates roughly into three sub-requirements: maintaining lots of persistent state efficiently, feeding tuples to each state machine in the right order, and exposing convenient programmer APIs for complex event detection and signal processing tasks.

I've heard this one too, but don't know of anyone actively working on it. Would be awesome to open a JIRA and start discussing what the APIs would look like (one rough sketch of a possible shape is at the end of this mail).

> *Job graph scheduling and access to Dataset APIs:* These requirements come up in the context of groups who want to do streaming ETL. The general application profile that I've seen involves updating a large number of materialized views based on a smaller number of streams, using a mixture of SQL and nontraditional processing. The extended SQL that the Dataset APIs provide is useful in these applications. As for scheduling needs, it's common for multiple output tables to share intermediate computations. Users need an easy way to ensure that this shared computation happens only once, while controlling the peak memory utilization during each batch.

This sounds like two separate things to me: high-level APIs (are streaming DataFrames / Datasets missing anything?) and multi-query optimization for streams. I've been thinking about the latter. I think we probably want to crush latency/throughput/stability in the simpler case first, but after that there is a lot of machinery already in SQL we can reuse (e.g. the sameResult calculations used for caching). The second sketch below shows the manual workaround users apply today.
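
To make the state-machine discussion a bit more concrete, here is a minimal sketch of how per-key state machines can already be expressed with the DStream mapWithState API (Spark 1.6+). Event, MachineState, step(), and the alert rule are all made-up placeholders, and this says nothing about the in-order delivery sub-requirement, which mapWithState alone doesn't guarantee:

    // Sketch only: per-key state machines over a keyed DStream via mapWithState.
    // Event, MachineState, step(), and the alert rule are hypothetical.
    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    case class Event(deviceId: String, reading: Double)
    case class MachineState(count: Long, alarmed: Boolean)

    // Toy transition function for the per-key state machine.
    def step(s: MachineState, e: Event): MachineState =
      MachineState(s.count + 1, alarmed = s.alarmed || e.reading > 100.0)

    def trackStateMachines(events: DStream[(String, Event)]): DStream[Option[String]] = {
      val spec = StateSpec.function(
        (deviceId: String, event: Option[Event], state: State[MachineState]) => {
          val current = state.getOption.getOrElse(MachineState(0L, alarmed = false))
          val next = event.map(step(current, _)).getOrElse(current)
          state.update(next)
          // Emit an alert only on the transition into the alarmed state.
          if (next.alarmed && !current.alarmed) Some(s"alert:$deviceId") else None
        })
      events.mapWithState(spec)
    }

Whatever new API we discuss on a JIRA would presumably need to scale this pattern to millions of keys and give stronger ordering guarantees than the above.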
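On the multi-query side, this is roughly the manual workaround I see users applying today (shown with batch DataFrames; table and column names are invented): persist the shared intermediate once, build each materialized view from it, then unpersist to keep peak memory bounded. The ask is essentially for the engine to do this automatically for streaming queries, which is where reusing the existing SQL machinery (sameResult etc.) would come in:

    // Sketch only: manually sharing one intermediate result across several
    // output tables. Column and table names are hypothetical.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.storage.StorageLevel

    def refreshViews(raw: DataFrame): Unit = {
      // Expensive shared computation, persisted so it runs only once.
      val cleaned = raw
        .filter("event_time IS NOT NULL")
        .dropDuplicates("event_id")
        .persist(StorageLevel.MEMORY_AND_DISK)
      try {
        // Two materialized views built from the same intermediate result.
        cleaned.groupBy("user_id").count()
          .write.mode("overwrite").saveAsTable("user_counts")
        cleaned.groupBy("page_id").count()
          .write.mode("overwrite").saveAsTable("page_counts")
      } finally {
        cleaned.unpersist() // bound peak memory before the next batch
      }
    }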