> On a personal note, I'm quite surprised that this is all the progress in
> Structured Streaming over the last three months since 2.0 was released. I
> was under the impression that this was one of the biggest things that the
> Spark community actively works on, but that is clearly not the case, given
> that most of the activity is a couple of (very important) JIRAs from the
> last several weeks. Not really sure how to parse that yet...
>
> I think having some clearer, prioritized roadmap going forward will be a
> good first step to recalibrate expectations for 2.2 and for graduating
> from an alpha state.
I totally agree we should spend more time making sure the roadmap is clear to everyone, but I disagree with this characterization. There is a lot of work happening in Structured Streaming. In this next release (2.1, as well as 2.0.1 and 2.0.2) it has been more about stability and scalability than user-visible features. We are running it for real on production jobs and working to make it rock solid (everyone can help here!). Just look at the list of commits <https://github.com/apache/spark/commits/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming>.

Regarding the timeline to graduation, I think it's instructive to look at what happened with Spark SQL:

- Spark 1.0 - added to Spark
- Spark 1.1 - basic APIs and stability
- Spark 1.2 - stabilization of the Data Source APIs for plugging in external sources
- Spark 1.3 - GA
- Spark 1.4-1.5 - Tungsten
- Spark 1.6 - fully codegened / memory managed
- Spark 2.0 - whole-stage codegen, experimental streaming support

We probably won't follow that exactly, and we are clearly not done yet. However, I think the trajectory is good.

> But Streaming Query sources
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L41>
> are still designed with microbatches in mind; can this be removed, leaving
> offset tracking to the executors?

It certainly could be, but what Matei is saying is that user code should be able to upgrade seamlessly. A lot of early focus and thought went toward this goal. However, these kinds of concerns are exactly why I think it is premature to expose these internal APIs to end users. Let's build several Sources and Sinks internally and figure out what works and what doesn't. Spark SQL had JSON, Hive, Parquet, and RDDs before we opened up the APIs. That experience allowed us to keep the Data Source API stable into 2.x and build a large library of connectors.
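To make the "designed with microbatches in mind" point concrete, here is a toy sketch of the pull model the linked Source trait implies. This is not Spark's actual API: `Offset` here is a placeholder case class and `Seq[String]` stands in for a `DataFrame`, but the shape of `getOffset`/`getBatch` mirrors the trait — the driver, not the executors, asks for the latest offset and then pulls a discrete batch of data up to it.

```scala
// Placeholder standing in for Spark's Offset type (assumption, not the real class).
case class Offset(value: Long)

// Simplified analogue of the Source trait: Seq[String] stands in for DataFrame.
trait Source {
  // Latest offset available, or None if no data has arrived yet.
  def getOffset: Option[Offset]
  // All data after `start` (exclusive) up to `end` (inclusive), as one batch.
  def getBatch(start: Option[Offset], end: Offset): Seq[String]
  def stop(): Unit
}

// A trivial in-memory source over a fixed sequence of records.
class MemorySource(data: Seq[String]) extends Source {
  def getOffset: Option[Offset] =
    if (data.isEmpty) None else Some(Offset(data.size - 1))
  def getBatch(start: Option[Offset], end: Offset): Seq[String] = {
    val from = start.map(_.value.toInt + 1).getOrElse(0)
    data.slice(from, end.value.toInt + 1)
  }
  def stop(): Unit = ()
}

// The driver runs one microbatch: ask for the latest offset, then pull
// the whole range as a single batch. Offset tracking lives on the driver.
val source = new MemorySource(Seq("a", "b", "c"))
val end    = source.getOffset.get
val batch  = source.getBatch(None, end)
println(batch.mkString(",")) // prints "a,b,c"
```

The microbatch boundary is baked into the interface itself: a source that pushed records continuously, with executors tracking their own positions, would need a different contract entirely, which is part of why stabilizing this API prematurely would be costly.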