> On a personal note, I'm quite surprised that this is all the progress in
> Structured Streaming over the last three months since 2.0 was released. I
> was under the impression that this was one of the biggest things that the
> Spark community actively works on, but that is clearly not the case, given
> that most of the activity is a couple of (very important) JIRAs from the
> last several weeks. Not really sure how to parse that yet...
>
> I think having some clearer, prioritized roadmap going forward will be a
> good first step to recalibrate expectations for 2.2 and for graduating
> from an alpha state.
I totally agree we should spend more time making sure the roadmap is clear to everyone, but I disagree with this characterization. There is a lot of work happening in Structured Streaming. In this next release (2.1, as well as 2.0.1 and 2.0.2) it has been more about stability and scalability than user-visible features. We are running it for real on production jobs and working to make it rock solid (everyone can help here!). Just look at the list of commits <https://github.com/apache/spark/commits/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming>.

Regarding the timeline to graduation, I think it's instructive to look at what happened with Spark SQL:

- Spark 1.0 - added to Spark
- Spark 1.1 - basic APIs and stability
- Spark 1.2 - stabilization of the Data Source APIs for plugging in external sources
- Spark 1.3 - GA
- Spark 1.4-1.5 - Tungsten
- Spark 1.6 - fully codegened / memory managed
- Spark 2.0 - whole-stage codegen, experimental streaming support

We probably won't follow that exactly, and we are clearly not done yet. However, I think the trajectory is good.

> But Streaming Query sources
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L41>
> are still designed with microbatches in mind; can this be removed, leaving
> offset tracking to the executors?

It certainly could be, but what Matei is saying is that user code should be able to upgrade seamlessly. A lot of early focus and thought went toward this goal. However, these kinds of concerns are exactly why I think it is premature to expose these internal APIs to end users. Let's build several Sources and Sinks internally and figure out what works and what doesn't. Spark SQL had JSON, Hive, Parquet, and RDDs before we opened up the APIs. That experience allowed us to keep the Data Source API stable into 2.x and build a large library of connectors.
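To make the "designed with microbatches in mind" point concrete, here is a toy sketch of the pull model the linked Source trait implies. This is not Spark's actual API: `Offset` here is a placeholder case class and `Seq[String]` stands in for a `DataFrame`, but the shape of `getOffset`/`getBatch` mirrors the trait — the driver, not the executors, asks for the latest offset and then pulls a discrete batch of data up to it.

```scala
// Placeholder standing in for Spark's Offset type (assumption, not the real class).
case class Offset(value: Long)

// Simplified analogue of the Source trait: Seq[String] stands in for DataFrame.
trait Source {
  // Latest offset available, or None if no data has arrived yet.
  def getOffset: Option[Offset]
  // All data after `start` (exclusive) up to `end` (inclusive), as one batch.
  def getBatch(start: Option[Offset], end: Offset): Seq[String]
  def stop(): Unit
}

// A trivial in-memory source over a fixed sequence of records.
class MemorySource(data: Seq[String]) extends Source {
  def getOffset: Option[Offset] =
    if (data.isEmpty) None else Some(Offset(data.size - 1))
  def getBatch(start: Option[Offset], end: Offset): Seq[String] = {
    val from = start.map(_.value.toInt + 1).getOrElse(0)
    data.slice(from, end.value.toInt + 1)
  }
  def stop(): Unit = ()
}

// The driver runs one microbatch: ask for the latest offset, then pull
// the whole range as a single batch. Offset tracking lives on the driver.
val source = new MemorySource(Seq("a", "b", "c"))
val end    = source.getOffset.get
val batch  = source.getBatch(None, end)
println(batch.mkString(",")) // prints "a,b,c"
```

The microbatch boundary is baked into the interface itself: a source that pushed records continuously, with executors tracking their own positions, would need a different contract entirely, which is part of why stabilizing this API prematurely would be costly.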