+0 (non-binding)

I think there are clear benefits to unifying all the Spark-internal
datasources into a common public API.  It will serve as a forcing
function to ensure that those internal datasources aren't advantaged
over datasources developed externally as plugins to Spark, and that all
Spark features are available to all datasources.

But I also think this read-path proposal avoids the more difficult
questions around how to continue pushing datasource performance forward.
James Baker (my colleague) had a number of questions about advanced
pushdowns (e.g. combined sorting and filtering), and Reynold also noted
that pushdown of aggregates and joins is desirable over longer
timeframes as well.  The Spark community has seen similar requests:
aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and
arbitrary plan pushdown in SPARK-12449.  Clearly a number of people are
interested in this kind of performance work for datasources.
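
To make the "combined sorting and filtering" case concrete, here is a
rough sketch of what such a hook could look like.  Every name below is
purely illustrative and not part of the proposed API, apart from the
existing org.apache.spark.sql.sources.Filter class:

import org.apache.spark.sql.sources.Filter

// Illustrative only: a single hook that sees the filters and the
// requested sort order together, so a source can e.g. pick an index
// that satisfies both at once.
case class SortOrder(column: String, ascending: Boolean)

trait SupportsCombinedPushdown {
  // Returns the filters Spark must still evaluate itself, plus whether
  // the source can guarantee the requested output ordering.
  def pushFiltersAndSort(
      filters: Array[Filter],
      requiredOrdering: Seq[SortOrder]): (Array[Filter], Boolean)
}

The point is that the source sees both requirements together, rather
than through independent mixins that cannot coordinate with each other.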

To leave enough space for datasource developers to continue
experimenting with advanced interactions between Spark and their
datasources, I'd propose we provide some sort of escape valve that lets
these datasources keep pushing the boundaries without forking Spark.
Possibly that looks like an additional unsupported/unstable interface
that pushes down an entire (unstable-API) logical plan and is expected
to break its API on every release: Spark attempts this full-plan
pushdown, and if that fails it ignores the result and continues with
the rest of the V2 API for compatibility.  Or maybe it looks like
something else we don't know of yet.  Possibly this falls outside the
desired goals for the V2 API and should instead be a separate SPIP.
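
As a strawman (every name here is hypothetical, and LogicalPlan is
Catalyst's internal, deliberately non-stable plan representation), the
escape valve could be as small as:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Strawman only: an explicitly unstable mixin.  Spark offers the whole
// (sub)plan to the source; returning None means "not supported here",
// and Spark falls back to the stable V2 read path unchanged.
trait UnstableSupportsFullPlanPushdown {
  def pushDownPlan(plan: LogicalPlan): Option[RDD[InternalRow]]
}

The key property is that the fallback is lossless: if the source
declines or the pushdown fails, the query runs exactly as it would have
under the stable V2 API.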

If we had a plan for this kind of escape valve for advanced datasource
developers, I'd be an unequivocal +1.  Right now it feels like this SPIP
is focused more on getting the basics right for what many datasources
are already doing with API V1 combined with other private APIs than on
pushing forward the state of the art for performance.

Andrew

On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> +1 (non-binding)
>
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write path of
> data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
> Data Source V2 read path only.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the
> read path is:
> https://github.com/apache/spark/pull/19136
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>