Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

蒋星博 Thu, 07 Sep 2017 12:47:31 -0700

+1


On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <[email protected]> wrote:

> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <[email protected]>
> wrote:
>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <[email protected]>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
>>>
>>> "Right now it feels like this SPIP is focused more on getting the basics
>>> right for what many datasources are already doing in API V1 combined with
>>> other private APIs, vs pushing forward state of the art for performance."
>>>
>>> I think that’s the right approach for this SPIP. We can add the support
>>> you’re talking about later with a more specific plan that doesn’t block
>>> fixing the problems that this addresses.
>>> 
>>>
>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>>> [email protected]> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> I personally believe that there is quite a big difference between
>>>> having a generic data source interface with a low surface area and pushing
>>>> down a significant part of query processing into a datasource. The latter
>>>> has a much wider surface area and will require us to stabilize most of
>>>> the internal catalyst APIs, which will be a significant burden on the
>>>> community to maintain and has the potential to slow development velocity
>>>> significantly. If you want to write such integrations then you should be
>>>> prepared to work with catalyst internals and own up to the fact that things
>>>> might change across minor versions (and in some cases even maintenance
>>>> releases). If you are willing to go down that road, then your best bet is
>>>> to use the already existing spark session extensions which will allow you
>>>> to write such integrations and can be used as an `escape hatch`.
>>>>
>>>>
>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <[email protected]>
>>>> wrote:
>>>>
>>>>> +0 (non-binding)
>>>>>
>>>>> I think there are benefits to unifying all the Spark-internal
>>>>> datasources into a common public API for sure.  It will serve as a forcing
>>>>> function to ensure that those internal datasources aren't advantaged vs
>>>>> datasources developed externally as plugins to Spark, and that all Spark
>>>>> features are available to all datasources.
>>>>>
>>>>> But I also think this read-path proposal avoids the more difficult
>>>>> questions around how to continue pushing datasource performance forwards.
>>>>> James Baker (my colleague) had a number of questions about advanced
>>>>> pushdowns (combined sorting and filtering), and Reynold also noted that
>>>>> pushdown of aggregates and joins is desirable on longer timeframes as
>>>>> well.  The Spark community saw similar requests, for aggregate pushdown in
>>>>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>>>>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>>>>> performance work for datasources.
>>>>>
>>>>> To leave enough space for datasource developers to continue
>>>>> experimenting with advanced interactions between Spark and their
>>>>> datasources, I'd propose we leave some sort of escape valve that enables
>>>>> these datasources to keep pushing the boundaries without forking Spark.
>>>>> Possibly that looks like an additional unsupported/unstable interface that
>>>>> pushes down an entire (unstable API) logical plan, which is expected to
>>>>> break API on every release.  (Spark attempts this full-plan pushdown, and
>>>>> if that fails Spark ignores it and continues on with the rest of the V2
>>>>> API for compatibility).  Or maybe it looks like something else that we don't
>>>>> know of yet.  Possibly this falls outside of the desired goals for the V2
>>>>> API and instead should be a separate SPIP.
>>>>>
>>>>> If we had a plan for this kind of escape valve for advanced datasource
>>>>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>>>>> focused more on getting the basics right for what many datasources are
>>>>> already doing in API V1 combined with other private APIs, vs pushing
>>>>> forward state of the art for performance.
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>>
>>>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In the previous discussion, we decided to split the read and write
>>>>>> path of data source v2 into 2 SPIPs, and I'm sending this email to call a
>>>>>> vote for Data Source V2 read path only.
>>>>>>
>>>>>> The full document of the Data Source API V2 is:
>>>>>>
>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>
>>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>>> the read path is:
>>>>>> https://github.com/apache/spark/pull/19136
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following
>>>>>> technical reasons.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Herman van Hövell
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks Inc.
>>>>
>>>> [email protected]
>>>>
>>>> +31 6 420 590 27
>>>>
>>>> databricks.com
>>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
