+1
Reynold Xin <r...@databricks.com> wrote on Thu, Sep 7, 2017 at 12:04 PM:
> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
>>>
>>> "Right now it feels like this SPIP is focused more on getting the basics
>>> right for what many datasources are already doing in API V1 combined with
>>> other private APIs, vs pushing forward state of the art for performance."
>>>
>>> I think that’s the right approach for this SPIP. We can add the support
>>> you’re talking about later with a more specific plan that doesn’t block
>>> fixing the problems that this addresses.
>>>
>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier
>>> <hvanhov...@databricks.com> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> I personally believe that there is quite a big difference between
>>>> having a generic data source interface with a low surface area and
>>>> pushing down a significant part of query processing into a datasource.
>>>> The latter has a much wider surface area and will require us to
>>>> stabilize most of the internal Catalyst APIs, which will be a
>>>> significant burden on the community to maintain and has the potential
>>>> to slow development velocity significantly. If you want to write such
>>>> integrations then you should be prepared to work with Catalyst
>>>> internals and own up to the fact that things might change across minor
>>>> versions (and in some cases even maintenance releases). If you are
>>>> willing to go down that road, then your best bet is to use the already
>>>> existing Spark session extensions, which will allow you to write such
>>>> integrations and can be used as an `escape hatch`.
>>>>
>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com>
>>>> wrote:
>>>>
>>>>> +0 (non-binding)
>>>>>
>>>>> I think there are benefits to unifying all the Spark-internal
>>>>> datasources into a common public API for sure. It will serve as a
>>>>> forcing function to ensure that those internal datasources aren't
>>>>> advantaged vs datasources developed externally as plugins to Spark,
>>>>> and that all Spark features are available to all datasources.
>>>>>
>>>>> But I also think this read-path proposal avoids the more difficult
>>>>> questions around how to continue pushing datasource performance
>>>>> forwards. James Baker (my colleague) had a number of questions about
>>>>> advanced pushdowns (combined sorting and filtering), and Reynold also
>>>>> noted that pushdown of aggregates and joins is desirable on longer
>>>>> timeframes as well. The Spark community saw similar requests, for
>>>>> aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and
>>>>> arbitrary plan pushdown in SPARK-12449. Clearly a number of people are
>>>>> interested in this kind of performance work for datasources.
>>>>>
>>>>> To leave enough space for datasource developers to continue
>>>>> experimenting with advanced interactions between Spark and their
>>>>> datasources, I'd propose we leave some sort of escape valve that
>>>>> enables these datasources to keep pushing the boundaries without
>>>>> forking Spark. Possibly that looks like an additional
>>>>> unsupported/unstable interface that pushes down an entire (unstable
>>>>> API) logical plan, which is expected to break API on every release.
>>>>> (Spark attempts this full-plan pushdown, and if that fails Spark
>>>>> ignores it and continues on with the rest of the V2 API for
>>>>> compatibility). Or maybe it looks like something else that we don't
>>>>> know of yet. Possibly this falls outside of the desired goals for the
>>>>> V2 API and instead should be a separate SPIP.
>>>>>
>>>>> If we had a plan for this kind of escape valve for advanced datasource
>>>>> developers I'd be an unequivocal +1. Right now it feels like this SPIP
>>>>> is focused more on getting the basics right for what many datasources
>>>>> are already doing in API V1 combined with other private APIs, vs
>>>>> pushing forward state of the art for performance.
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati
>>>>> <suresh.thalam...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In the previous discussion, we decided to split the read and write
>>>>>> path of data source v2 into 2 SPIPs, and I'm sending this email to
>>>>>> call a vote for Data Source V2 read path only.
>>>>>>
>>>>>> The full document of the Data Source API V2 is:
>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>
>>>>>> The ready-for-review PR that implements the basic infrastructure for
>>>>>> the read path is:
>>>>>> https://github.com/apache/spark/pull/19136
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following
>>>>>> technical reasons.
>>>>>>
>>>>>> Thanks!
>>>>
>>>> --
>>>> Herman van Hövell
>>>> Software Engineer
>>>> Databricks Inc.
>>>> hvanhov...@databricks.com
>>>> +31 6 420 590 27
>>>> databricks.com
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix