+1 (non-binding) On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP. I think it's > fine to work out the minor details of the API during review. > > Bryan > > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st> > wrote: > >> Hi all, >> >> Thank you for voting and suggestions. >> >> As Wenchen mentioned and also we're discussing at JIRA, we need to >> discuss the size hint for the 0-parameter UDF. >> But I believe we got a consensus about the basic APIs except for the size >> hint, I'd like to submit a pr based on the current proposal and continue >> discussing in its review. >> >> https://github.com/apache/spark/pull/19147 >> >> I'd keep this vote open to wait for more opinions. >> >> Thanks. >> >> >> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> >>> +1 on the design and proposed API. >>> >>> One detail I'd like to discuss is the 0-parameter UDF, how we can >>> specify the size hint. This can be done in the PR review though. >>> >>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com> >>> wrote: >>> >>>> +1 on this and like the suggestion of type in string form. >>>> >>>> Would it be correct to assume there will be data type check, for >>>> example the returned pandas data frame column data types match what are >>>> specified. We have seen quite a bit of issues/confusions with that in R. >>>> >>>> Would it make sense to have a more generic decorator name so that it >>>> could also be useable for other efficient vectorized format in the future? >>>> Or do we anticipate the decorator to be format specific and will have more >>>> in the future? >>>> >>>> ------------------------------ >>>> *From:* Reynold Xin <r...@databricks.com> >>>> *Sent:* Friday, September 1, 2017 5:16:11 AM >>>> *To:* Takuya UESHIN >>>> *Cc:* spark-dev >>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python >>>> >>>> Ok, thanks. >>>> >>>> +1 on the SPIP for scope etc >>>> >>>> >>>> On API details (will deal with in code reviews as well but leaving a >>>> note here in case I forget) >>>> >>>> 1. I would suggest having the API also accept data type specification >>>> in string form. It is usually simpler to say "long" then "LongType()". >>>> >>>> 2. Think about what error message to show when the rows numbers don't >>>> match at runtime. >>>> >>>> >>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st> >>>> wrote: >>>> >>>>> Yes, the aggregation is out of scope for now. >>>>> I think we should continue discussing the aggregation at JIRA and we >>>>> will be adding those later separately. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> >>>>> wrote: >>>>> >>>>>> Is the idea aggregate is out of scope for the current effort and we >>>>>> will be adding those later? >>>>>> >>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> We've been discussing to support vectorized UDFs in Python and we >>>>>>> almost got a consensus about the APIs, so I'd like to summarize and >>>>>>> call for a vote. >>>>>>> >>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not >>>>>>> APIs for vectorized UDAFs or Window operations. >>>>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-21190 >>>>>>> >>>>>>> >>>>>>> *Proposed API* >>>>>>> >>>>>>> We introduce a @pandas_udf decorator (or annotation) to define >>>>>>> vectorized UDFs which takes one or more pandas.Series or one >>>>>>> integer value meaning the length of the input value for 0-parameter >>>>>>> UDFs. >>>>>>> The return value should be pandas.Series of the specified type and >>>>>>> the length of the returned value should be the same as input value. >>>>>>> >>>>>>> We can define vectorized UDFs as: >>>>>>> >>>>>>> @pandas_udf(DoubleType()) >>>>>>> def plus(v1, v2): >>>>>>> return v1 + v2 >>>>>>> >>>>>>> or we can define as: >>>>>>> >>>>>>> plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >>>>>>> >>>>>>> We can use it similar to row-by-row UDFs: >>>>>>> >>>>>>> df.withColumn('sum', plus(df.v1, df.v2)) >>>>>>> >>>>>>> As for 0-parameter UDFs, we can define and use as: >>>>>>> >>>>>>> @pandas_udf(LongType()) >>>>>>> def f0(size): >>>>>>> return pd.Series(1).repeat(size) >>>>>>> >>>>>>> df.select(f0()) >>>>>>> >>>>>>> >>>>>>> >>>>>>> The vote will be up for the next 72 hours. Please reply with your >>>>>>> vote: >>>>>>> >>>>>>> +1: Yeah, let's go forward and implement the SPIP. >>>>>>> +0: Don't really care. >>>>>>> -1: I don't think this is a good idea because of the following technical >>>>>>> reasons. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> -- >>>>>>> Takuya UESHIN >>>>>>> Tokyo, Japan >>>>>>> >>>>>>> http://twitter.com/ueshin >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Takuya UESHIN >>>>> Tokyo, Japan >>>>> >>>>> http://twitter.com/ueshin >>>>> >>>> >>> >> >> >> -- >> Takuya UESHIN >> Tokyo, Japan >> >> http://twitter.com/ueshin >> > > -- Sameer Agarwal Software Engineer | Databricks Inc. http://cs.berkeley.edu/~sameerag