+1
Xiao Li wrote
> +1
>
> Xiao
>
> On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <matei.zaharia@> wrote:
>
>> +1 (binding)
>>
>> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gurwls223@> wrote:
>> >
>> > +1 (non-binding)
>> >
>> >
>> > 2017-09-12 9:52 GMT+09:00 Yin Huai <yhuai@>:
>> > +1
>> >
>> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sameer@> wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutlerb@> wrote:
>> > +1 (non-binding) for the goals and non-goals of this SPIP. I think
>> > it's fine to work out the minor details of the API during review.
>> >
>> > Bryan
>> >
>> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ueshin@> wrote:
>> > Hi all,
>> >
>> > Thank you for voting and for the suggestions.
>> >
>> > As Wenchen mentioned, and as we are also discussing in JIRA, we need
>> > to discuss the size hint for the 0-parameter UDF.
>> > But I believe we have reached consensus on the basic APIs except for
>> > the size hint, so I'd like to submit a PR based on the current
>> > proposal and continue the discussion in its review.
>> >
>> > https://github.com/apache/spark/pull/19147
>> >
>> > I'll keep this vote open to wait for more opinions.
>> >
>> > Thanks.
>> >
>> >
>> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0fan@> wrote:
>> > +1 on the design and proposed API.
>> >
>> > One detail I'd like to discuss is the 0-parameter UDF and how we can
>> > specify the size hint. This can be done in the PR review though.
>> >
>> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheung_m@> wrote:
>> > +1 on this, and I like the suggestion of accepting the type in
>> > string form.
>> >
>> > Would it be correct to assume there will be a data type check, for
>> > example that the returned pandas data frame column data types match
>> > what is specified? We have seen quite a few issues and some
>> > confusion with that in R.
>> >
>> > Would it make sense to have a more generic decorator name so that it
>> > could also be usable for other efficient vectorized formats in the
>> > future? Or do we anticipate the decorator to be format specific, with
>> > more of them to come in the future?
>> >
>> > From: Reynold Xin <rxin@>
>> > Sent: Friday, September 1, 2017 5:16:11 AM
>> > To: Takuya UESHIN
>> > Cc: spark-dev
>> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>> >
>> > Ok, thanks.
>> >
>> > +1 on the SPIP for scope etc
>> >
>> >
>> > On API details (will deal with in code reviews as well but leaving a
>> > note here in case I forget)
>> >
>> > 1. I would suggest having the API also accept the data type
>> > specification in string form. It is usually simpler to say "long"
>> > than "LongType()".
>> >
>> > 2. Think about what error message to show when the row counts don't
>> > match at runtime.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ueshin@> wrote:
>> > Yes, aggregation is out of scope for now.
>> > I think we should continue discussing aggregation in JIRA, and we
>> > will add it later, separately.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rxin@> wrote:
>> > Is the idea that aggregation is out of scope for the current effort
>> > and that we will add it later?
>> >
>> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ueshin@> wrote:
>> > Hi all,
>> >
>> > We've been discussing support for vectorized UDFs in Python and have
>> > almost reached consensus on the APIs, so I'd like to summarize and
>> > call for a vote.
>> >
>> > Note that this vote should focus on APIs for vectorized UDFs, not
>> > APIs for vectorized UDAFs or Window operations.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-21190
>> >
>> >
>> > Proposed API
>> >
>> > We introduce a @pandas_udf decorator (or annotation) to define
>> > vectorized UDFs, which take one or more pandas.Series or, for
>> > 0-parameter UDFs, one integer value giving the length of the input.
>> > The return value should be a pandas.Series of the specified type,
>> > and its length should be the same as that of the input.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >       return v1 + v2
>> >
>> > or we can define them as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it in the same way as row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use them as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >       return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your
>> > vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> > technical reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@.apache

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
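
Editor's sketch of the length contract in the proposal quoted above (the returned value must have as many rows as the input) and of Reynold's point 2 about the runtime error message. This is only an illustration under stated assumptions, not the Spark implementation: `length_checked` is a hypothetical helper, the error wording is invented, and plain Python lists stand in for pandas.Series so that the example runs without pandas or Spark. The 0-parameter case and its size hint are not covered.

```python
from functools import wraps


def length_checked(func):
    """Hypothetical wrapper: require output length == input length per batch."""

    @wraps(func)
    def wrapper(*columns):
        if not columns:
            raise ValueError(f"{func.__name__}: expected at least one input column")
        expected = len(columns[0])
        # All input columns of one batch must have the same number of rows.
        for position, column in enumerate(columns):
            if len(column) != expected:
                raise ValueError(
                    f"{func.__name__}: input {position} has {len(column)} rows, "
                    f"expected {expected}"
                )
        result = func(*columns)
        # The contract from the proposal: one output row per input row.
        if len(result) != expected:
            raise ValueError(
                f"{func.__name__}: returned {len(result)} rows, expected {expected}"
            )
        return result

    return wrapper


@length_checked
def plus(v1, v2):
    # Element-wise sum over one batch (lists stand in for pandas.Series here).
    return [a + b for a, b in zip(v1, v2)]


@length_checked
def drop_negatives(v):
    # Deliberately breaks the contract by filtering rows out of the batch.
    return [x for x in v if x >= 0]
```

Because the wrapper relies only on len(), the same check should apply unchanged to functions that take and return pandas.Series. Calling drop_negatives on a batch that contains a negative value raises a ValueError that names the function and both row counts, which is one possible shape for the message discussed in the thread.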