Hi all,

Thank you for voting and suggestions.

As Wenchen mentioned, and as we're also discussing on JIRA, we need to
settle the size hint for the 0-parameter UDF.
But since I believe we have reached a consensus on the basic APIs except
for the size hint, I'd like to submit a PR based on the current proposal
and continue the discussion in its review.

    https://github.com/apache/spark/pull/19147

I'll keep this vote open to wait for more opinions.

Thanks.
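For context, here is a minimal runnable sketch of the semantics described in the proposal quoted below: a vectorized UDF receives whole columns as pandas.Series and returns a Series of the same length, and the 0-parameter variant receives only the batch size (the size hint under discussion). This is plain pandas for illustration, not the actual Spark implementation:

```python
import pandas as pd

# A vectorized UDF operates on whole columns (pandas.Series) at once,
# instead of being invoked row by row.
def plus(v1: pd.Series, v2: pd.Series) -> pd.Series:
    return v1 + v2

# The 0-parameter case receives only the batch size as an int and must
# produce a Series of exactly that length -- this is where the size
# hint being discussed comes in.
def f0(size: int) -> pd.Series:
    return pd.Series(1).repeat(size).reset_index(drop=True)

print(plus(pd.Series([1.0, 2.0]), pd.Series([10.0, 20.0])).tolist())  # [11.0, 22.0]
print(f0(3).tolist())  # [1, 1, 1]
```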


On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
> the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1 on this and like the suggestion of type in string form.
>>
>> Would it be correct to assume there will be a data type check, for example
>> that the returned pandas DataFrame column data types match what is
>> specified? We have seen quite a few issues/confusions with that in R.
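
Such a check could be sketched like this (the name-to-dtype mapping and the error wording here are purely illustrative, not Spark's actual conversion rules):

```python
import pandas as pd
import numpy as np

# Hypothetical mapping from a declared return type name to the pandas
# dtype we expect back (illustration only).
_EXPECTED_DTYPE = {"long": np.dtype("int64"), "double": np.dtype("float64")}

def check_dtype(result: pd.Series, declared: str) -> pd.Series:
    # Verify the returned Series dtype matches the declared return type.
    expected = _EXPECTED_DTYPE[declared]
    if result.dtype != expected:
        raise TypeError(
            f"pandas_udf declared return type {declared!r} "
            f"(expected dtype {expected}), but got dtype {result.dtype}")
    return result
```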
>>
>> Would it make sense to have a more generic decorator name so that it
>> could also be usable for other efficient vectorized formats in the
>> future? Or do we anticipate the decorator being format-specific, with
>> more added in the future?
>>
>> ------------------------------
>> *From:* Reynold Xin <r...@databricks.com>
>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>> *To:* Takuya UESHIN
>> *Cc:* spark-dev
>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>
>> Ok, thanks.
>>
>> +1 on the SPIP for scope etc
>>
>>
>> On API details (will deal with in code reviews as well but leaving a note
>> here in case I forget)
>>
>> 1. I would suggest having the API also accept the data type specification
>> in string form. It is usually simpler to say "long" than "LongType()".
>>
>> 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
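
A sketch of where such a check could live follows; the error wording here is only illustrative, not a proposed final message:

```python
import pandas as pd

# Sketch of the runtime check from point 2: if the UDF returns a Series
# whose length differs from the input batch, fail with a message that
# names both lengths so the mismatch is easy to diagnose.
def check_result_length(result: pd.Series, expected: int) -> pd.Series:
    if len(result) != expected:
        raise RuntimeError(
            "Result vector from pandas_udf was not the required length: "
            f"expected {expected}, got {len(result)}")
    return result
```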
>>
>>
>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
>> wrote:
>>
>>> Yes, the aggregation is out of scope for now.
>>> I think we should continue discussing the aggregation on JIRA, and we
>>> will add it later separately.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Is the idea that aggregates are out of scope for the current effort, and
>>>> we will be adding those later?
>>>>
>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been discussing supporting vectorized UDFs in Python, and we have
>>>>> almost reached a consensus on the APIs, so I'd like to summarize and
>>>>> call for a vote.
>>>>>
>>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>>> for vectorized UDAFs or Window operations.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>
>>>>>
>>>>> *Proposed API*
>>>>>
>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>> vectorized UDFs, which take either one or more pandas.Series, or a
>>>>> single integer value meaning the length of the input for 0-parameter
>>>>> UDFs. The return value should be a pandas.Series of the specified
>>>>> type, and its length should be the same as the input's.
>>>>>
>>>>> We can define vectorized UDFs as:
>>>>>
>>>>>   @pandas_udf(DoubleType())
>>>>>   def plus(v1, v2):
>>>>>       return v1 + v2
>>>>>
>>>>> or we can define it as:
>>>>>
>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>
>>>>> We can use it similarly to row-by-row UDFs:
>>>>>
>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>
>>>>> As for 0-parameter UDFs, we can define and use them as:
>>>>>
>>>>>   @pandas_udf(LongType())
>>>>>   def f0(size):
>>>>>       return pd.Series(1).repeat(size)
>>>>>
>>>>>   df.select(f0())
>>>>>
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following technical
>>>>> reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
