[jira] [Commented] (SPARK-21404) Simple Vectorized Python UDFs using Arrow

Bryan Cutler (JIRA) Thu, 13 Jul 2017 10:09:14 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086010#comment-16086010
 ]


Bryan Cutler commented on SPARK-21404:
--------------------------------------

I'll submit the work I've done so far as a WIP PR, and I'm open to discussion 
on using this as a first step toward an expanded API for vectorized UDFs in 
SPARK-21190.

> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
>                 Key: SPARK-21404
>                 URL: https://issues.apache.org/jira/browse/SPARK-21404
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by taking the 
> column data as pandas.Series. This offers a performance gain by 
> computing the return column data in one operation instead of iterating over 
> each row to calculate a single element and appending it to a list, as is 
> currently done.
> This is designed as a preliminary step for the existing SPIP: Vectorized UDFs 
> in Python (SPARK-21190), and it could be used as a basis for whatever expanded 
> API is decided upon there.
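
The difference between the two evaluation styles described above can be sketched in plain pandas. This is only an illustration, not the Spark implementation: the function names (plus_one_row, plus_one_vectorized, eval_row_at_a_time, eval_vectorized) are hypothetical, and the Arrow transfer between the JVM and the Python worker is left out entirely.

```python
import pandas as pd


def plus_one_row(x):
    # Row-at-a-time UDF (hypothetical): called once per element.
    return x + 1


def plus_one_vectorized(s):
    # Vectorized UDF (hypothetical): receives a whole column as a pandas.Series.
    return s + 1


def eval_row_at_a_time(udf, column):
    # Iterate over each row, compute a single element, append it to a list.
    results = []
    for value in column:
        results.append(udf(value))
    return results


def eval_vectorized(udf, column):
    # Compute the entire return column in one call on a pandas.Series.
    return udf(pd.Series(column))


column = [1, 2, 3, 4]
row_result = eval_row_at_a_time(plus_one_row, column)
vec_result = eval_vectorized(plus_one_vectorized, column)

# Both styles produce the same values; only the evaluation strategy differs.
assert row_result == vec_result.tolist()
```

The vectorized version makes one Python-level call per column batch rather than one per row, which is where the expected performance gain would come from.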




