[jira] [Resolved] (SPARK-21404) Simple Vectorized Python UDFs using Arrow

Bryan Cutler (JIRA) Fri, 22 Sep 2017 09:56:42 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bryan Cutler resolved SPARK-21404.
----------------------------------
    Resolution: Fixed

This has been merged as SPARK-21190

> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
>                 Key: SPARK-21404
>                 URL: https://issues.apache.org/jira/browse/SPARK-21404
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by passing the
> column data as pandas.Series. This offers a performance gain because the
> return column is computed in one operation, instead of iterating over each
> row to calculate a single element and appending it to a list, as is
> currently done. The existing Python UDF API, which specifies the return
> type, can be used to implement this. Since not all functions can be
> vectorized, there would need to be a way to enable this optimization, such
> as a SQLConf.
> This is designed as a preliminary step for the existing SPIP: Vectorized UDFs
> in Python (SPARK-21190) and could be used as a basis for whatever expanded
> API is decided upon there.
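
The difference between the two evaluation styles described above can be sketched in plain pandas. This is only an illustration of the idea, not Spark's implementation: it uses neither Spark nor Arrow, and the helper names (eval_row_at_a_time, eval_vectorized, plus_one_row, plus_one_vectorized) are hypothetical.

```python
import pandas as pd


def plus_one_row(x):
    """A row-at-a-time function: receives one element, returns one element."""
    return x + 1


def plus_one_vectorized(s: pd.Series) -> pd.Series:
    """A vectorized function: receives a whole column, returns a whole column."""
    return s + 1


def eval_row_at_a_time(func, column: pd.Series) -> pd.Series:
    # Hypothetical stand-in for the current approach: call the function
    # once per element and append each result to a list.
    results = []
    for value in column:
        results.append(func(value))
    return pd.Series(results)


def eval_vectorized(func, column: pd.Series) -> pd.Series:
    # Hypothetical stand-in for the proposed approach: hand the whole
    # column to the function and get the result column back in one call.
    return func(column)


column = pd.Series([1, 2, 3, 4])
row_result = eval_row_at_a_time(plus_one_row, column)
vec_result = eval_vectorized(plus_one_vectorized, column)

# Both paths produce the same column; only the number of Python-level
# function calls differs (one per row versus one per column).
assert row_result.equals(vec_result)
```

Note that plus_one_row would also work unchanged on a pandas.Series, but a function with per-element branching (for example, an if statement on the value) would not, which is why the description calls for an explicit way to opt in.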



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

