[jira] [Updated] (SPARK-21404) Simple Vectorized Python UDFs using Arrow

Bryan Cutler (JIRA) Thu, 13 Jul 2017 14:28:17 -0700

     [ https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bryan Cutler updated SPARK-21404:
---------------------------------
    Description: 
Using Arrow, Python UDFs can be evaluated in vectorized form by passing the 
column data as pandas.Series.  This offers a performance gain because the 
return column is computed in one operation, instead of iterating over each row 
to calculate a single element and appending it to a list, as is currently 
done.  The existing Python UDF API, which specifies the return type, can be 
used to implement this.  Since not all functions can be vectorized, there 
would need to be a way to enable this optimization, such as a SQLConf.
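
A minimal sketch of the two evaluation strategies, in plain pandas and outside 
Spark, may help make the difference concrete.  Everything named here is an 
assumption for illustration only: the flag VECTORIZED_UDF_ENABLED stands in for 
the proposed SQLConf (the issue does not name a key), and evaluate() is a 
hypothetical helper, not part of PySpark.

```python
import pandas as pd

# Hypothetical flag standing in for the SQLConf proposed in the issue;
# the real configuration key is not specified there.
VECTORIZED_UDF_ENABLED = True


def plus_one_row(x):
    """Row-at-a-time UDF: called once per element."""
    return x + 1


def plus_one_vectorized(col):
    """Vectorized UDF: takes a whole pandas.Series and returns a Series."""
    return col + 1


def evaluate(column, row_func, vec_func, vectorized=VECTORIZED_UDF_ENABLED):
    """Hypothetical helper: evaluate a UDF over one column with either strategy."""
    if vectorized:
        # One call computes the entire return column.
        return vec_func(column)
    # Row-by-row path: compute a single element per iteration and
    # append it to a list.
    results = []
    for value in column:
        results.append(row_func(value))
    return pd.Series(results, index=column.index)


column = pd.Series([1, 2, 3])
row_result = evaluate(column, plus_one_row, plus_one_vectorized, vectorized=False)
vec_result = evaluate(column, plus_one_row, plus_one_vectorized, vectorized=True)
```

Both paths yield the same values; only the number of Python-level calls 
differs.  A function that cannot operate on a whole Series (for example, one 
that branches on a scalar value) would have to stay on the row-by-row path, 
which is why the optimization would need an explicit switch.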

This is designed as a preliminary step for the existing SPIP: Vectorized UDFs 
in Python SPARK-21190 that could be used as a basis for whatever expanded API 
is decided upon there.

  was:
Using Arrow, Python UDFs can be evaluated in vectorized form by using the 
column data as Pandas.Series.  This will offer a performance gain by computing 
the return column data in one operation instead of iterating over each row to 
calculate a single element and appending to a list, as is currently done.

This is designed as a preliminary step for the existing SPIP: Vectorized UDFs 
in Python SPARK-21190 that could be used as a basis for whatever expanded API 
is decided upon there.


> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
>                 Key: SPARK-21404
>                 URL: https://issues.apache.org/jira/browse/SPARK-21404
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by passing the 
> column data as pandas.Series.  This offers a performance gain because the 
> return column is computed in one operation, instead of iterating over each 
> row to calculate a single element and appending it to a list, as is 
> currently done.  The existing Python UDF API, which specifies the return 
> type, can be used to implement this.  Since not all functions can be 
> vectorized, there would need to be a way to enable this optimization, such 
> as a SQLConf.
> This is designed as a preliminary step for the existing SPIP: Vectorized UDFs 
> in Python SPARK-21190 that could be used as a basis for whatever expanded API 
> is decided upon there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

