[
https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Cutler resolved SPARK-21404.
----------------------------------
Resolution: Fixed
This has been merged as SPARK-21190.
> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
> Key: SPARK-21404
> URL: https://issues.apache.org/jira/browse/SPARK-21404
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by passing the
> column data as pandas.Series. This offers a performance gain by computing
> the return column data in one operation, instead of iterating over each row
> to calculate a single element and appending it to a list, as is currently
> done. The existing Python UDF API, which specifies the return type, can be
> used to implement this. Since not all functions can be vectorized, there
> would need to be a way to enable this optimization, such as a SQLConf.
> This is designed as a preliminary step for the existing SPIP "Vectorized
> UDFs in Python" (SPARK-21190), and could be used as a basis for whatever
> expanded API is decided upon there.
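
For readers unfamiliar with the distinction drawn in the description above, here is a minimal sketch of the two evaluation strategies in plain pandas. It does not use Spark or Arrow and is not the merged implementation. The names plus_one, eval_row_at_a_time and eval_vectorized are hypothetical, chosen only for illustration.

```python
import pandas as pd


def plus_one(x):
    # Hypothetical user function. It works on a scalar or on a
    # pandas.Series, because "+" is defined for both.
    return x + 1


def eval_row_at_a_time(func, column):
    # Row-at-a-time evaluation: call the function once per element
    # and append each result to a list.
    result = []
    for value in column:
        result.append(func(value))
    return pd.Series(result)


def eval_vectorized(func, column):
    # Vectorized evaluation: call the function once on the whole column.
    return func(column)


column = pd.Series([1, 2, 3])
row_result = eval_row_at_a_time(plus_one, column)
vec_result = eval_vectorized(plus_one, column)

print(row_result.tolist())
print(vec_result.tolist())
```

Both paths give the same values here, but the vectorized path makes a single function call per column batch. A function that branches on individual values (for example, "if x > 0") would fail when handed a whole Series, which is the kind of case that motivates making the optimization opt-in.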