[ https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086010#comment-16086010 ]
Bryan Cutler commented on SPARK-21404:
--------------------------------------
I'll submit the work I've done so far as a WIP PR. I'm open to discussing
whether this can serve as a first step toward an expanded API for vectorized
UDFs in SPARK-21190.
> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
> Key: SPARK-21404
> URL: https://issues.apache.org/jira/browse/SPARK-21404
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 2.3.0
> Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by passing the
> column data as pandas.Series. This should offer a performance gain by
> computing the return column data in one operation, instead of iterating over
> each row to calculate a single element and appending it to a list, as is
> currently done.
> This is designed as a preliminary step for the existing SPIP "Vectorized UDFs
> in Python" (SPARK-21190) and could be used as a basis for whatever expanded
> API is decided upon there.
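
To illustrate the difference described in the issue, here is a minimal sketch
in plain pandas. It does not use Spark or Arrow and is not the code from the
WIP PR; the helper names (eval_row_at_a_time, eval_vectorized, add_one_row,
add_one_vectorized) are hypothetical and exist only for this example. It
contrasts calling a function once per row and appending to a list with calling
a function once on the whole column as a pandas.Series.

```python
import pandas as pd


def add_one_row(x):
    # Row-at-a-time style: receives a single element.
    return x + 1


def add_one_vectorized(col):
    # Vectorized style: receives the whole column as a pandas.Series.
    return col + 1


def eval_row_at_a_time(func, values):
    # Hypothetical stand-in for per-row evaluation: iterate over each row,
    # compute a single element, and append it to a list.
    result = []
    for v in values:
        result.append(func(v))
    return pd.Series(result)


def eval_vectorized(func, values):
    # Hypothetical stand-in for vectorized evaluation: wrap the column data
    # in a pandas.Series and compute the return column in one operation.
    return func(pd.Series(values))


batch = [1, 2, 3, 4]
row_result = eval_row_at_a_time(add_one_row, batch)
vec_result = eval_vectorized(add_one_vectorized, batch)

# Both approaches produce the same column; only the evaluation style differs.
assert row_result.equals(vec_result)
```

In the sketch, the vectorized function is invoked once per batch rather than
once per element, which is where the expected performance gain would come
from.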