[
https://issues.apache.org/jira/browse/SPARK-52214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun resolved SPARK-52214.
-----------------------------------
Fix Version/s: 4.1.0
Resolution: Fixed
> Python Arrow UDF
> ----------------
>
> Key: SPARK-52214
> URL: https://issues.apache.org/jira/browse/SPARK-52214
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 4.1.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow
> conversion, but:
> * This conversion is not stable, and gets broken from time to time
> -- [The Arrow 13 upgrade|https://github.com/apache/spark/pull/42920], pandas
> UDFs with data/time types are all broken;
> -- [Weird
> behavior|https://github.com/apache/spark/commit/9d88020f246ff4af1afba5a4023e139643fb3f54]
> when the dataset is empty
> * Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible
> [in certain narrow
> cases|https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions],
> e.g. StringType is not supported;
> * The support of complex type is not good, e.g. to support {{StructType}}
> series, we needs to use {{pd.DataFrame}} as a workaround;
> Arrow UDF is designed to resolve above issues.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]