[
https://issues.apache.org/jira/browse/SPARK-52214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036467#comment-18036467
]
Dongjoon Hyun commented on SPARK-52214:
---------------------------------------
Although I marked it resolved for now because it seems that we are adding test
coverages mostly. We can add more improvement subtasks until the mid of
November and bug fixes before starting RC. Please let me know if there is
anything you want me to know, [~podongfeng].
> Python Arrow UDF
> ----------------
>
> Key: SPARK-52214
> URL: https://issues.apache.org/jira/browse/SPARK-52214
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 4.1.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow
> conversion, but:
> * This conversion is not stable, and gets broken from time to time
> -- [The Arrow 13 upgrade|https://github.com/apache/spark/pull/42920], pandas
> UDFs with data/time types are all broken;
> -- [Weird
> behavior|https://github.com/apache/spark/commit/9d88020f246ff4af1afba5a4023e139643fb3f54]
> when the dataset is empty
> * Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible
> [in certain narrow
> cases|https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions],
> e.g. StringType is not supported;
> * The support of complex type is not good, e.g. to support {{StructType}}
> series, we needs to use {{pd.DataFrame}} as a workaround;
> Arrow UDF is designed to resolve above issues.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]