[ https://issues.apache.org/jira/browse/SPARK-52214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-52214: ---------------------------------- Description: Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow conversion, but: * This conversion is not stable, and gets broken from time to time -- [The Arrow 13 upgrade|https://github.com/apache/spark/pull/42920], pandas UDFs with data/time types are all broken; -- [Weird behavior|https://github.com/apache/spark/commit/9d88020f246ff4af1afba5a4023e139643fb3f54] when the dataset is empty * Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible [in certain narrow cases|https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions], e.g. StringType is not supported; * The support of complex type is not good, e.g. to support {{StructType}} series, we needs to use {{pd.DataFrame}} as a workaround; Arrow UDF is designed to resolve above issues. > Python Arrow UDF > ---------------- > > Key: SPARK-52214 > URL: https://issues.apache.org/jira/browse/SPARK-52214 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark > Affects Versions: 4.1.0 > Reporter: Ruifeng Zheng > Assignee: Ruifeng Zheng > Priority: Major > > Pandas UDFs (and Pandas Functions like MapInPandas) have a pandas <> arrow > conversion, but: > * This conversion is not stable, and gets broken from time to time > -- [The Arrow 13 upgrade|https://github.com/apache/spark/pull/42920], pandas > UDFs with data/time types are all broken; > -- [Weird > behavior|https://github.com/apache/spark/commit/9d88020f246ff4af1afba5a4023e139643fb3f54] > when the dataset is empty > * Pandas <> Arrow conversion is pretty expensive. Zero-copy is only possible > [in certain narrow > cases|https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions], > e.g. StringType is not supported; > * The support of complex type is not good, e.g. to support {{StructType}} > series, we needs to use {{pd.DataFrame}} as a workaround; > Arrow UDF is designed to resolve above issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org