xinrong-meng opened a new pull request, #46857: URL: https://github.com/apache/spark/pull/46857
### What changes were proposed in this pull request?
Turn on Arrow optimization for Python UDFs by default.

### Why are the changes needed?
Arrow-optimized Python UDFs were introduced in Spark 3.5 and offer major advantages over the current default, pickled Python UDFs:

**Faster (De)serialization**

Apache Arrow is a columnar in-memory data format that provides efficient data interchange between different systems and programming languages. Unlike Pickle, which serializes an entire Row as an object, Arrow stores data in a column-oriented format, allowing for better compression and memory locality, which is better suited to analytical workloads.

The chart below shows the performance of an Arrow-optimized Python UDF performing a single transformation over input datasets of different sizes. The cluster consists of 3 workers and 1 driver, and each machine has 16 vCPUs and 122 GiB of memory. The Arrow-optimized Python UDF is **~1.6 times** faster than the pickled Python UDF.

**Standardized Type Coercion**

UDF type coercion poses challenges when the Python values returned by the UDF do not align with the user-specified return type. Unfortunately, the default, pickled Python UDF's type coercion has certain limitations, such as relying on None as a fallback for type mismatches, which can lead to ambiguity and data loss. Additionally, converting date, datetime, and tuple values to strings can yield ambiguous results. Arrow-optimized Python UDFs address these issues by leveraging Arrow's well-defined set of type-coercion rules.
An example is shown below:

```py
>>> df = spark.createDataFrame([datetime.date(1970, 1, 1), datetime.date(1970, 1, 2)], schema='date')
>>> df.select(udf(lambda x: x, 'string', useArrow=True)('value').alias('date_in_string')).show()
+--------------+
|date_in_string|
+--------------+
|    1970-01-01|
|    1970-01-02|
+--------------+

>>> df.select(udf(lambda x: x, 'string', useArrow=False)('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
|date_in_string                                                         |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?
In most cases, no: Arrow optimization for Python UDFs is turned on by default, and results are identical. However, there are some corner cases where the type-coercion rules differ between Arrow and Pickle; these will be addressed in a follow-up PR to guide users through a smooth migration. See [here](https://www.databricks.com/wp-content/uploads/notebooks/python-udf-type-coercion.html) for a comprehensive comparison between pickled Python UDFs and Arrow-optimized Python UDFs regarding type coercion.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.
