xinrong-meng opened a new pull request, #46857:
URL: https://github.com/apache/spark/pull/46857

   ### What changes were proposed in this pull request?
   Turn on Arrow optimization for Python UDFs by default.
   
   ### Why are the changes needed?
   Arrow-optimized Python UDFs were introduced in Spark 3.5 and offer major advantages over the current default, pickled Python UDFs:
   
   **Faster (De)serialization**
   Apache Arrow is a columnar in-memory data format that provides efficient 
data interchange between different systems and programming languages. Unlike 
Pickle, which serializes an entire Row as an object, Arrow stores data in a 
column-oriented format, allowing for better compression and memory locality, 
which is more suitable for analytical workloads.
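
   To make the row-wise vs. column-wise contrast concrete, here is a minimal, stdlib-only sketch. It is **not** Spark's internals: it only illustrates that the pickled path crosses the JVM/Python boundary once per row, while a columnar path ships a whole batch per column and drives the UDF over the batch. All names here (`rows`, `f`, `col0`) are illustrative.

```py
# Illustrative sketch (not Spark internals): pickled UDFs serialize and
# evaluate one Row at a time; an Arrow-style columnar path moves whole
# column batches across the boundary instead.
import pickle

rows = [(1, "a"), (2, "b"), (3, "c")]
f = lambda x: x * 10  # the "UDF" applied to column 0

# Pickled path: each row is pickled/unpickled individually.
pickled_results = [f(pickle.loads(pickle.dumps(r))[0]) for r in rows]

# Columnar path: column 0 is extracted as one contiguous batch, and the
# UDF is driven over the batch in a single round trip.
col0 = [r[0] for r in rows]
batch_results = [f(v) for v in col0]

print(pickled_results == batch_results)  # same results, fewer round trips
```

   Both paths produce identical results; the difference is how many (de)serialization round trips are needed, which is where Arrow's columnar batches win on analytical workloads.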
   The chart below shows the performance of an Arrow-optimized Python UDF performing a single transformation on input datasets of different sizes. The cluster consists of 3 workers and 1 driver, and each machine has 16 vCPUs and 122 GiB of memory. The Arrow-optimized Python UDF is **~1.6 times** faster than the pickled Python UDF.
   
![image](https://github.com/apache/spark/assets/47337188/b585313d-54cd-49b9-8f54-a2581b1bc909)
   
   
   **Standardized Type Coercion**
   UDF type coercion poses challenges when the Python values returned by the UDF do not align with the user-specified return type. Unfortunately, the default, pickled Python UDF's type coercion has certain limitations: it relies on None as a fallback for type mismatches, leading to potential ambiguity and data loss, and converting date, datetime, and tuple values to strings can yield ambiguous results. Arrow-optimized Python UDFs address these issues by leveraging Arrow's well-defined rules for type coercion. An example is shown below:
   ```py
   >>> import datetime
   >>> from pyspark.sql.functions import udf
   >>> df = spark.createDataFrame([datetime.date(1970, 1, 1), datetime.date(1970, 1, 2)], schema='date')
   >>> df.select(udf(lambda x: x, 'string', useArrow=True)('value').alias('date_in_string')).show()
   +--------------+
   |date_in_string|
   +--------------+
   |    1970-01-01|
   |    1970-01-02|
   +--------------+
   >>> df.select(udf(lambda x: x, 'string', useArrow=False)('value').alias('date_in_string')).show()
   +-----------------------------------------------------------------------+
   |date_in_string                                                         |
   +-----------------------------------------------------------------------+
   |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
   |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
   +-----------------------------------------------------------------------+
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No in most cases: Arrow optimization for Python UDFs is turned on by default, and results will be identical. However, there are some corner cases where the type coercion rules differ between Arrow and Pickle; these will be addressed in a follow-up PR to guide users through a smooth migration.
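
   For users who hit such a corner case, the optimization can be reverted either session-wide via the SQL conf `spark.sql.execution.pythonUDF.arrow.enabled` (the conf governing this feature) or per UDF via the `useArrow` argument, as in the example above. A config sketch (assumes an active `spark` session):

   ```py
   # Session-wide: fall back to pickled Python UDFs.
   spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")

   # Per UDF: useArrow overrides the conf for this UDF only.
   legacy_udf = udf(lambda x: x, 'string', useArrow=False)
   ```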
   
   See 
[here](https://www.databricks.com/wp-content/uploads/notebooks/python-udf-type-coercion.html)
 for a comprehensive comparison between Pickled Python UDFs and Arrow-optimized 
Python UDFs regarding type coercion.
   
   ### How was this patch tested?
   Existing tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

