This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 2ab68d10e806 [SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
2ab68d10e806 is described below
commit 2ab68d10e8064fc6e7728677ff3a38c528e59e5d
Author: Amanda Liu <[email protected]>
AuthorDate: Wed Dec 31 09:27:05 2025 +0800
[SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
### What changes were proposed in this pull request?
Add documentation about Arrow-by-default enablement in Spark 4.2, for this page:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html
Also add an example showing how to opt out of Arrow optimization, at both the
per-UDF and per-session level.
### Why are the changes needed?
In Spark 4.2.0, Arrow optimization will be enabled for Python UD(T)Fs by
default (see
[SPARK-54555](https://issues.apache.org/jira/browse/SPARK-54555)). The docs
should be updated to note the change and include more code examples.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
Docs build tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53632 from asl3/pyspark-apiref-arrowudfdoc.
Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/sql/functions/builtin.py | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 04b800be2372..4632583e2459 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -28927,6 +28927,9 @@ def udf(
.. versionchanged:: 4.0.0
Supports keyword-arguments.
+ .. versionchanged:: 4.2.0
+ Uses Arrow by default for (de)serialization.
+
Parameters
----------
f : function, optional
@@ -29029,6 +29032,28 @@ def udf(
| 101|
+--------------------------------+
+ Arrow-optimized Python UDFs (default since Spark 4.2):
+
+ Since Spark 4.2, Arrow is used by default for (de)serialization between the JVM
+ and Python for regular Python UDFs.
+
+ Unlike the vectorized Arrow UDFs above that receive and return ``pyarrow.Array`` objects,
+ Arrow-optimized Python UDFs still process data row-by-row with regular Python types,
+ but use Arrow for more efficient data transfer in the (de)serialization process.
+
+ >>> # Arrow optimization is enabled by default since Spark 4.2
+ >>> @udf(returnType=IntegerType())
+ ... def my_udf(x):
+ ... return x + 1
+ ...
+ >>> # To explicitly disable Arrow optimization and use pickle-based serialization:
+ >>> @udf(returnType=IntegerType(), useArrow=False)
+ ... def legacy_udf(x):
+ ... return x + 1
+ ...
+ >>> # To disable Arrow optimization for the entire SparkSession:
+ >>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")  # doctest: +SKIP
+
See Also
--------
:meth:`pyspark.sql.functions.pandas_udf`
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]