This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 2ab68d10e806 [SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
2ab68d10e806 is described below
commit 2ab68d10e8064fc6e7728677ff3a38c528e59e5d
Author: Amanda Liu <[email protected]>
AuthorDate: Wed Dec 31 09:27:05 2025 +0800
[SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
### What changes were proposed in this pull request?
Add documentation about Arrow-by-default enablement in Spark 4.2, for this page:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html
Also add an example showing how to opt out of Arrow optimization, at both the
per-UDF and per-session level.
### Why are the changes needed?
In Spark 4.2.0, Arrow optimization will be enabled for Python UD(T)Fs by
default (see
[SPARK-54555](https://issues.apache.org/jira/browse/SPARK-54555)). The docs
should be updated to note the change and include more code examples.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
Docs build tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53632 from asl3/pyspark-apiref-arrowudfdoc.
Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/sql/functions/builtin.py | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 04b800be2372..4632583e2459 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -28927,6 +28927,9 @@ def udf(
.. versionchanged:: 4.0.0
Supports keyword-arguments.
+ .. versionchanged:: 4.2.0
+ Uses Arrow by default for (de)serialization.
+
Parameters
----------
f : function, optional
@@ -29029,6 +29032,28 @@ def udf(
| 101|
+--------------------------------+
+ Arrow-optimized Python UDFs (default since Spark 4.2):
+
+ Since Spark 4.2, Arrow is used by default for (de)serialization between the JVM
+ and Python for regular Python UDFs.
+
+ Unlike the vectorized Arrow UDFs above that receive and return ``pyarrow.Array`` objects,
+ Arrow-optimized Python UDFs still process data row-by-row with regular Python types,
+ but use Arrow for more efficient data transfer in the (de)serialization process.
+
+ >>> # Arrow optimization is enabled by default since Spark 4.2
+ >>> @udf(returnType=IntegerType())
+ ... def my_udf(x):
+ ... return x + 1
+ ...
+ >>> # To explicitly disable Arrow optimization and use pickle-based serialization:
+ >>> @udf(returnType=IntegerType(), useArrow=False)
+ ... def legacy_udf(x):
+ ... return x + 1
+ ...
+ >>> # To disable Arrow optimization for the entire SparkSession:
+ >>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")  # doctest: +SKIP
+
See Also
--------
:meth:`pyspark.sql.functions.pandas_udf`
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]