xinrong-meng opened a new pull request, #39384:
URL: https://github.com/apache/spark/pull/39384

   ### What changes were proposed in this pull request?
   Introduce Arrow-optimized Python UDFs.
   
   There are two ways to enable/disable the Arrow optimization for Python UDFs:
   
   - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, 
disabled by default.
   - the `useArrow` parameter of the `udf` function, `None` by default.
   
   The Spark configuration takes effect only when `useArrow` is `None`; otherwise, `useArrow` determines whether the user-defined function is Arrow-optimized.
   
   We introduce both so that users get convenient per-Spark-session control as well as finer-grained per-UDF control over the Arrow optimization for Python UDFs, as shown in the sketch below.
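   As an illustration, here is a minimal sketch of the two controls (assuming a running SparkSession bound to `spark`; the UDF names are hypothetical):

```python
from pyspark.sql.functions import udf

# Per-session control: applies to every Python UDF whose useArrow
# parameter is left as None. Disabled by default.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# Per-UDF control: useArrow overrides the session configuration.
@udf(returnType="int", useArrow=True)   # always Arrow-optimized
def add_one(x):
    return x + 1

@udf(returnType="int", useArrow=False)  # never Arrow-optimized
def add_two(x):
    return x + 2
```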
   
   ### Why are the changes needed?
   A Python user-defined function (UDF) enables users to run arbitrary code against PySpark columns. It uses Pickle for (de)serialization and executes row by row.
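   For context, a minimal sketch of a plain pickled Python UDF (assuming a SparkSession bound to `spark`; the names are illustrative):

```python
from pyspark.sql.functions import udf

@udf(returnType="string")
def to_upper(s):
    # Runs row by row in the Python worker; each value is pickled
    # on its way between the JVM and the Python subprocess.
    return s.upper() if s is not None else None

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.select(to_upper("word")).show()
```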
   
   One major performance bottleneck of Python UDFs is (de)serialization, that is, the data exchange between the worker JVM and the spawned Python subprocess that actually executes the UDF.
   
   This PR proposes a better alternative for the (de)serialization: Arrow, which is already used for (de)serialization in Pandas UDFs, as sketched below.
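   For comparison, a minimal sketch of a Pandas UDF, which already exchanges data with the Python worker via Arrow and operates on batches rather than rows:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper_batched(s: pd.Series) -> pd.Series:
    # Receives a whole batch as a pandas Series, (de)serialized via Arrow.
    return s.str.upper()
```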
   
   ### Does this PR introduce _any_ user-facing change?
   No, since the Arrow optimization for Python UDFs is disabled by default.
   
   
   ### How was this patch tested?
   Unit tests.

