(spark) branch master updated: [SPARK-50752][SQL][PYSPARK][FOLLOWUP] Respect PYTHON_UDF_MAX_RECORDS_PER_BATCH in Python worker

kabhwan Sun, 08 Jun 2025 22:37:45 -0700

This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new b969ffe5d816 [SPARK-50752][SQL][PYSPARK][FOLLOWUP] Respect PYTHON_UDF_MAX_RECORDS_PER_BATCH in Python worker
b969ffe5d816 is described below

commit b969ffe5d81616151efad0bfef2a345aa51f988b
Author: Jungtaek Lim <[email protected]>
AuthorDate: Mon Jun 9 14:37:14 2025 +0900

    [SPARK-50752][SQL][PYSPARK][FOLLOWUP] Respect PYTHON_UDF_MAX_RECORDS_PER_BATCH in Python worker
    
    ### What changes were proposed in this pull request?
    
    This PR is a follow-up to SPARK-50752 (PR #49397), which added PYTHON_UDF_MAX_RECORDS_PER_BATCH to SQLConf to control the batch size of regular Python UDFs.
    
    This PR makes the config take effect on the Python worker side as well.
    
    ### Why are the changes needed?
    
    The original PR made the batch size configurable on the JVM side, but the value was not propagated to the Python worker because the override for Python UDFs was missing. (Without the override, the value is a static default of 100.)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but it only affects tuning and does not change the output.
    
    ### How was this patch tested?
    
    No automated test was added: there is no good way to test this, because the missing override does not change the output.
    
    This was manually tested with artificial debug logging.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #51124 from HeartSaVioR/SPARK-50752-FOLLOWUP-respect-PYTHON_UDF_MAX_RECORDS_PER_BATCH-in-python-worker.
    
    Authored-by: Jungtaek Lim <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
---
 .../scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala   | 2 ++
 1 file changed, 2 insertions(+)
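
The commit message describes a base-class default that was never overridden for Python UDFs. A minimal sketch of that pattern, under stated assumptions: BaseRunner, RunnerBeforeFix, RunnerAfterFix, the conf map and its key are hypothetical stand-ins written for illustration, not the real Spark classes or config key. Only the field name batchSizeForPythonUDF and the default of 100 come from the commit itself.

```scala
// Hypothetical, simplified stand-ins; this is not Spark code.
object BatchSizeSketch {
  // Stand-in for SQLConf: a plain map holding a user-tuned value.
  // The key name is invented for this sketch.
  val conf: Map[String, Int] = Map("pythonUdf.maxRecordsPerBatch" -> 500)

  // Stand-in for the generic runner that carries the static default.
  abstract class BaseRunner {
    val batchSizeForPythonUDF: Int = 100
    // The value that would be handed to the Python worker.
    def valueSentToWorker: Int = batchSizeForPythonUDF
  }

  // Before the fix: no override, so the worker sees the default.
  class RunnerBeforeFix extends BaseRunner

  // After the fix: the override reads the configured value.
  class RunnerAfterFix extends BaseRunner {
    override val batchSizeForPythonUDF: Int =
      conf.getOrElse("pythonUdf.maxRecordsPerBatch", 100)
  }

  def main(args: Array[String]): Unit = {
    println(new RunnerBeforeFix().valueSentToWorker) // 100
    println(new RunnerAfterFix().valueSentToWorker)  // 500
  }
}
```

In this sketch both runners produce the same results for any input; only the batch size handed to the worker differs, which is consistent with the commit message's note that the missing override does not change the output.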

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala
index 11773f92c482..4baddcd4d9e7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala
@@ -50,6 +50,8 @@ abstract class BasePythonUDFRunner(
   override val killOnIdleTimeout: Boolean = SQLConf.get.pythonUDFWorkerKillOnIdleTimeout
 
   override val bufferSize: Int = SQLConf.get.getConf(SQLConf.PYTHON_UDF_BUFFER_SIZE)
+  override val batchSizeForPythonUDF: Int =
+    SQLConf.get.getConf(SQLConf.PYTHON_UDF_MAX_RECORDS_PER_BATCH)
 
   protected def writeUDF(dataOut: DataOutputStream): Unit
 


