Re: [PR] [SPARK-56642][SQL] Add pipelined JVM-Python UDF data transfer [spark]

via GitHub Tue, 05 May 2026 07:28:24 -0700


dongjoon-hyun commented on code in PR #55552:
URL: https://github.com/apache/spark/pull/55552#discussion_r3189220178



##########
core/src/main/scala/org/apache/spark/internal/config/Python.scala:
##########
@@ -150,4 +150,28 @@ private[spark] object Python {
       .version("4.1.0")
       .booleanConf
       .createWithDefault(true)
+
+  val PYTHON_UDF_PIPELINED_EXECUTION =
+    ConfigBuilder("spark.python.udf.pipelined.enabled")
+      .doc("When true, enables pipelined (asynchronous) data transfer between 
JVM and Python " +
+        "UDF workers. In pipelined mode, input serialization runs in a 
separate writer thread " +
+        "while the main task thread reads output from the Python worker, 
allowing the two " +
+        "directions to overlap for improved throughput. " +
+        "This is particularly beneficial for compute-heavy UDFs (e.g., ML 
inference).")
+      .version("4.2.0")

Review Comment:
   `4.3.0`?



##########
core/src/main/scala/org/apache/spark/internal/config/Python.scala:
##########
@@ -150,4 +150,28 @@ private[spark] object Python {
       .version("4.1.0")
       .booleanConf
       .createWithDefault(true)
+
+  val PYTHON_UDF_PIPELINED_EXECUTION =
+    ConfigBuilder("spark.python.udf.pipelined.enabled")
+      .doc("When true, enables pipelined (asynchronous) data transfer between 
JVM and Python " +
+        "UDF workers. In pipelined mode, input serialization runs in a 
separate writer thread " +
+        "while the main task thread reads output from the Python worker, 
allowing the two " +
+        "directions to overlap for improved throughput. " +
+        "This is particularly beneficial for compute-heavy UDFs (e.g., ML 
inference).")
+      .version("4.2.0")
+      .withBindingPolicy(ConfigBindingPolicy.NOT_APPLICABLE)
+      .booleanConf
+      .createWithDefault(false)
+
+  val PYTHON_UDF_PIPELINED_QUEUE_DEPTH =
+    ConfigBuilder("spark.python.udf.pipelined.queueDepth")
+      .doc("The maximum number of input batches the Python worker's background 
reader thread " +
+        "can pre-fetch ahead of UDF computation. A higher value allows more 
overlap between " +
+        "input reading and UDF processing, at the cost of increased memory 
usage. " +
+        "Only effective when spark.python.udf.pipelined.enabled is true.")
+      .version("4.2.0")

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56642][SQL] Add pipelined JVM-Python UDF data transfer [spark]

Reply via email to