gaogaotiantian commented on code in PR #53769:
URL: https://github.com/apache/spark/pull/53769#discussion_r2710893016
##########
python/pyspark/worker.py:
##########
@@ -412,14 +530,17 @@ def get_args(*args: pd.Series):
             return zip(*args)

         if runner_conf.arrow_concurrency_level > 0:
-            from concurrent.futures import ThreadPoolExecutor

             @fail_on_stopiteration
             def evaluate(*args: pd.Series) -> pd.Series:
-                with ThreadPoolExecutor(max_workers=runner_conf.arrow_concurrency_level) as pool:
-                    return pd.Series(
-                        list(pool.map(lambda row: result_func(func(*row)), get_args(*args)))
Review Comment:
I asked the same question in
https://github.com/apache/spark/pull/53769#discussion_r2683544999. Now that I
think about it, I don't believe we need to batch at all. Batch size matters
for the process worker because inter-process communication is expensive;
communication with a thread worker is trivial, and batching itself adds
overhead.
I believe we need benchmark proof before doing extra work on batching - we
need to see a significant performance improvement for at least some cases.
Otherwise we should just use a simple `map`, as in the sketch below.
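To be concrete, here is a minimal sketch of the unbatched variant I have in
mind. `runner_conf`, `func`, `result_func`, and `get_args` are stubbed out so
the snippet runs standalone; in worker.py they would be the names already in
the enclosing scope:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


# Hypothetical stand-ins so this sketch is self-contained; in worker.py these
# names come from the surrounding scope, not from definitions like these.
class _Conf:
    arrow_concurrency_level = 4


runner_conf = _Conf()
func = lambda x: x * 2        # placeholder for the user's UDF
result_func = lambda x: x     # placeholder for result post-processing
get_args = lambda *args: zip(*args)


def evaluate(*args: pd.Series) -> pd.Series:
    # A plain pool.map over rows with no batching layer: handing a row to a
    # thread is cheap, so chunking rows would mostly add overhead.
    with ThreadPoolExecutor(max_workers=runner_conf.arrow_concurrency_level) as pool:
        return pd.Series(
            list(pool.map(lambda row: result_func(func(*row)), get_args(*args)))
        )


print(evaluate(pd.Series([1, 2, 3])).tolist())  # [2, 4, 6]
```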