gaogaotiantian commented on code in PR #53769:
URL: https://github.com/apache/spark/pull/53769#discussion_r2710893016
##########
python/pyspark/worker.py:
##########
@@ -412,14 +530,17 @@ def get_args(*args: pd.Series):
             return zip(*args)

         if runner_conf.arrow_concurrency_level > 0:
-            from concurrent.futures import ThreadPoolExecutor

             @fail_on_stopiteration
             def evaluate(*args: pd.Series) -> pd.Series:
-                with ThreadPoolExecutor(max_workers=runner_conf.arrow_concurrency_level) as pool:
-                    return pd.Series(
-                        list(pool.map(lambda row: result_func(func(*row)), get_args(*args)))
Review Comment:
I asked the same question in
https://github.com/apache/spark/pull/53769#discussion_r2683544999. Now that I
think about it, I don't believe we need to batch at all. Batch size matters
for the process worker because inter-process communication is expensive;
communication with a thread worker is trivial, and batching itself adds
overhead.
I believe we need benchmark proof before doing extra work on batching - we
need to see a significant performance improvement for at least some cases.
Otherwise we should just use a simple `map`, as in the sketch below.
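To be concrete, here is a minimal sketch of the unbatched variant I have in
mind. `runner_conf`, `func`, `result_func`, and `get_args` are stubbed out so
the snippet runs standalone; in worker.py they would be the names already in
the enclosing scope:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


# Hypothetical stand-ins so this sketch is self-contained; in worker.py these
# names come from the surrounding scope, not from definitions like these.
class _Conf:
    arrow_concurrency_level = 4


runner_conf = _Conf()
func = lambda x: x * 2        # placeholder for the user's UDF
result_func = lambda x: x     # placeholder for result post-processing
get_args = lambda *args: zip(*args)


def evaluate(*args: pd.Series) -> pd.Series:
    # A plain pool.map over rows with no batching layer: handing a row to a
    # thread is cheap, so chunking rows would mostly add overhead.
    with ThreadPoolExecutor(max_workers=runner_conf.arrow_concurrency_level) as pool:
        return pd.Series(
            list(pool.map(lambda row: result_func(func(*row)), get_args(*args)))
        )


print(evaluate(pd.Series([1, 2, 3])).tolist())  # [2, 4, 6]
```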