Yicong Huang created SPARK-55473:
------------------------------------
Summary: Optimize iterator handling in grouped map pandas UDF with
state to reduce buffering overhead
Key: SPARK-55473
URL: https://issues.apache.org/jira/browse/SPARK-55473
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
The mapper function for SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE in worker.py uses
itertools.tee which may introduce buffering overhead. Replace it with consuming
the first element and using a generator to "put it back" for lazy iteration.
Current code:
{code:python}
from itertools import tee
keys_gen, values_gen = tee(data_gen) # this tee buffers data
keys_elem = next(keys_gen)
keys = [keys_elem[o] for o in parsed_offsets[0][0]]
vals = ([x[o] for o in parsed_offsets[0][1]] for x in values_gen)
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]