Yicong Huang created SPARK-55473:
------------------------------------

             Summary: Optimize iterator handling in grouped map pandas UDF with 
state to reduce buffering overhead
                 Key: SPARK-55473
                 URL: https://issues.apache.org/jira/browse/SPARK-55473
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


The mapper function for SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE in worker.py uses 
itertools.tee which may introduce buffering overhead. Replace it with consuming 
the first element and using a generator to "put it back" for lazy iteration.

Current code:
{code:python}
from itertools import tee

keys_gen, values_gen = tee(data_gen) # this tee buffers data
keys_elem = next(keys_gen)
keys = [keys_elem[o] for o in parsed_offsets[0][0]]
vals = ([x[o] for o in parsed_offsets[0][1]] for x in values_gen)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to