[ 
https://issues.apache.org/jira/browse/SPARK-55473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55473:
-----------------------------------
    Labels: pull-request-available  (was: )

> Optimize iterator handling in grouped map pandas UDF with state to reduce 
> buffering overhead
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-55473
>                 URL: https://issues.apache.org/jira/browse/SPARK-55473
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
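> The mapper function for SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE in worker.py 
> splits its input iterator with itertools.tee, which may introduce buffering 
> overhead: tee holds on to items that one of the returned iterators has 
> consumed until the other iterator has consumed them too. Replace it by 
> consuming the first element and using a generator to "put it back", so that 
> the values are still iterated lazily.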
> Current code:
> {code:python}
> from itertools import tee
> keys_gen, values_gen = tee(data_gen) # this tee buffers data
> keys_elem = next(keys_gen)
> keys = [keys_elem[o] for o in parsed_offsets[0][0]]
> vals = ([x[o] for o in parsed_offsets[0][1]] for x in values_gen)
> {code}
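The replacement described above could look roughly like the sketch below. It is not the actual patch: the helper name split_keys_and_values and its key_offsets / value_offsets parameters are hypothetical stand-ins for parsed_offsets[0][0] and parsed_offsets[0][1], and the rows are plain tuples rather than the batches that worker.py handles.

```python
def split_keys_and_values(data_gen, key_offsets, value_offsets):
    """Return (keys, lazy values) without itertools.tee.

    Hypothetical helper for illustration only; the names are assumptions.
    """
    data_iter = iter(data_gen)

    # Consume only the first element to read the grouping keys.
    # As with next() on a tee branch, empty input raises StopIteration here.
    first = next(data_iter)
    keys = [first[o] for o in key_offsets]

    def put_back():
        # Re-emit the element that was already consumed, then continue
        # with the rest of the original iterator. Nothing else is buffered.
        yield first
        yield from data_iter

    # The values stay lazy: each row is projected only when it is requested.
    vals = ([x[o] for o in value_offsets] for x in put_back())
    return keys, vals


# Usage with plain tuples standing in for the real input rows.
rows = iter([("k", 1, 2), ("k", 3, 4), ("k", 5, 6)])
keys, vals = split_keys_and_values(rows, key_offsets=[0], value_offsets=[1, 2])
first_values = next(vals)   # projects the first row, which was "put back"
remaining = list(vals)      # pulls the remaining rows from the source
```

Only one element is held outside the source iterator at any time, so memory use does not grow with the number of rows in the group.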



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
