[
https://issues.apache.org/jira/browse/SPARK-55473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon reassigned SPARK-55473:
------------------------------------
Assignee: Yicong Huang
> Optimize iterator handling in grouped map pandas UDF with state to reduce
> buffering overhead
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-55473
> URL: https://issues.apache.org/jira/browse/SPARK-55473
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Assignee: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> The mapper function for SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE in worker.py
> uses itertools.tee which may introduce buffering overhead. Replace it with
> consuming the first element and using a generator to "put it back" for lazy
> iteration.
> Current code:
> {code:python}
> from itertools import tee
> keys_gen, values_gen = tee(data_gen) # this tee buffers data
> keys_elem = next(keys_gen)
> keys = [keys_elem[o] for o in parsed_offsets[0][0]]
> vals = ([x[o] for o in parsed_offsets[0][1]] for x in values_gen)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]