Yicong-Huang opened a new pull request, #54250: URL: https://github.com/apache/spark/pull/54250
### What changes were proposed in this pull request?

Replace `itertools.tee` with `itertools.chain` in the `applyInPandasWithState` mapper function to eliminate unnecessary buffering overhead.

The previous implementation used `tee` to peek at the first element for key extraction while maintaining lazy evaluation. However, `tee` maintains an internal buffer to support independent iteration, which introduces memory overhead since we only consume the first element from one iterator and then discard it.

The new implementation:
- Directly consumes the first element to extract keys
- Uses `itertools.chain` to prepend the first element back onto the values generator
- Maintains lazy evaluation semantics without buffering overhead (see the illustrative sketch at the end of this description)

### Why are the changes needed?

This optimization reduces memory overhead for large groups in stateful streaming operations that use `applyInPandasWithState`.

### Does this PR introduce any user-facing changes?

No, this is an internal optimization with no user-facing behavior changes.

### How was this patch tested?

- Verified Python syntax with `py_compile`
- Existing tests in `pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state` cover this code path

🤖 Generated with [Claude Code](https://claude.com/claude-code)
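For illustration, a minimal sketch of the before/after pattern, not the actual PySpark mapper: the `extract_key` helper and the plain-tuple "batches" are hypothetical stand-ins for the real key extraction and Arrow/pandas batches.

```python
from itertools import chain, tee


def extract_key(batch):
    # Hypothetical stand-in for the real key extraction done in the mapper.
    return batch[0]


# Previous pattern (sketch): tee keeps an internal buffer so both iterators
# can be consumed independently, even though the peek iterator is only ever
# used to read the first element.
def mapper_with_tee(batches):
    peek_iter, values = tee(iter(batches))
    key = extract_key(next(peek_iter))
    return key, values


# New pattern (sketch): consume the first element once, then prepend it back
# with chain so the values stream stays complete and lazy, with no tee buffer.
def mapper_with_chain(batches):
    batches = iter(batches)
    first = next(batches)
    key = extract_key(first)
    return key, chain((first,), batches)


# Tiny usage example with plain tuples standing in for batches.
key, values = mapper_with_chain([("k", 1), ("k", 2), ("k", 3)])
print(key, list(values))  # k [('k', 1), ('k', 2), ('k', 3)]
```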
