Yicong-Huang opened a new pull request, #54250:
URL: https://github.com/apache/spark/pull/54250

   ### What changes were proposed in this pull request?
   
   Replace `itertools.tee` with `itertools.chain` in the 
`applyInPandasWithState` mapper function to eliminate unnecessary buffering 
overhead.
   
   The previous implementation used `tee` to peek at the first element for key
extraction while preserving lazy evaluation. However, `tee` maintains an
internal buffer to keep its two iterators independent, which adds memory
overhead here: only the first element is consumed from one iterator before
that iterator is discarded.
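
   For illustration, a minimal sketch of that tee-based peek pattern; the names
`mapper_old`, `batch_iter`, and `extract_key` are hypothetical and not the
actual PySpark internals:

   ```python
   import itertools

   def mapper_old(batch_iter, extract_key):
       # Peek at the first element via tee: two iterators over the same data are
       # created so one can be advanced for key extraction while the other is
       # handed downstream. tee buffers elements internally to keep the two
       # iterators independent, even though the peek iterator is used only once.
       peek_iter, values_iter = itertools.tee(batch_iter)
       key = extract_key(next(peek_iter))
       return key, values_iter
   ```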
   
   The new implementation:
   - Directly consumes the first element to extract keys
   - Uses `itertools.chain` to prepend the first element back onto the values
generator
   - Maintains lazy evaluation semantics without the buffering overhead (see the
sketch below)
   
   ### Why are the changes needed?
   
   This optimization reduces memory overhead for large groups in stateful 
streaming operations using `applyInPandasWithState`.
   
   ### Does this PR introduce any user-facing changes?
   
   No, this is an internal optimization with no user-facing behavior changes.
   
   ### How was this patch tested?
   
   - Verified Python syntax with `py_compile`
   - Existing tests in 
`pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state` cover this code 
path
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

