Yicong-Huang opened a new pull request, #54239:
URL: https://github.com/apache/spark/pull/54239

   ### What changes were proposed in this pull request?
   
   This PR optimizes the `wrap_grouped_map_pandas_udf` function in 
`python/pyspark/worker.py` to fix a 3x performance regression in 
`applyInPandas` for workloads with large groups and few columns.
   
   The optimization changes the double concat pattern (concat within batch + 
concat across batches) to a single concat per column approach, avoiding the 
expensive `pd.concat(axis=0)` across hundreds of intermediate DataFrames.
   
   ### Why are the changes needed?
   
   After SPARK-54316 consolidated `GroupPandasIterUDFSerializer` with 
`GroupPandasUDFSerializer`, the double concat pattern was introduced:
   1. First concat: `pd.concat(value_series, axis=1)` for each batch
   2. Second concat: `pd.concat(value_dataframes, axis=0)` across all batches
   
   For large groups (millions of rows), the second concat becomes extremely 
expensive, causing 73% performance regression (4.38s → 7.57s) in production 
workloads and 3x slowdown in benchmarks.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a performance optimization with no API or behavior changes.
   
   ### How was this patch tested?
   
   Unit tests with synthetic data (5M rows, 3 columns, 500 batches):
   - Before: 0.226s
   - After: 0.075s
   - Improvement: 3x faster, 25% less memory
   
   Existing PySpark tests pass without modification, confirming functional 
correctness.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes. Co-generated with Claude Sonnet 4.5 for performance analysis and 
optimization.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to