Yicong-Huang opened a new pull request, #54239: URL: https://github.com/apache/spark/pull/54239
### What changes were proposed in this pull request?

This PR optimizes the `wrap_grouped_map_pandas_udf` function in `python/pyspark/worker.py` to fix a 3x performance regression in `applyInPandas` for workloads with large groups and few columns.

The optimization replaces the double concat pattern (concat within each batch, then concat across batches) with a single concat per column, avoiding the expensive `pd.concat(axis=0)` across hundreds of intermediate DataFrames.

### Why are the changes needed?

After SPARK-54316 consolidated `GroupPandasIterUDFSerializer` with `GroupPandasUDFSerializer`, the double concat pattern was introduced:

1. First concat: `pd.concat(value_series, axis=1)` for each batch
2. Second concat: `pd.concat(value_dataframes, axis=0)` across all batches

For large groups (millions of rows), the second concat becomes extremely expensive, causing a 73% performance regression (4.38s → 7.57s) in production workloads and a 3x slowdown in benchmarks.

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization with no API or behavior changes.

### How was this patch tested?

Unit tests with synthetic data (5M rows, 3 columns, 500 batches):

- Before: 0.226s
- After: 0.075s
- Improvement: 3x faster, 25% less memory

Existing PySpark tests pass without modification, confirming functional correctness.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Co-generated with Claude Sonnet 4.5 for performance analysis and optimization.
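
For reviewers, the two concat patterns described under "Why are the changes needed?" can be sketched in plain pandas as below. This is a minimal illustration, not the code in `worker.py`: the function names `double_concat` and `single_concat_per_column`, the toy `batches` input, and the use of `ignore_index=True` are all assumptions made for the example.

```python
import pandas as pd


def double_concat(batches):
    """Pattern described as the regression (hypothetical helper).

    Builds one DataFrame per batch (axis=1), then stacks all of
    those intermediate DataFrames row-wise (axis=0).
    """
    frames = [pd.concat(series_list, axis=1) for series_list in batches]
    return pd.concat(frames, axis=0, ignore_index=True)


def single_concat_per_column(batches):
    """Pattern described as the fix (hypothetical helper).

    Concatenates each column once across all batches, then joins the
    resulting full-length columns side by side.
    """
    num_cols = len(batches[0])
    columns = [
        pd.concat([batch[i] for batch in batches], ignore_index=True)
        for i in range(num_cols)
    ]
    return pd.concat(columns, axis=1)


# Toy input: 3 batches, each a list of Series (one Series per column).
batches = [
    [
        pd.Series(range(b * 4, b * 4 + 4), name="id"),
        pd.Series([float(b)] * 4, name="value"),
    ]
    for b in range(3)
]

double = double_concat(batches)
single = single_concat_per_column(batches)
```

Both helpers produce the same DataFrame for this input; the difference is only in how many intermediate DataFrames are created along the way. The sketch does not attempt to reproduce the timings quoted above.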
