[PR] [SPARK-57381][PYTHON] Refactor SQL_WINDOW_AGG_PANDAS_UDF [spark]

via GitHub Wed, 10 Jun 2026 16:57:00 -0700


Yicong-Huang opened a new pull request, #56442:
URL: https://github.com/apache/spark/pull/56442


   ### What changes were proposed in this pull request?
   
   Refactor `SQL_WINDOW_AGG_PANDAS_UDF` so that it is self-contained in 
`read_udfs()`. The bounded and unbounded window logic moves out of the wrapper 
functions and the old mapper into a single execution block, which uses 
`ArrowStreamGroupSerializer` purely for I/O. The Arrow-to-pandas conversion 
(previously in `ArrowStreamAggPandasUDFSerializer.load_stream`) and the 
pandas-to-Arrow conversion (previously in the inherited `dump_stream`) now 
happen directly in the execution block via `ArrowBatchTransformer.to_pandas` 
and `PandasToArrowConversion.convert`. The wrapper functions 
`wrap_window_agg_pandas_udf`, `wrap_unbounded_window_agg_pandas_udf` and 
`wrap_bounded_window_agg_pandas_udf` are removed. As a result, 
`ArrowStreamAggPandasUDFSerializer` is now used only by 
`SQL_GROUPED_AGG_PANDAS_ITER_UDF`.
   
   This follows the same pattern as #55153 (`SQL_WINDOW_AGG_ARROW_UDF`).
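
For readers unfamiliar with the bounded/unbounded distinction mentioned above, the following is a simplified, pandas-only illustration and not the code in this PR. The helper names `eval_unbounded` and `eval_bounded` and the `begin`/`end` index lists are hypothetical; the sketch assumes that an unbounded frame lets the UDF run once per partition, while a bounded frame requires one call per row on a slice of the partition. The Arrow conversion steps are omitted.

```python
import pandas as pd


def eval_unbounded(udf, values: pd.Series) -> pd.Series:
    """Unbounded frame: every row sees the whole partition."""
    result = udf(values)  # evaluate the aggregate once per partition
    # Repeat the single scalar so the output has one value per input row.
    return pd.Series([result] * len(values), index=values.index)


def eval_bounded(udf, values: pd.Series, begin, end) -> pd.Series:
    """Bounded frame: row i sees values[begin[i]:end[i]] (end exclusive)."""
    results = [udf(values.iloc[b:e]) for b, e in zip(begin, end)]
    return pd.Series(results, index=values.index)


values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Unbounded: the sum of the whole partition is repeated on every row.
total = eval_unbounded(lambda s: s.sum(), values)

# Bounded, equivalent to ROWS BETWEEN 1 PRECEDING AND CURRENT ROW.
begin = [0, 0, 1, 2]
end = [1, 2, 3, 4]
rolling = eval_bounded(lambda s: s.sum(), values, begin, end)

print(total.tolist())
print(rolling.tolist())
```

The per-row slicing in the bounded case is why it is usually the more expensive of the two paths.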
   
   ### Why are the changes needed?
   
   Part of [SPARK-55388](https://issues.apache.org/jira/browse/SPARK-55388).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change.
   
   ASV micro-benchmark comparison (`WindowAggPandasUDFTimeBench`, 3 runs each 
on master vs this branch, averaged):
   
   ```text
   scenario         udf           before (ms)   after (ms)  diff (%)
   few_groups_sm    sum                  56.0         46.9    -16.3%
   few_groups_sm    mean_multi           61.7         52.0    -15.8%
   few_groups_lg    sum                 113.7         99.1    -12.8%
   few_groups_lg    mean_multi          121.7        107.0    -12.0%
   many_groups_sm   sum                2043.3       1663.3    -18.6%
   many_groups_sm   mean_multi         2350.0       1873.3    -20.3%
   many_groups_lg   sum                 601.3        474.7    -21.1%
   many_groups_lg   mean_multi          654.0        533.7    -18.4%
   wide_cols        sum                 574.7        476.0    -17.2%
   wide_cols        mean_multi          562.0        477.7    -15.0%
   ```
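
The `diff (%)` column appears to be the relative change from the `before` timing to the `after` timing. A minimal sketch of that calculation follows, assuming this definition; `pct_diff` is a hypothetical helper, not part of the benchmark suite. Recomputing from the rounded averages shown in the table can differ from the reported value in the last digit.

```python
def pct_diff(before_ms: float, after_ms: float) -> float:
    """Relative change in percent; negative means the branch is faster."""
    return (after_ms - before_ms) / before_ms * 100.0


# Two rows copied from the table above: (before ms, after ms).
rows = {
    ("many_groups_sm", "sum"): (2043.3, 1663.3),
    ("many_groups_lg", "sum"): (601.3, 474.7),
}

for (scenario, udf), (before, after) in rows.items():
    print(f"{scenario:<16} {udf:<10} {pct_diff(before, after):+.1f}%")
```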
   
   Peak memory (`WindowAggPandasUDFPeakmemBench`): no change. Peak usage ranges 
from 87M to 228M across scenarios, and the before and after values match within 
noise.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

