[PR] [SPARK-56718][PYTHON] Refactor SQL_COGROUPED_MAP_PANDAS_UDF [spark]

via GitHub Mon, 04 May 2026 13:22:39 -0700


Yicong-Huang opened a new pull request, #55674:
URL: https://github.com/apache/spark/pull/55674


   ### What changes were proposed in this pull request?
   
   Refactor `SQL_COGROUPED_MAP_PANDAS_UDF` so that its processing is 
self-contained in `read_udfs()`: the logic previously in 
`CogroupPandasUDFSerializer` moves into `worker.py`, and 
`ArrowStreamCoGroupSerializer` is used for pure I/O only.
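
As a rough illustration of this split (not the actual code in `worker.py`), the sketch below uses plain Python lists as stand-ins for Arrow batches. The names `read_cogroups`, `run_cogrouped_map`, and `concat_udf` are hypothetical. Only the idea comes from this PR: the serializer side just yields raw (left, right) group pairs, while the UDF call and result checking sit with the eval-type handling.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Batch = List[Dict[str, int]]  # plain-Python stand-in for one Arrow batch
CoGroup = Tuple[Batch, Batch]  # (left group, right group) sharing a key


def read_cogroups(stream: Iterable[CoGroup]) -> Iterator[CoGroup]:
    """Serializer role (pure I/O): hand over each pair untouched."""
    for left, right in stream:
        yield left, right


def run_cogrouped_map(
    udf: Callable[[Batch, Batch], Batch], cogroups: Iterable[CoGroup]
) -> Iterator[Batch]:
    """Worker role: call the UDF per cogroup and validate its result here."""
    for left, right in cogroups:
        result = udf(left, right)
        if not isinstance(result, list):
            raise TypeError(
                f"UDF must return a list of rows, got {type(result).__name__}"
            )
        yield result


def concat_udf(left: Batch, right: Batch) -> Batch:
    """Hypothetical UDF: rows of the left group followed by rows of the right."""
    return left + right


# Two cogroups: key 1 has rows on both sides, key 2 only on the left.
stream = [
    ([{"id": 1, "v": 10}], [{"id": 1, "v": 100}]),
    ([{"id": 2, "v": 20}], []),
]
results = list(run_cogrouped_map(concat_udf, read_cogroups(stream)))
print(results)
```

Keeping `read_cogroups` free of any conversion or validation is what lets the whole data flow for one eval type be read in a single place.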
   
   ### Why are the changes needed?
   
   Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each 
eval type self-contained in `read_udfs()` improves readability and makes it 
easier to reason about the data flow for each eval type independently.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change.
   
   ASV benchmark (`CogroupedMapPandasUDFTimeBench`), 3 runs each, averaged:
   
   ```text
   master: ebfdf8b6b4  vs  PR: 87d96ebe40
   
   Time (ms, lower = better)
   scenario          udf                  master      PR        diff
   ---------------------------------------------------------------------
   few_groups_sm     identity_udf          160.3     159.0   -0.81%
   few_groups_sm     concat_udf            181.0     181.0    0.00%
   few_groups_sm     left_semi_udf         182.7     181.7   -0.55%
   few_groups_sm     key_identity_udf      160.3     162.0   +1.04%
   few_groups_lg     identity_udf          413.3     411.3   -0.48%
   few_groups_lg     concat_udf            545.7     537.3   -1.54%
   few_groups_lg     left_semi_udf         508.3     500.7   -1.51%
   few_groups_lg     key_identity_udf      412.3     412.3    0.00%
   many_groups_sm    identity_udf         1356.7    1356.7    0.00%
   many_groups_sm    concat_udf           1480.0    1473.3   -0.45%
   many_groups_sm    left_semi_udf        1496.7    1503.3   +0.44%
   many_groups_sm    key_identity_udf     1370.0    1363.3   -0.49%
   many_groups_lg    identity_udf          765.3     760.7   -0.61%
   many_groups_lg    concat_udf            922.7     908.7   -1.52%
   many_groups_lg    left_semi_udf         887.0     886.0   -0.11%
   many_groups_lg    key_identity_udf      764.7     767.0   +0.30%
   wide_values       identity_udf         1073.3    1073.3    0.00%
   wide_values       concat_udf           1276.7    1266.7   -0.78%
   wide_values       left_semi_udf        1130.0    1130.0    0.00%
   wide_values       key_identity_udf     1080.0    1073.3   -0.62%
   multi_key         identity_udf          425.3     423.7   -0.39%
   multi_key         concat_udf            473.7     473.0   -0.14%
   multi_key         left_semi_udf         463.0     451.3   -2.52%
   multi_key         key_identity_udf      430.3     428.3   -0.46%
   ---------------------------------------------------------------------
   SUM                                  17959.3   17883.9   -0.42%
   ```
   
   The aggregate is essentially flat (-0.42%), and the per-scenario 
differences (-2.52% to +1.04%) are within run-to-run noise.
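
For reference, the `diff` column is the relative change of the PR time against master. A minimal sketch of that arithmetic, using three rows copied from the table above (the helper name `pct_diff` is only for illustration):

```python
def pct_diff(master_ms: float, pr_ms: float) -> float:
    """Relative change of PR vs master in percent; negative means the PR is faster."""
    return (pr_ms - master_ms) / master_ms * 100.0


# (scenario, udf, master ms, PR ms), taken from the benchmark table.
rows = [
    ("few_groups_sm", "identity_udf", 160.3, 159.0),
    ("few_groups_lg", "concat_udf", 545.7, 537.3),
    ("many_groups_sm", "left_semi_udf", 1496.7, 1503.3),
]

for scenario, udf, master_ms, pr_ms in rows:
    print(f"{scenario:<16}{udf:<16}{pct_diff(master_ms, pr_ms):+.2f}%")
# few_groups_sm   identity_udf    -0.81%
# few_groups_lg   concat_udf      -1.54%
# many_groups_sm  left_semi_udf   +0.44%
```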
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

