[PR] [SPARK-56691][PYTHON] Refactor SQL_GROUPED_MAP_PANDAS_ITER_UDF [spark]

via GitHub Mon, 04 May 2026 14:10:19 -0700


Yicong-Huang opened a new pull request, #55675:
URL: https://github.com/apache/spark/pull/55675


   ### What changes were proposed in this pull request?
   
   Refactor `SQL_GROUPED_MAP_PANDAS_ITER_UDF` to be self-contained in 
`read_udfs()`.
   
   ### Why are the changes needed?
   
   Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each 
eval type self-contained in `read_udfs()` improves readability and makes it 
easier to reason about the data flow for each eval type independently.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change.
   
   ASV benchmark (`GroupedMapPandasIterUDFTimeBench`, single run with `-a 
repeat=5`):
   
   ```text
   master: 4b3f8c3796e  vs  PR: 29538fd7980
   
   Time (ms, lower = better)
   scenario           udf                   master       PR       diff
   sm_grp_few_col     identity_udf            447.4    441.0    -1.43%
   sm_grp_few_col     sort_udf                499.5    498.8    -0.14%
   sm_grp_few_col     key_identity_udf        449.9    411.8    -8.46%
   sm_grp_many_col    identity_udf            358.3    375.5    +4.79%
   sm_grp_many_col    sort_udf                378.5    388.7    +2.70%
   sm_grp_many_col    key_identity_udf        371.3    341.1    -8.14%
   lg_grp_few_col     identity_udf            802.7    791.6    -1.39%
   lg_grp_few_col     sort_udf                993.7    949.8    -4.42%
   lg_grp_few_col     key_identity_udf        682.4    691.2    +1.30%
   lg_grp_many_col    identity_udf            928.7    911.1    -1.89%
   lg_grp_many_col    sort_udf               1010.4    963.1    -4.69%
   lg_grp_many_col    key_identity_udf        897.8    919.7    +2.44%
   mixed_types        identity_udf            446.2    431.3    -3.34%
   mixed_types        sort_udf                471.2    450.0    -4.50%
   mixed_types        key_identity_udf        399.8    383.4    -4.10%
   SUM                                       9137.8   8948.1    -2.08%
   ```
   
   The aggregate time improved slightly (-2.08%); per-scenario differences 
range from -8.46% to +4.79%, which is within run-to-run noise.
   
   Peakmem benchmark (`GroupedMapPandasIterUDFPeakmemBench`) was essentially 
flat (SUM -0.02%).
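
For readers checking the `diff` column in the timing table, here is a minimal sketch of how it can be recomputed. It assumes the column is the relative change `(PR - master) / master`; the PR does not state the formula, and `pct_diff` is a hypothetical helper name, not part of the benchmark suite. Because the table shows times rounded to 0.1 ms, values recomputed this way can differ from the printed ones in the last digit.

```python
def pct_diff(master_ms: float, pr_ms: float) -> float:
    """Relative change of PR vs. master in percent (negative = PR is faster).

    Assumed formula; the PR description does not spell it out.
    """
    return (pr_ms - master_ms) / master_ms * 100.0


# Rows copied from the timing table above: (master ms, PR ms).
rows = {
    "sm_grp_few_col / identity_udf": (447.4, 441.0),
    "lg_grp_few_col / sort_udf": (993.7, 949.8),
    "SUM": (9137.8, 8948.1),
}

for name, (master_ms, pr_ms) in rows.items():
    # Print with an explicit sign, matching the table's convention.
    print(f"{name:32s} {pct_diff(master_ms, pr_ms):+.2f}%")
```

A negative value means the PR run was faster than master for that scenario.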
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


