viirya opened a new pull request, #56194:
URL: https://github.com/apache/spark/pull/56194

   ### What changes were proposed in this pull request?
   
   Follow-up to SPARK-57137. Extend the Arrow/Pandas sibling base-mixin pattern 
to the remaining two pairs in `python/benchmarks/bench_eval_type.py`:
   
   - `_WindowAggPandasBenchMixin` now subclasses `_WindowAggArrowBenchMixin`
   - `_CogroupedMapPandasBenchMixin` now subclasses 
`_CogroupedMapArrowBenchMixin`
   
   The shared `_build_scenario` / `_write_scenario` are pulled up into the 
Arrow base, with the eval type parameterized via the `_eval_type` class 
attribute and `_build_scenario` converted from `@staticmethod` to 
`@classmethod` so subclasses read their own `_scenario_configs` (the same 
mechanism SPARK-57137 used for the Scalar and GroupedAgg pairs).
   
   - Window: the Pandas half drops `_scenario_configs` entirely (identical to 
the Arrow variant) and keeps only `_eval_type`, its UDFs, and `params`.
   - Cogroup: the Arrow `_udfs` values are normalized to `(func, n_args)` 
tuples to match the Pandas sibling, so the inherited `_write_scenario` works 
unchanged. The Pandas half keeps its own scaled-down `_scenario_configs` and 
the extra 3-arg `key_identity_udf` variant.
   
   Net diff: +39 / -102 lines.
   
   ### Why are the changes needed?
   
   These two pairs were intentionally left out of SPARK-57137 to avoid 
conflicting with two in-flight PRs (#56167 made the Window Arrow mixin 
scenarios lazy; #56171 renamed `wide_values` to `wide_cols` in the Cogroup 
pair). Both have since merged, so the pairs can now be folded into the same 
base-class pattern.
   
   Before this change, each sibling pair carried two near-identical 
`_write_scenario` bodies differing only in the `PythonEvalType.SQL_...` 
constant (and, for Cogroup, the UDF set) -- a known footgun where any 
protocol-writing change had to be applied in lock-step across both halves.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change in the benchmark module.
   
   ### How was this patch tested?
   
   - Structural: confirmed `_eval_type` resolves correctly via MRO for all four 
affected mixins; confirmed `_CogroupedMapPandasBenchMixin._scenario_configs` 
still holds the scaled-down pandas row counts (not the Arrow base's), the main 
MRO-resolution risk of switching `_build_scenario` to `@classmethod`.
   - Ran `setup` + `time_worker` end-to-end for all four affected `*TimeBench` 
classes across every UDF (including the Pandas-only 3-arg `key_identity_udf`).
   - Ran `peakmem_worker` (disk-replay path) for the Window and Cogroup Pandas 
classes.
   - Confirmed the generated wire bytes are byte-identical to the pre-refactor 
output for Window and Cogroup Arrow; the Cogroup Pandas UDF pickles differ only 
in the embedded class-hierarchy reference (same length, identical execution), 
matching what SPARK-57137 produced for the Scalar/GroupedAgg pairs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes. Generated-by: Claude Code (claude-opus-4-8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to