viirya opened a new pull request, #56470:
URL: https://github.com/apache/spark/pull/56470

   ### What changes were proposed in this pull request?
   
   Follow-up to SPARK-57137/SPARK-57138. Apply the two conventions those PRs 
established to the remaining mixins in `python/benchmarks/bench_eval_type.py`:
   
   1. Convert the seven remaining `@staticmethod` `_build_scenario` definitions 
to `@classmethod`, replacing the hardcoded 
`_XBenchMixin._scenario_configs[name]` lookups with 
`cls._scenario_configs[name]` (`_ArrowBatchedBenchMixin`, 
`_ArrowUDTFBenchMixin`, `_ArrowTableUDFBenchMixin`, 
`_GroupedMapArrowBenchMixin`, `_GroupedMapPandasBenchMixin`, 
`_MapArrowIterBenchMixin`, `_MapPandasIterBenchMixin`).
   2. Add the `_eval_type` class attribute to the GroupedMap Arrow/Pandas bases 
and parameterize their `_write_scenario` on `self._eval_type`. This lets 
`_GroupedMapArrowIterBenchMixin` and `_GroupedMapPandasIterBenchMixin` drop 
their `_write_scenario` overrides, which were copies of the parent bodies 
differing only in the `PythonEvalType.SQL_...` constant.
   
   Net diff: +27 / -53 lines.
   
   ### Why are the changes needed?
   
   The hardcoded class-name lookup is a footgun: if a subclass overrides 
`_scenario_configs`, the inherited static `_build_scenario` silently reads the 
parent's configs instead. This is the exact failure mode the 
already-deduplicated pairs (Scalar, GroupedAgg, Window, Cogroup) avoid by 
resolving configs via `cls`. The two GroupedMap Iter subclasses also carried 
near-identical copies of `_write_scenario`, so any protocol-writing change had 
to be applied in lock-step across parent and child.
   
   After this change every `_build_scenario` in the file resolves configs via 
`cls` and no redundant `_write_scenario` copies remain.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change in the benchmark module.
   
   ### How was this patch tested?
   
   - Verified the generated worker input is byte-identical to the pre-refactor 
output for all 20 `*TimeBench` classes across every (scenario, udf) cell, with 
the cloudpickled UDF command excluded (it embeds `co_firstlineno` and a random 
cloudpickle class-tracker id, both location/identity metadata unaffected by 
this change).
   - Compared the pickle opcode streams of all 60 pickled UDF/UDTF callables 
between old and new: the only differences are the metadata above; code, 
constants, and names are unchanged.
   - Ran `setup` + `time_worker` end-to-end for all 9 affected `*TimeBench` 
classes across every UDF, and `peakmem_worker` for the two Iter classes that 
dropped `_write_scenario`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes. Generated-by: Claude Code (claude-opus-4-8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to