This is an automated email from the ASF dual-hosted git repository.
viirya pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.x by this push:
new b35ec3ac7122 [SPARK-56120][PYTHON][TEST][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy
b35ec3ac7122 is described below
commit b35ec3ac7122629cc207f89d57bf7982100fcc60
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed May 27 23:42:53 2026 -0700
[SPARK-56120][PYTHON][TEST][FOLLOWUP] Make _WindowAggArrowBenchMixin scenarios lazy
### What changes were proposed in this pull request?
Convert `_WindowAggArrowBenchMixin` in `python/benchmarks/bench_eval_type.py`
to the lazy `_scenario_configs` + `staticmethod _build_scenario(name)` pattern
used by every other mixin in the file, including the immediately following
`_WindowAggPandasBenchMixin`.
### Why are the changes needed?
The SPARK-56244 follow-up (commit 1c807ade4a4) removed the eager
`_scenarios = _build_scenarios()` assignment from all mixins so that importing
the benchmark module no longer materializes every scenario's Arrow data. This
is a prerequisite for accurate per-scenario `peakmem_*` readings under ASV:
ASV reports the max RSS observed in the worker process, so any import-time
allocation inflates every subsequent peakmem result.
SPARK-56120 (`78aaf11728b`, merged the day after the follow-up)
reintroduced the eager pattern in `_WindowAggArrowBenchMixin`, leaving it as
the only mixin in the file still doing class-body data construction. As a
result, `WindowAggArrowUDFPeakmemBench` readings are dominated by the global
import-time allocation rather than the per-scenario footprint.
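The difference between the two patterns can be sketched in isolation. This is a minimal illustration, not the code from `bench_eval_type.py`: `make_data`, `EagerMixin`, and `LazyMixin` are hypothetical stand-ins, and plain Python lists take the place of the Arrow batches built by `MockDataFactory.make_grouped_batches`.

```python
# Hypothetical stand-in for the real data factory; returns a plain list.
def make_data(n_rows):
    return [float(i) for i in range(n_rows)]


class EagerMixin:
    # Eager pattern: the class body runs when the module is imported, so
    # every scenario is built (and stays referenced) from import time on.
    _configs = {"small": 1_000, "large": 50_000}
    _scenarios = {name: make_data(n) for name, n in _configs.items()}
    params = [list(_scenarios)]


class LazyMixin:
    # Lazy pattern: only the cheap configuration lives in the class body.
    _scenario_configs = {"small": 1_000, "large": 50_000}
    params = [list(_scenario_configs)]

    @staticmethod
    def _build_scenario(name):
        """Build a single scenario on demand."""
        return make_data(LazyMixin._scenario_configs[name])


# The parameter lists are identical, but only the lazy class defers the work.
data = LazyMixin._build_scenario("small")
```

Both classes expose the same `params`, so the benchmark matrix is unchanged; only the point at which the data is allocated moves from import time to the first call that needs it.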
Measured locally with `tracemalloc`:
- before: import peak = 394.54 MiB
- after: import peak = 29.17 MiB
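A measurement of this kind can be reproduced roughly as follows. This is a sketch under assumptions: `peak_mib` is a hypothetical helper, not part of the patch, and the exact procedure used for the numbers above is not recorded here. Note that `tracemalloc` only sees allocations made through Python's memory allocators.

```python
import tracemalloc


def peak_mib(fn):
    """Run fn() and return the peak traced allocation in MiB (hypothetical helper)."""
    tracemalloc.start()
    try:
        fn()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


# Demonstration with a throwaway 8 MiB buffer; the peak is recorded even
# though the buffer is released as soon as the lambda returns.
demo_peak = peak_mib(lambda: bytearray(8 * 1024 * 1024))
```

For the module in this change, the callable would presumably be something like `lambda: importlib.import_module("python.benchmarks.bench_eval_type")`, run in a fresh interpreter so that a cached import does not hide the allocation.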
### Does this PR introduce _any_ user-facing change?
No. Test-only change in the benchmark module.
### How was this patch tested?
- Imported `python.benchmarks.bench_eval_type` and asserted the lazy
structure is in place (`_scenario_configs` present, `_scenarios` absent,
`_build_scenario` is a staticmethod).
- Ran `WindowAggArrowUDFTimeBench.setup` + `time_worker` for
`(many_groups_sm, few_groups_sm) x (sum_udf, mean_multi_udf)`.
- Ran `WindowAggArrowUDFPeakmemBench.setup` + `peakmem_worker` for
`many_groups_sm/sum_udf`.
- Compared import-time peak memory before/after (numbers above).
### Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: Claude Code (claude-opus-4-7)
Closes #56167 from viirya/SPARK-56120-followup.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 7da33f344f46ba9282f4030264b31b173dca4703)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
---
python/benchmarks/bench_eval_type.py | 47 ++++++++++++++++--------------------
1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/python/benchmarks/bench_eval_type.py b/python/benchmarks/bench_eval_type.py
index c75e4490d1ed..131ced87dfc8 100644
--- a/python/benchmarks/bench_eval_type.py
+++ b/python/benchmarks/bench_eval_type.py
@@ -1795,41 +1795,36 @@ class _WindowAggArrowBenchMixin:
return (pc.mean(col0).as_py() or 0) + (pc.mean(col1).as_py() or 0)
- def _build_scenarios():
- """Build scenarios for SQL_WINDOW_AGG_ARROW_UDF.
-
- Returns a dict mapping scenario name to ``(groups, schema)``.
- """
- scenarios = {}
-
- for name, (num_groups, rows_per_group, n_cols) in {
- "few_groups_sm": (50, 5_000, 5),
- "few_groups_lg": (50, 50_000, 5),
- "many_groups_sm": (2_000, 500, 5),
- "many_groups_lg": (500, 10_000, 5),
- "wide_cols": (200, 5_000, 20),
- }.items():
- groups, schema = MockDataFactory.make_grouped_batches(
- num_groups=num_groups,
- num_rows=rows_per_group,
- num_cols=n_cols,
- spark_type_pool=MockDataFactory.NUMERIC_TYPES,
- batch_size=rows_per_group,
- )
- scenarios[name] = (groups, schema)
+ _scenario_configs = {
+ "few_groups_sm": (50, 5_000, 5),
+ "few_groups_lg": (50, 50_000, 5),
+ "many_groups_sm": (2_000, 500, 5),
+ "many_groups_lg": (500, 10_000, 5),
+ "wide_cols": (200, 5_000, 20),
+ }
- return scenarios
+ @staticmethod
+ def _build_scenario(name):
+ """Build a single scenario by name."""
+ np.random.seed(42)
+ num_groups, rows_per_group, n_cols = _WindowAggArrowBenchMixin._scenario_configs[name]
+ return MockDataFactory.make_grouped_batches(
+ num_groups=num_groups,
+ num_rows=rows_per_group,
+ num_cols=n_cols,
+ spark_type_pool=MockDataFactory.NUMERIC_TYPES,
+ batch_size=rows_per_group,
+ )
- _scenarios = _build_scenarios()
_udfs = {
"sum_udf": _window_agg_arrow_sum,
"mean_multi_udf": _window_agg_arrow_mean_multi,
}
- params = [list(_scenarios), list(_udfs)]
+ params = [list(_scenario_configs), list(_udfs)]
param_names = ["scenario", "udf"]
def _write_scenario(self, scenario, udf_name, buf):
- groups, _schema = self._scenarios[scenario]
+ groups, _schema = self._build_scenario(scenario)
udf_func = self._udfs[udf_name]
# sum_udf uses 1 arg, mean_multi_udf uses 2 args