adriangb opened a new pull request, #22831:
URL: https://github.com/apache/datafusion/pull/22831
## Which issue does this PR close?
- Relates to the wide-schema metadata-overhead work (#21968).
## Rationale for this change
The `wide_schema` benchmark (added in #21970) defaults to **1024 columns ×
256 files**. On memory-constrained CI runners this drives peak RSS into the
tens of GB and OOMs the bench bot.
The cause is not the working set: a full local run of all four wide queries at
the original default peaks at only ~1 GB. It's **per-iteration allocator
churn**. Criterion runs the fast queries hundreds of thousands of times (e.g.
`Q02 ... LIMIT 1` runs thousands of iterations per sample window), and each
iteration opens every file and parses its column-chunk metadata across all
worker threads. The retained/fragmented memory scales with `num_files ×
width_factor`.
## What changes are included in this PR?
- Reduce the defaults to **512 columns × 64 files** (8× less metadata
churn). Local peak RSS for the full run drops from ~1.08 GB to ~0.70 GB; on a
12-thread runner the proportional reduction brings it well within budget.
- Expose the generator parameters as environment variables so runners can
  dial size up or down without editing `bench.sh`:
  - `WIDE_SCHEMA_WIDTH_FACTOR`
  - `WIDE_SCHEMA_NUM_FILES`
  - `WIDE_SCHEMA_ROWS_PER_FILE`
- Update `bench.sh` and `benchmarks/README.md` accordingly.
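
As a rough illustration of the two points above, the sketch below checks the 8× figure and shows one common way a script can fall back to a default when an environment variable is unset. It assumes column count is proportional to the width factor, so that columns × files is a fair proxy for `num_files × width_factor`. The `resolve_num_files` helper and the override value of 16 are hypothetical and are not taken from `bench.sh`, which may resolve these variables differently.

```shell
#!/usr/bin/env bash
# Sketch only: not the actual bench.sh logic.

# Relative metadata churn, using columns x files as a proxy (assumption).
old_cost=$(( 1024 * 256 ))   # original default: 1024 columns x 256 files
new_cost=$(( 512 * 64 ))     # new default: 512 columns x 64 files
reduction=$(( old_cost / new_cost ))
echo "churn reduction: ${reduction}x"

# Hypothetical helper: use the environment value if set,
# otherwise fall back to the new default of 64 files.
resolve_num_files() {
    echo "${WIDE_SCHEMA_NUM_FILES:-64}"
}

unset WIDE_SCHEMA_NUM_FILES
default_files="$(resolve_num_files)"
# 16 is an arbitrary illustrative override for a small runner.
override_files="$(WIDE_SCHEMA_NUM_FILES=16 resolve_num_files)"
echo "files: default=${default_files} override=${override_files}"
```

The `${VAR:-default}` expansion keeps the default in one place while letting a runner shrink or grow the dataset from its own environment.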
## Are these changes tested?
This is benchmark tooling only (no library code). Verified manually:
- The full suite runs at the new defaults and produces valid results.
- The benchmark's signal — the wide-vs-narrow ratio — is preserved at the
smaller size:
| Query | wide | narrow | ratio |
| --- | --- | --- | --- |
| Q01 (TopK) | 22.5 ms | 19.4 ms | 1.16× |
| Q02 (`LIMIT 1`) | 3.13 ms | 1.55 ms | 2.02× |
| Q03 | 3.57 ms | 1.86 ms | 1.92× |
| Q04 | 13.1 ms | 7.09 ms | 1.85× |
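
The ratio column can be re-derived from the wide and narrow timings in the table. The following is a small sketch using `awk` for the floating-point division; the `ratio` function is a hypothetical helper written for this check and is not part of the benchmark tooling.

```shell
#!/usr/bin/env bash
# Sketch: recompute wide/narrow ratios from the timings reported above (ms).

# Hypothetical helper: prints wide / narrow rounded to two decimals.
ratio() {
    awk -v wide="$1" -v narrow="$2" 'BEGIN { printf "%.2f\n", wide / narrow }'
}

q01="$(ratio 22.5 19.4)"
q02="$(ratio 3.13 1.55)"
q03="$(ratio 3.57 1.86)"
q04="$(ratio 13.1 7.09)"
echo "Q01=${q01} Q02=${q02} Q03=${q03} Q04=${q04}"
```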
## Are there any user-facing changes?
No public API changes. Benchmark defaults and docs change only.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]