[PR] bench: shrink wide_schema benchmark defaults and make size tunable [datafusion]

via GitHub Mon, 08 Jun 2026 15:12:22 -0700


adriangb opened a new pull request, #22831:
URL: https://github.com/apache/datafusion/pull/22831


   ## Which issue does this PR close?
   
   - Relates to the wide-schema metadata-overhead work (#21968).
   
   ## Rationale for this change
   
   The `wide_schema` benchmark (added in #21970) defaults to **1024 columns × 
256 files**. On memory-constrained CI runners this drives peak RSS into the 
tens of GB and OOMs the bench bot.
   
   The cause is not the working set: a full local run of all four wide queries at the original default peaks at only ~1 GB. The cause is **per-iteration allocator churn**. Criterion runs the fast queries hundreds of thousands of times (e.g. `Q02 ... LIMIT 1` runs thousands of iterations per sample window), and each iteration opens every file and parses its column-chunk metadata across all worker threads. The retained/fragmented memory scales with `num_files × width_factor`.
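
The scaling relationship above can be sanity-checked with a quick calculation. This is a minimal sketch, not code from the PR: it uses the column count as a stand-in for `width_factor` and assumes one column-chunk metadata entry per (file, column) pair. The helper name `metadata_units` is hypothetical.

```python
def metadata_units(num_files: int, num_columns: int) -> int:
    # Assumption: one column-chunk metadata entry per (file, column) pair,
    # i.e. a single row group per file. Real files may differ.
    return num_files * num_columns


# Original default from the description: 1024 columns x 256 files.
original = metadata_units(num_files=256, num_columns=1024)

# Illustrative only: halving the file count halves the metadata parsed
# per iteration, which is the linear scaling the text describes.
half_files = metadata_units(num_files=128, num_columns=1024)

print(original)               # 262144
print(original / half_files)  # 2.0
```

Under these assumptions every benchmark iteration touches roughly a quarter of a million metadata entries at the original default, which is why the iteration count matters more than the working set.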
   
   ## What changes are included in this PR?
   
   - Reduce the defaults to **512 columns × 64 files**, which is 8× less metadata churn (1024 × 256 / (512 × 64) = 8). Local peak RSS for the full run drops from ~1.08 GB to ~0.70 GB; on a 12-thread runner the proportional reduction brings it well within budget.
   - Expose the generator parameters as environment variables so runners can 
dial size up or down without editing `bench.sh`:
     - `WIDE_SCHEMA_WIDTH_FACTOR`
     - `WIDE_SCHEMA_NUM_FILES`
     - `WIDE_SCHEMA_ROWS_PER_FILE`
   - Update `bench.sh` and `benchmarks/README.md` accordingly.
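
The override behavior described in the list above can be sketched as follows. This is an illustration, not the actual `bench.sh` logic: it assumes an unset or empty variable falls back to a default, and the function name `resolve_sizes` and the placeholder defaults for width factor and rows per file are hypothetical (only the 64-file default is stated in this description).

```python
import os
from typing import Dict, Mapping, Optional

# Variable names are taken from the PR description.
SIZE_VARS = (
    "WIDE_SCHEMA_WIDTH_FACTOR",
    "WIDE_SCHEMA_NUM_FILES",
    "WIDE_SCHEMA_ROWS_PER_FILE",
)


def resolve_sizes(
    defaults: Mapping[str, int],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, int]:
    """Return the effective generator sizes: env value if set, else default."""
    env = os.environ if env is None else env
    sizes = {}
    for name in SIZE_VARS:
        raw = env.get(name, "")
        # Assumption: an empty value is treated the same as an unset one.
        sizes[name] = int(raw) if raw else defaults[name]
    return sizes


# PLACEHOLDER defaults: only NUM_FILES=64 comes from the description;
# the other two values are made up for the example.
placeholder_defaults = {
    "WIDE_SCHEMA_WIDTH_FACTOR": 1,
    "WIDE_SCHEMA_NUM_FILES": 64,
    "WIDE_SCHEMA_ROWS_PER_FILE": 1000,
}

sizes = resolve_sizes(placeholder_defaults, env={"WIDE_SCHEMA_NUM_FILES": "16"})
print(sizes["WIDE_SCHEMA_NUM_FILES"])  # 16
```

A runner that needs an even smaller data set would export only the variables it wants to change; the rest keep their defaults.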
   
   ## Are these changes tested?
   
   This is benchmark tooling only (no library code). Verified manually:
   
   - The full suite runs at the new defaults and produces valid results.
   - The benchmark's signal — the wide-vs-narrow ratio — is preserved at the 
smaller size:
   
     | Query | Wide | Narrow | Ratio (wide / narrow) |
     | --- | --- | --- | --- |
     | Q01 (TopK) | 22.5 ms | 19.4 ms | 1.16× |
     | Q02 (`LIMIT 1`) | 3.13 ms | 1.55 ms | 2.02× |
     | Q03 | 3.57 ms | 1.86 ms | 1.92× |
     | Q04 | 13.1 ms | 7.09 ms | 1.85× |
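
The ratio column can be reproduced directly from the reported timings. A minimal sketch, assuming the ratio is simply wide time divided by narrow time, rounded to two decimals; the names `timings_ms` and `wide_narrow_ratio` are illustrative.

```python
# (wide_ms, narrow_ms) pairs copied from the table above.
timings_ms = {
    "Q01": (22.5, 19.4),
    "Q02": (3.13, 1.55),
    "Q03": (3.57, 1.86),
    "Q04": (13.1, 7.09),
}


def wide_narrow_ratio(wide_ms: float, narrow_ms: float) -> float:
    # Overhead factor of the wide schema relative to the narrow one.
    return wide_ms / narrow_ms


ratios = {
    query: round(wide_narrow_ratio(wide, narrow), 2)
    for query, (wide, narrow) in timings_ms.items()
}
print(ratios)  # {'Q01': 1.16, 'Q02': 2.02, 'Q03': 1.92, 'Q04': 1.85}
```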
   
   ## Are there any user-facing changes?
   
   No public API changes. Benchmark defaults and docs change only.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)



