[PR] bench: Scale sort benchmarks to 10M rows to exercise merge path [datafusion]

via GitHub Tue, 14 Apr 2026 13:00:00 -0700


mbutrovich opened a new pull request, #21630:
URL: https://github.com/apache/datafusion/pull/21630


   ## Which issue does this PR close?
   
   - Partially addresses #21543. Also needed to properly evaluate the 
ExternalSorter refactor in #21629, which improves the merge path.
   
   ## Rationale for this change
   
   Current sort benchmarks use 100K rows across 8 partitions (~12.5K rows per 
partition, ~100KB for integers). This falls below the 
`sort_in_place_threshold_bytes` (1MB), so the "sort partitioned" benchmarks 
always take the concat-and-sort-in-place path and never exercise the 
sort-then-merge path that dominates real workloads.
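
The size estimate above can be checked with a little arithmetic. This is a minimal sketch, not DataFusion's actual decision logic: it assumes a single 8-byte integer column, ignores Arrow buffer overhead, and takes the 1MB threshold as 1024 * 1024 bytes. The names `partition_bytes` and `exceeds_in_place_threshold` are hypothetical helpers for illustration only.

```rust
/// Assumed value of `sort_in_place_threshold_bytes` (1MB, taken here as 1024 * 1024).
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1024 * 1024;

/// Approximate bytes held by one partition of a single fixed-width column.
/// Hypothetical helper; ignores validity bitmaps and allocation overhead.
fn partition_bytes(total_rows: usize, partitions: usize, bytes_per_value: usize) -> usize {
    (total_rows / partitions) * bytes_per_value
}

/// True when the estimated partition size is above the in-place threshold,
/// i.e. when the benchmark would be expected to reach the merge path.
fn exceeds_in_place_threshold(total_rows: usize, partitions: usize, bytes_per_value: usize) -> bool {
    partition_bytes(total_rows, partitions, bytes_per_value) > SORT_IN_PLACE_THRESHOLD_BYTES
}
```

Under these assumptions, 100K rows over 8 partitions gives about 100KB per partition (below the threshold), while 10M rows gives about 10MB per partition (above it).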
   
   ## What changes are included in this PR?
   
   Parameterizes the sort benchmark on input size, running each case at both 
100K rows (existing) and 10M rows (new). At 10M rows, each partition holds 
~1.25M rows (~10MB for integers), which exercises the merge path.
   
   - `INPUT_SIZE` constant replaced with `INPUT_SIZES` array: `[(100_000, 
"100k"), (10_000_000, "10M")]`
   - `DataGenerator` takes `input_size` as a constructor parameter
   - All stream generator functions accept `input_size`
   - Benchmark names include size label (e.g. `sort partitioned i64 100k`, 
`sort partitioned i64 10M`)
   - Data distribution and cardinality ratios are preserved across sizes
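
The parameterization listed above might look roughly like the following. This is a simplified, dependency-free sketch rather than the code in the PR: only `INPUT_SIZES`, `DataGenerator`, and the `input_size` parameter come from the description, while `NUM_PARTITIONS`, `rows_per_partition`, and `benchmark_names` are hypothetical names introduced for illustration.

```rust
/// Input sizes and their labels, as listed in the PR description.
const INPUT_SIZES: [(usize, &str); 2] = [(100_000, "100k"), (10_000_000, "10M")];

/// Hypothetical constant: the description mentions 8 partitions.
const NUM_PARTITIONS: usize = 8;

/// Stand-in for the benchmark's data generator, which now takes
/// the input size at construction instead of reading a global constant.
struct DataGenerator {
    input_size: usize,
}

impl DataGenerator {
    fn new(input_size: usize) -> Self {
        Self { input_size }
    }

    /// Hypothetical helper: rows each partition receives at this input size.
    fn rows_per_partition(&self) -> usize {
        self.input_size / NUM_PARTITIONS
    }
}

/// Builds one benchmark name per input size, e.g. "sort partitioned i64 10M".
fn benchmark_names(case: &str) -> Vec<String> {
    INPUT_SIZES
        .iter()
        .map(|(_, label)| format!("{} {}", case, label))
        .collect()
}
```

In the real benchmark, each generated name would presumably be registered with the benchmark harness alongside a `DataGenerator` built for the matching size.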
   
   ## Are these changes tested?
   
   Benchmark compiles and runs. No functional test changes.
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

