adriangb opened a new pull request, #22439:
URL: https://github.com/apache/datafusion/pull/22439

   ## Summary
   
   `repartition_file_min_size` gates how aggressively `repartitioned()` splits 
file groups by byte range to fan a scan out across `target_partitions` worth of 
cores. At 10 MiB the default leaves several SF1-sized dimension tables (TPC-H 
`part` ≈ 24 MiB, TPC-DS `customer_address` ≈ 7 MiB, …) on a single 
partition, so any CPU-bound per-batch work in the scan (filter eval, dictionary 
expansion, etc.) is single-threaded even when the cluster has plenty of idle 
cores.
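   
   To make the gating concrete, here is a minimal sketch of the idea. It is a simplified model, not DataFusion's actual `repartitioned()` code: `ByteRange` and `split_by_byte_range` are hypothetical names, and even splitting with ceiling division is an assumption made for illustration.

```rust
/// One mebibyte, in bytes.
const MIB: u64 = 1024 * 1024;

/// A half-open byte range `[start, end)` within a single file.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Simplified model of the threshold check.
///
/// Returns `None` when the file is smaller than `min_size`, meaning the
/// scan is left on a single partition. Otherwise the file is cut into at
/// most `target_partitions` contiguous ranges of roughly equal size.
fn split_by_byte_range(
    file_size: u64,
    target_partitions: usize,
    min_size: u64,
) -> Option<Vec<ByteRange>> {
    if file_size == 0 || target_partitions == 0 || file_size < min_size {
        return None;
    }

    // Ceiling division, so the ranges cover the whole file.
    let n = target_partitions as u64;
    let chunk = (file_size + n - 1) / n;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push(ByteRange { start, end });
        start = end;
    }
    Some(ranges)
}
```

   Under this model, a 7 MiB file with 12 target partitions is not split when the threshold is `10 * MIB`, and is split into 12 ranges when the threshold is `MIB`.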
   
   At 1 MiB those same files split cleanly into `target_partitions` byte 
ranges. The cost (more `open()` calls, more metadata loads) is small in 
absolute terms (≤10 extra opens per file in the worst case, each amortised over 
the row-group / page-index reads) and the existing knob is still available for 
workloads where it matters.
   
   ## Benchmark numbers
   
   Measured on a 12-core machine at SF1, with the existing dynamic-filter-pushdown defaults preserved:
   
   | Suite | default (10 MiB) | with this PR (1 MiB) |
   |---|---|---|
   | TPC-H total | 841 ms | 776 ms |
   | TPC-H Q22 | ~30 ms | ~17 ms |
   | TPC-DS total | 11.0 s | 11.1 s |
   | ClickBench total | 21.7 s | 19.0 s |
   
   ## Test plan
   
   - [x] `cargo test --test sqllogictests`: all 472 files pass after the 
information_schema snapshot and a csv_files reset.
   - [ ] `run benchmarks`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

