zhuqi-lucas opened a new pull request, #21956:
URL: https://github.com/apache/datafusion/pull/21956

   ## Which issue does this PR close?
   
   - Partial fix for #21399
   - Split from #21580 (per @alamb's request to break into smaller PRs)
   
   ## Rationale for this change
   
   TopK queries (`ORDER BY col LIMIT K`) on parquet files with multiple 
out-of-order row groups are slower than they need to be: the dynamic filter 
threshold converges slowly because row groups are read in arbitrary order. By 
reordering row groups so that the most promising ones (those whose statistics 
indicate they hold the best values for the sort order) are read first, the 
threshold tightens quickly and later row groups are pruned at runtime.
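
   To make the effect concrete, here is a minimal, self-contained simulation. It is not code from this PR: `RowGroup` and `topk_scan` are hypothetical names, and the pruning rule (skip a row group whose min cannot beat the current K-th best value) is a simplified stand-in for the dynamic filter.

```rust
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a parquet row group: only its values.
struct RowGroup {
    values: Vec<i64>,
}

impl RowGroup {
    /// Plays the role of the min statistic; `None` for an empty group.
    fn min(&self) -> Option<i64> {
        self.values.iter().copied().min()
    }
}

/// Simulates `ORDER BY v ASC LIMIT k` over row groups in the given order.
/// Returns the k smallest values (ascending) and how many row groups
/// were actually read rather than pruned.
fn topk_scan(row_groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
    // Max-heap holding the k smallest values seen so far; its top is the
    // current threshold once the heap is full.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    let mut groups_read = 0;

    for rg in row_groups {
        if heap.len() == k {
            if let (Some(threshold), Some(min)) = (heap.peek().copied(), rg.min()) {
                // Nothing in this row group can improve the result set.
                if min >= threshold {
                    continue;
                }
            }
        }
        groups_read += 1;
        for &v in &rg.values {
            if heap.len() < k {
                heap.push(v);
            } else if let Some(worst) = heap.peek().copied() {
                if v < worst {
                    heap.pop();
                    heap.push(v);
                }
            }
        }
    }
    (heap.into_sorted_vec(), groups_read)
}
```

   In this model, scanning three non-overlapping row groups in worst-first order reads all three, while scanning the same groups best-first reads only one, with identical results.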
   
   ## What changes are included in this PR?
   
   **Row group reorder by statistics:**
   - `PreparedAccessPlan::reorder_by_statistics()`: sorts row groups by min 
values (ASC) using parquet column statistics. Descending order (DESC) is 
handled by the existing `reverse()`, which is applied after the reorder. The 
two compose correctly for both sorted and unsorted data.
   - `AccessPlanOptimizer` trait: extensible interface for row group access 
plan optimizations applied after pruning.
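
   The sort-then-reverse composition can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `RowGroupStats`, `reorder_by_min`, and `plan_order` are hypothetical names, and statistics are modelled as plain `i64` values.

```rust
/// Hypothetical per-row-group statistics for a single sort column.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    /// Position of the row group in the file.
    index: usize,
    min: i64,
    max: i64,
}

/// Step 1: order row groups by their min statistic, ascending.
/// `sort_by_key` is stable, so groups with equal mins keep file order.
fn reorder_by_min(mut row_groups: Vec<RowGroupStats>) -> Vec<RowGroupStats> {
    row_groups.sort_by_key(|rg| rg.min);
    row_groups
}

/// Step 2: for a descending sort, reverse the ascending order.
/// Returns the row group indices in the order they would be read.
fn plan_order(row_groups: Vec<RowGroupStats>, descending: bool) -> Vec<usize> {
    let mut ordered = reorder_by_min(row_groups);
    if descending {
        ordered.reverse();
    }
    ordered.into_iter().map(|rg| rg.index).collect()
}
```

   Keeping the direction handling in a separate step means the statistics sort only needs one code path.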
   
   **DynamicFilter sort metadata:**
   - `DynamicFilterPhysicalExpr` gains `sort_options` and `fetch` fields, set 
by `SortExec::create_filter()`. This lets the parquet reader determine the 
reorder direction for any TopK query (not just the sort-pushdown path).
   - Fix: `SortExec::with_fetch` now sets fetch before calling 
`create_filter()` so the DynamicFilter gets the correct K value.
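
   The ordering issue behind the `with_fetch` fix can be illustrated with a toy model. All types here (`SortOptionsSketch`, `DynamicFilterSketch`, `SortSketch`) are hypothetical and greatly simplified; they only mirror the field names mentioned above and are not DataFusion's actual definitions.

```rust
/// Hypothetical, simplified sort options.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SortOptionsSketch {
    descending: bool,
    nulls_first: bool,
}

/// Hypothetical filter metadata: a snapshot taken when the filter is created.
#[derive(Debug, Clone, PartialEq)]
struct DynamicFilterSketch {
    sort_options: SortOptionsSketch,
    fetch: Option<usize>,
}

/// Hypothetical stand-in for a sort operator.
#[derive(Debug, Clone)]
struct SortSketch {
    sort_options: SortOptionsSketch,
    fetch: Option<usize>,
    filter: Option<DynamicFilterSketch>,
}

impl SortSketch {
    fn new(sort_options: SortOptionsSketch) -> Self {
        Self { sort_options, fetch: None, filter: None }
    }

    /// The filter copies whatever `fetch` is stored at call time.
    fn create_filter(&self) -> DynamicFilterSketch {
        DynamicFilterSketch { sort_options: self.sort_options, fetch: self.fetch }
    }

    /// Fixed ordering: store `fetch` first, then build the filter.
    fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        let filter = if fetch.is_some() { Some(self.create_filter()) } else { None };
        self.filter = filter;
        self
    }

    /// Problematic ordering, kept for comparison: the filter is built
    /// before `fetch` is stored, so it captures the stale value.
    fn with_fetch_stale(mut self, fetch: Option<usize>) -> Self {
        let filter = if fetch.is_some() { Some(self.create_filter()) } else { None };
        self.filter = filter;
        self.fetch = fetch;
        self
    }
}
```

   In this model, `with_fetch(Some(10))` yields a filter whose `fetch` is `Some(10)`, whereas the stale variant yields a filter whose `fetch` is still `None`.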
   
   **File-level reorder:**
   - `FileSource::reorder_files()` trait method + parquet implementation: 
reorders files in the shared work queue by statistics so multi-file TopK reads 
the most promising files first.
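
   A rough sketch of the same idea at file granularity follows. `FileSketch` and this free-standing `reorder_files` function are hypothetical and do not reflect the trait method's real signature; placing files without statistics at the end of the queue is an assumption made for the example, not something stated in this PR.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry: a file path plus an optional min statistic
/// for the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileSketch {
    path: String,
    min: Option<i64>,
}

/// Reorders a work queue so files with the most promising min statistic
/// come first. Files lacking statistics keep their relative order and go
/// last (an assumption of this sketch).
fn reorder_files(queue: VecDeque<FileSketch>, descending: bool) -> VecDeque<FileSketch> {
    let (mut with_stats, without_stats): (Vec<FileSketch>, Vec<FileSketch>) =
        queue.into_iter().partition(|f| f.min.is_some());

    // Ascending by min; reversed afterwards for a descending sort.
    with_stats.sort_by_key(|f| f.min);
    if descending {
        with_stats.reverse();
    }

    with_stats.into_iter().chain(without_stats).collect()
}
```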
   
   ## Are these changes tested?
   
   - SLT tests: Tests H (mixed RGs), J (scrambled non-overlapping), K 
(overlapping), L (multi-key ORDER BY)
   - All existing sort_pushdown SLTs pass
   - 98 parquet lib unit tests pass
   - Clippy clean, rustdoc clean
   
   ## Are there any user-facing changes?
   
   No API changes. TopK queries on parquet with multiple row groups will 
automatically benefit from better row group ordering. This is a performance 
optimization only — query results are unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

