EeshanBembi opened a new pull request, #16905:
URL: https://github.com/apache/datafusion/pull/16905

   ## Summary
   
   This PR optimizes sort execution by automatically using `PartialSortExec` 
instead of `SortExec` when input data is already sorted on a prefix of the 
requested sort columns. This addresses a significant performance issue 
described in #16899 where DataFusion would unnecessarily resort entire datasets 
even when partial sorting would suffice.
   
   ## Changes
   
   ### Physical Planning Optimization
   - **Smart Sort Detection**: Enhanced `DefaultPhysicalPlanner` to recognize 
when input data has compatible prefix ordering
   - **Safety Guards**: Added `is_safe_for_partial_sort_optimization()` with 
conservative checks to ensure ordering reliability across different plan types
   - **Prefix Matching**: Implemented `find_sort_prefix_length()` to calculate 
the longest compatible sort prefix between input and requested orderings
   
   ### Optimization Scope
   - **High Performance**: `PartialSortExec` sorts only within groups instead 
of resorting entire datasets, dramatically reducing memory footprint and CPU 
usage
   - **Reliable Application**: Conservative safety checks prevent optimization 
when ordering guarantees might be compromised (e.g., after certain join 
operations)
   - **Wide Coverage**: Supports optimization across table scans, projections, 
filters, window operations, aggregations, and merge joins
   
   ### Comprehensive Testing
   - **Unit Tests**: New test suite in `sort_optimization.rs` validates 
optimization behavior across multiple scenarios
   - **Integration Updates**: Refreshed expectations in `group_by.slt`, 
`joins.slt`, `order.slt`, `subquery_sort.slt`, `topk.slt`, and `window.slt` to 
reflect improved plans
   - **Memory Efficiency**: Updated memory limit tests to demonstrate 
`PartialSortExec` efficiency gains
   - **Real-world Validation**: Confirmed the optimization resolves the exact 
use case from issue #16899
   
   ## Performance Impact
   
   **Previous Behavior**: DataFusion would use `SortExec` to resort the 
complete dataset even when data was already partially ordered
   
   **Optimized Behavior**: `PartialSortExec` leverages existing order by 
sorting only within partition groups, delivering:
   - **Memory Efficiency**: Eliminates need to buffer complete datasets in 
memory
   - **CPU Optimization**: Reduces computational overhead by avoiding 
unnecessary full sorts  
   - **Query Speed**: Faster execution for queries with prefix-compatible sort 
requirements
   
   ## Example
   
   Reproducing the scenario from issue #16899, where a table sorted on column1 
requests sorting on column1, column2. Before this PR it would use SortExec for 
full resort, after this PR it uses PartialSortExec for efficient partial 
sorting with common_prefix_length=1.
   
   ## Testing
   
   - All existing tests pass 
   - New optimization tests added and passing  
   - Manual verification using DataFusion CLI confirms the optimization works 
as expected
   - Performance improvement verified for the reported use case
   
   Closes #16899


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to