Dandandan opened a new pull request, #21336:
URL: https://github.com/apache/datafusion/pull/21336

   ## Which issue does this PR close?
   
   N/A — performance optimization.
   
   ## Rationale for this change
   
   For `ORDER BY <single_primitive_column> LIMIT K` queries, the current TopK 
operator converts every incoming row through Arrow's `RowConverter` into 
heap-allocated `Vec<u8>` byte slices, then compares those slices. This has 
three sources of overhead:
   
   1. **RowConverter batch encoding** — converts arrays to row format even 
though only a scalar comparison is needed
   2. **Heap allocation per row** — each `TopKRow` stores a `Vec<u8>` for the 
sort key
   3. **Byte-slice comparison** — variable-length memcmp instead of a single 
native comparison
   
   ## What changes are included in this PR?
   
   Adds a specialized native TopK path (`TopKInner::Native`) for single-column 
sorts on primitive types (integers, floats, dates, timestamps, durations):
   
   - **`NativeTopKRow`**: stores the sort key as an inline `u128` (no heap 
allocation) with order-preserving encoding that handles ASC/DESC and NULLS 
FIRST/LAST
   - **`NativeTopKHeap`**: `BinaryHeap<NativeTopKRow>` with the same 
compaction/emit logic as the existing `TopKHeap`
   - **`find_new_native_topk_items`**: type-dispatched inner loop that 
downcasts to `PrimitiveArray<T>` once per batch, then encodes and compares 
native values directly
   - **`TopK`** now holds a `TopKInner` enum (`Row` | `Native`), automatically 
selecting the native path in `try_new` when applicable
   
   The existing row-based path is unchanged for multi-column sorts and 
non-primitive types.
   
   ## Are these changes tested?
   
   - All existing TopK unit tests pass (including 
`test_topk_marks_filter_complete` which exercises the native path via single 
Int32 column)
   - All TopK SQL logic tests pass
   - New unit tests for encoding correctness: signed/unsigned integer ordering, 
float ordering (including NaN/infinity/negative zero), and all four 
combinations of ASC/DESC × NULLS FIRST/LAST
   
   ## Are there any user-facing changes?
   
   No — this is a transparent performance optimization. Query results are 
unchanged.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to