Re: [I] [MSE] Leverage pre-sorted pre-partitioned data layout: streaming merge join and sorted group-by aggregation [pinot]

via GitHub Tue, 16 Jun 2026 23:51:21 -0700


rohityadav1993 commented on issue #18667:
URL: https://github.com/apache/pinot/issues/18667#issuecomment-4726887758


   Thanks @gortiz for the pointers. 
   > Pinot doesn't enforce or even track order across segments.
   
   My assessment is same about cross segments sort order and the limitation of 
multi-column sort. 
   **Cross segment sorted stream**
   The idea is as you mentioned, k-way heap merge across segments and utilise 
sorted index whenever available and sort in memory otherwise as per the query 
requirement. Implementation wise, the leaf executor should switch between the 
new combine node vs the existing 
`MinMaxValueBasedSelectionOrderByCombineOperator`
   
   **Multi-column sort**
   As part of this work I also wanted to propose multi-column sort leveraged by 
Minions and storing segment metadata to detect the same. While the realtime 
segments can commit based on the sort order for uploaded segments or segments 
with older table config we can rely on minions to re-sort the segments.
   A runtime monotonicity assertion on the leaf output (fail the query if a key 
regresses) is a good safety measure rather than trusting the hint alone.
   
   >This same leaf combine directly accelerates plain ORDER BY too.
   
   Yes, this was a motivation as well. A few engines like ClickHouse already do 
this. I haven't done the analysis yet, but sorting on important columns should 
also help reduce the heap overhead of indexes.
   
   I'm in progress of implementing a 
`StreamingMinMaxValueBasedSelectionOrderByCombineOperator` based on a k-way 
heap merge, plus the `SortedMailboxReceiveOperator` TODO you pointed at. 
Currently I'm ironing out lazily reading from segments in the combine operator, 
to avoid materializing complete segments in memory beyond what's needed 
upstream.
   
   >The min/max + null interaction will quietly hurt this feature's 
effectiveness.
   
   Thanks for the catch — this will be needed. The merge itself stays correct 
on `isSorted` alone, so this degrades the pruning speedup rather than 
correctness. In practice the majority of target query shapes key on non-null 
columns (e.g. `correlation_id`), so I'd like to make the streaming-merge POC 
work first and address null-independent min/max as a fast follow. For 
multi-column sorts, restricting pruning to the leading column keeps this 
tractable.
   
   Aligned on tracking min/max over non-null values (or storing a separate 
non-null min/max).
   
   >Two more correctness notes for the operators:
   
   **Null handling.** At the MSE join layer nulls arrive as real Java `null`s, 
so the operator filters them before any key comparison and implements `NULL ≠ 
NULL` directly (null left rows are null-padded on LEFT, null right rows are 
dropped) — `compareValue` never sees a null. I think we'll need to make 
`SelectionPlanNode` null-aware here — it currently skips `isSorted` entirely 
when null handling is enabled, so I'll look at supporting a null-aware sorted 
path instead.
   
   **Multi-column / leading-column sortedness.** Agreed this is limited to the 
leading column today. I'd like to propose first-class multi-column segment sort 
(tied to the proposal in the first point) — is there historical context for 
Pinot supporting only a single sorted column? I've been drawing on ClickHouse's 
[primary 
key](https://clickhouse.com/docs/best-practices/choosing-a-primary-key) / 
sort-key design.
   
   >How do you plan to enable this at the query level?
   
   Opt-in via hints, for exactly the reason you noted — segment sortedness 
isn't known at planning time. For the join that's the new 
`join_strategy='sorted'` enum value; the planner only checks equi-keys + 
INNER/LEFT shape and trusts the asserted sortedness, backed by the runtime 
monotonicity assertion from the first point.
   For aggregation, `is_partitioned_by_group_by_keys` only asserts partitioning 
(it emits a DIRECT aggregate and skips the exchange, but stays hash-based), so 
streaming sorted aggregation would need a separate opt-in asserting sortedness 
on the group-by keys.
   
   ## Next step:
   I'm putting up a POC with the leaf-stage implementation; I'll apply the 
pointers above and share findings so we can go over the code in the next 
discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [MSE] Leverage pre-sorted pre-partitioned data layout: streaming merge join and sorted group-by aggregation [pinot]

Reply via email to