Re: [I] [MSE] Leverage pre-sorted pre-partitioned data layout: streaming merge join and sorted group-by aggregation [pinot]

via GitHub Fri, 12 Jun 2026 06:07:41 -0700


gortiz commented on issue #18667:
URL: https://github.com/apache/pinot/issues/18667#issuecomment-4691535362


   Thanks for the detailed write-up @rohityadav1993 — the motivation is solid 
and the POC plan/EXPLAIN output is really helpful. A few thoughts on the 
foundation this needs to stand on, plus one cross-cutting issue I think 
materially affects how well it will perform.
   
   **Pinot doesn't enforce or even track order across segments.**
   It's worth making the underlying data-layout reality explicit, because the 
whole feature hinges on it:
   
   - The `segmentsConfig` sorted column only applies to **realtime**, and only 
sorts the consuming segment *at commit time* — a per-segment, single-column 
property.
   - For **offline**, a column is physically sorted iff the producer fed 
pre-sorted rows. Pinot detects this and records `isSorted` per column in 
`ColumnMetadata` (and builds a sorted forward index), so it's *knowable per 
segment*, but it is never *enforced* or *guaranteed*.
   - **Between segments, nothing is ordered or recorded.** The only 
cross-segment signal is per-column min/max.
   - A segment can have at most one physically sorted column, so 
**multi-column** join/group-by keys can rely on physical sortedness for the 
leading column at best.
   
   The consequence is that the **planner can't prove the leaf emits sorted 
data** — collation traits are derived from the query, not from segment layout. 
The only way to *guarantee* sorted input is to inject a sort (which is what the 
POC's `LeafStageSortJoinRule` does), and an injected unbounded sort is exactly 
what produces the materialization blowup in Challenge #1. So I think the real 
foundation of this feature isn't the join/aggregation operators — it's a 
**runtime-adaptive leaf combine** that k-way merges the segments already sorted 
on the key (free ordered scan from the sorted forward index) and only sorts the 
stragglers in memory, emitting incrementally. A k-way heap merge handles "no 
inter-segment order" naturally — it only needs per-segment sortedness, which we 
do record. Everything downstream in your diagrams is comparatively mechanical 
once that exists. I'd also add a **runtime monotonicity assertion** (fail the 
query if a key goes backwards) rather than trusting a h
 int, since offline sortedness can't be backed up by the engine.
   
   **This same leaf combine directly accelerates plain `ORDER BY` too.**
   Worth calling out because it broadens the payoff and reinforces doing the 
combine first: a streaming sorted combine that merges per-segment sorted 
forward indexes and emits incrementally is exactly what `ORDER BY [col] [LIMIT 
n]` wants — today `MinMaxValueBasedSelectionOrderByCombineOperator` 
materializes a single block, and the same k-way-merge work in 
`SortedMailboxReceiveOperator` (which already has a TODO for it) feeds ORDER BY 
as well. So the leaf-combine investment pays off three ways — sort-merge join, 
sorted group-by, and ordinary ORDER BY — independent of each other.
   
   **The min/max + null interaction will quietly hurt this feature's 
effectiveness.**
   This is independent of the feature, but I want to flag it because it 
directly limits the payoff. Pinot stores nulls as the type **default** value, 
and for ascending order that default is typically the minimum 
(`Integer.MIN_VALUE`, `Long.MIN_VALUE`, etc.). That means:
   
   - A segment with **even a single null** in the key column gets its recorded 
**min dragged down to the default**, so its min/max range artificially overlaps 
essentially every other segment.
   - Any optimization that uses min/max to **order segments** for streaming 
emission, or to **prune/skip** segments (the MinMax pruning in 
`MinMaxValueBasedSelectionOrderByCombineOperator` that the proposal wants to 
preserve), becomes ineffective — segments stop being separable by range, 
forcing a full merge and defeating the skip.
   
   So the merge itself stays correct (it only needs `isSorted`), but the 
min/max-driven *speedups* this feature wants to lean on degrade sharply in the 
presence of nulls. **This feature would be substantially more effective if we 
made segment min/max independent of null values** — i.e. compute and store 
min/max over non-null values only (or track them separately), instead of 
letting the null-default sentinel pollute the range. That's a generally useful 
improvement (it also helps the existing min/max ORDER BY pruning) and I think 
it should be called out as a near-prerequisite for getting good performance 
here, rather than discovered after the fact.
   
   **Two more correctness notes for the operators:**
   - The merge comparator must consult the **null vector**, not the stored 
default, or nulls mis-sort; and equi-join is `NULL ≠ NULL`. Hash join gets this 
for free; a merge join has to implement NULLS ordering + non-matching 
explicitly.
   - Multi-column keys are constrained to leading-column physical sortedness — 
worth stating as an explicit limitation (single-column keys like the funnel's 
`correlation_id` are the clean case).
   
   **How do you plan to enable this at the query level?**
   The join path opts in via a new `join_strategy='sorted'` hint (a proposed 
value — today the `JoinStrategy` enum only has `HASH`/`LOOKUP`/`AS_OF`), but 
for the aggregation you describe it as "activated when the leaf stage 
advertises sorted collation on the group-by keys" — and 
`is_partitioned_by_group_by_keys` only asserts *partitioning*, not 
*sortedness*. Could you clarify the intended enablement model? Specifically: is 
it purely **hint-driven opt-in** (user asserts sortedness, engine trusts + 
runtime-verifies), or do you envision **auto-enabling when the table config 
declares a sorted column**? Given that Pinot can't prove cross-segment/offline 
sortedness, I'd lean toward explicit opt-in backed by the runtime monotonicity 
assertion as the safety net — but curious how you're thinking about it, since 
it determines the correctness contract.
   
   Happy to help on the leaf-combine piece — I think that plus the 
min/max-vs-null fix are the two things that determine whether this lands as a 
real win or a narrow one.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [MSE] Leverage pre-sorted pre-partitioned data layout: streaming merge join and sorted group-by aggregation [pinot]

Reply via email to