gortiz commented on issue #18667: URL: https://github.com/apache/pinot/issues/18667#issuecomment-4691535362
Thanks for the detailed write-up @rohityadav1993 — the motivation is solid and the POC plan/EXPLAIN output is really helpful. A few thoughts on the foundation this needs to stand on, plus one cross-cutting issue I think materially affects how well it will perform. **Pinot doesn't enforce or even track order across segments.** It's worth making the underlying data-layout reality explicit, because the whole feature hinges on it: - The `segmentsConfig` sorted column only applies to **realtime**, and only sorts the consuming segment *at commit time* — a per-segment, single-column property. - For **offline**, a column is physically sorted iff the producer fed pre-sorted rows. Pinot detects this and records `isSorted` per column in `ColumnMetadata` (and builds a sorted forward index), so it's *knowable per segment*, but it is never *enforced* or *guaranteed*. - **Between segments, nothing is ordered or recorded.** The only cross-segment signal is per-column min/max. - A segment can have at most one physically sorted column, so **multi-column** join/group-by keys can rely on physical sortedness for the leading column at best. The consequence is that the **planner can't prove the leaf emits sorted data** — collation traits are derived from the query, not from segment layout. The only way to *guarantee* sorted input is to inject a sort (which is what the POC's `LeafStageSortJoinRule` does), and an injected unbounded sort is exactly what produces the materialization blowup in Challenge #1. So I think the real foundation of this feature isn't the join/aggregation operators — it's a **runtime-adaptive leaf combine** that k-way merges the segments already sorted on the key (free ordered scan from the sorted forward index) and only sorts the stragglers in memory, emitting incrementally. A k-way heap merge handles "no inter-segment order" naturally — it only needs per-segment sortedness, which we do record. Everything downstream in your diagrams is comparatively mechanical once that exists. I'd also add a **runtime monotonicity assertion** (fail the query if a key goes backwards) rather than trusting a h int, since offline sortedness can't be backed up by the engine. **This same leaf combine directly accelerates plain `ORDER BY` too.** Worth calling out because it broadens the payoff and reinforces doing the combine first: a streaming sorted combine that merges per-segment sorted forward indexes and emits incrementally is exactly what `ORDER BY [col] [LIMIT n]` wants — today `MinMaxValueBasedSelectionOrderByCombineOperator` materializes a single block, and the same k-way-merge work in `SortedMailboxReceiveOperator` (which already has a TODO for it) feeds ORDER BY as well. So the leaf-combine investment pays off three ways — sort-merge join, sorted group-by, and ordinary ORDER BY — independent of each other. **The min/max + null interaction will quietly hurt this feature's effectiveness.** This is independent of the feature, but I want to flag it because it directly limits the payoff. Pinot stores nulls as the type **default** value, and for ascending order that default is typically the minimum (`Integer.MIN_VALUE`, `Long.MIN_VALUE`, etc.). That means: - A segment with **even a single null** in the key column gets its recorded **min dragged down to the default**, so its min/max range artificially overlaps essentially every other segment. - Any optimization that uses min/max to **order segments** for streaming emission, or to **prune/skip** segments (the MinMax pruning in `MinMaxValueBasedSelectionOrderByCombineOperator` that the proposal wants to preserve), becomes ineffective — segments stop being separable by range, forcing a full merge and defeating the skip. So the merge itself stays correct (it only needs `isSorted`), but the min/max-driven *speedups* this feature wants to lean on degrade sharply in the presence of nulls. **This feature would be substantially more effective if we made segment min/max independent of null values** — i.e. compute and store min/max over non-null values only (or track them separately), instead of letting the null-default sentinel pollute the range. That's a generally useful improvement (it also helps the existing min/max ORDER BY pruning) and I think it should be called out as a near-prerequisite for getting good performance here, rather than discovered after the fact. **Two more correctness notes for the operators:** - The merge comparator must consult the **null vector**, not the stored default, or nulls mis-sort; and equi-join is `NULL ≠ NULL`. Hash join gets this for free; a merge join has to implement NULLS ordering + non-matching explicitly. - Multi-column keys are constrained to leading-column physical sortedness — worth stating as an explicit limitation (single-column keys like the funnel's `correlation_id` are the clean case). **How do you plan to enable this at the query level?** The join path opts in via a new `join_strategy='sorted'` hint (a proposed value — today the `JoinStrategy` enum only has `HASH`/`LOOKUP`/`AS_OF`), but for the aggregation you describe it as "activated when the leaf stage advertises sorted collation on the group-by keys" — and `is_partitioned_by_group_by_keys` only asserts *partitioning*, not *sortedness*. Could you clarify the intended enablement model? Specifically: is it purely **hint-driven opt-in** (user asserts sortedness, engine trusts + runtime-verifies), or do you envision **auto-enabling when the table config declares a sorted column**? Given that Pinot can't prove cross-segment/offline sortedness, I'd lean toward explicit opt-in backed by the runtime monotonicity assertion as the safety net — but curious how you're thinking about it, since it determines the correctness contract. Happy to help on the leaf-combine piece — I think that plus the min/max-vs-null fix are the two things that determine whether this lands as a real win or a narrow one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
