rohityadav1993 commented on issue #18667: URL: https://github.com/apache/pinot/issues/18667#issuecomment-4726887758
Thanks @gortiz for the pointers. > Pinot doesn't enforce or even track order across segments. My assessment is same about cross segments sort order and the limitation of multi-column sort. **Cross segment sorted stream** The idea is as you mentioned, k-way heap merge across segments and utilise sorted index whenever available and sort in memory otherwise as per the query requirement. Implementation wise, the leaf executor should switch between the new combine node vs the existing `MinMaxValueBasedSelectionOrderByCombineOperator` **Multi-column sort** As part of this work I also wanted to propose multi-column sort leveraged by Minions and storing segment metadata to detect the same. While the realtime segments can commit based on the sort order for uploaded segments or segments with older table config we can rely on minions to re-sort the segments. A runtime monotonicity assertion on the leaf output (fail the query if a key regresses) is a good safety measure rather than trusting the hint alone. >This same leaf combine directly accelerates plain ORDER BY too. Yes, this was a motivation as well. A few engines like ClickHouse already do this. I haven't done the analysis yet, but sorting on important columns should also help reduce the heap overhead of indexes. I'm in progress of implementing a `StreamingMinMaxValueBasedSelectionOrderByCombineOperator` based on a k-way heap merge, plus the `SortedMailboxReceiveOperator` TODO you pointed at. Currently I'm ironing out lazily reading from segments in the combine operator, to avoid materializing complete segments in memory beyond what's needed upstream. >The min/max + null interaction will quietly hurt this feature's effectiveness. Thanks for the catch — this will be needed. The merge itself stays correct on `isSorted` alone, so this degrades the pruning speedup rather than correctness. In practice the majority of target query shapes key on non-null columns (e.g. `correlation_id`), so I'd like to make the streaming-merge POC work first and address null-independent min/max as a fast follow. For multi-column sorts, restricting pruning to the leading column keeps this tractable. Aligned on tracking min/max over non-null values (or storing a separate non-null min/max). >Two more correctness notes for the operators: **Null handling.** At the MSE join layer nulls arrive as real Java `null`s, so the operator filters them before any key comparison and implements `NULL ≠ NULL` directly (null left rows are null-padded on LEFT, null right rows are dropped) — `compareValue` never sees a null. I think we'll need to make `SelectionPlanNode` null-aware here — it currently skips `isSorted` entirely when null handling is enabled, so I'll look at supporting a null-aware sorted path instead. **Multi-column / leading-column sortedness.** Agreed this is limited to the leading column today. I'd like to propose first-class multi-column segment sort (tied to the proposal in the first point) — is there historical context for Pinot supporting only a single sorted column? I've been drawing on ClickHouse's [primary key](https://clickhouse.com/docs/best-practices/choosing-a-primary-key) / sort-key design. >How do you plan to enable this at the query level? Opt-in via hints, for exactly the reason you noted — segment sortedness isn't known at planning time. For the join that's the new `join_strategy='sorted'` enum value; the planner only checks equi-keys + INNER/LEFT shape and trusts the asserted sortedness, backed by the runtime monotonicity assertion from the first point. For aggregation, `is_partitioned_by_group_by_keys` only asserts partitioning (it emits a DIRECT aggregate and skips the exchange, but stays hash-based), so streaming sorted aggregation would need a separate opt-in asserting sortedness on the group-by keys. ## Next step: I'm putting up a POC with the leaf-stage implementation; I'll apply the pointers above and share findings so we can go over the code in the next discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
