gortiz opened a new pull request, #18741:
URL: https://github.com/apache/pinot/pull/18741

   Contributes to #18740 (umbrella: cost-based optimization for the multi-stage 
query engine).
   
   This PR implements **Phase 1** of the umbrella issue: the broker statistics 
foundation and a
   gated, cost-based join-reorder phase for the multi-stage engine. Everything 
is **off by default**
   and broker-local (no wire-format or mixed-version impact). The commits are 
structured so the PR
   can be split into the stacked PRs listed in the umbrella issue if reviewers 
prefer.
   
   ## What's included
   
   **Statistics foundation**
   - Statistics contracts: `PinotStatisticsProvider`, `TableStatistics`, 
`ColumnStatistics`,
     `StatConfidence` (planner SPI) and `StatsStore` / `ColumnStatsSource` 
(broker SPIs).
   - SQLite-backed `StatsStore`: off-heap persistence (WAL, prepared 
statements, Flyway-migrated
     schema), crc-based restart reconciliation, corruption auto-recovery, purge.
   - Broker T0 stats collection from the ZK segment metadata the broker already 
watches
     (`pinot.broker.stats.enabled`, default **false**): per-segment row count, 
size and time bounds
     via a `SegmentZkMetadataFetchListener`, with no extra ZK reads on the 
query path.
   - Table-type semantics so stats are never silently wrong: hybrid 
OFFLINE+REALTIME merge at the
     broker time boundary (no double counting), upsert/dedup row counts marked 
LOW confidence
     (physical docs over-count logical rows), consuming segments downgrade 
confidence. The planner
     treats LOW/UNKNOWN confidence as "no stats".
   
   **Planner integration**
   - `PinotTable#getStatistic()` (memoized) exposes row counts to Calcite; 
statistics provider is
     threaded through `QueryEnvironment.Config` (default no-op ⇒ zero behavior 
change).
   - A chained `RelMetadataProvider` with stats-backed row counts and 
selectivity, including
     time-range selectivity computed from segment time boundaries (the primary 
time column from the
     table config, applied only when its unit is epoch millis).
   - A rows-dominated `RelOptCost` implementation with a deterministic total 
order.
   
   **Cost-based join reordering** (`useJoinReorder` query option, default 
**false**)
   - Runs between the logical Hep phases and the trait/physical phases; drives
     `JOIN_TO_MULTI_JOIN` + `MULTI_JOIN_OPTIMIZE` (with project/filter merge) 
off the
     statistics-backed row counts.
   - Strict gates: inner joins only, no hinted joins, every scan must have a 
trusted row count,
     configurable join-count cap (`joinReorderMaxJoins`, default 10), and any 
failure falls back to
     the un-reordered plan (a reorder problem must never fail a query).
   
   **Independent fixes found along the way** (can be extracted to a separate PR 
on request)
   - `-configFile` was silently dropped by quickstarts overriding 
`getConfigOverrides()`
     (`MultistageEngineQuickStart`, `RealtimeQuickStart`, 
`TimeSeriesEngineQuickStart`).
   - Multiple `-bootstrapTableDir` directories were not bootstrapped despite 
the option declaring
     arity `1..*`.
   
   ## Measured impact
   
   TPC-H SF=1 (quickstart, 150 samples per variant, median with bootstrap 95% 
CI), queries written
   in a poor syntactic join order:
   
   | query | literal SQL order | with `useJoinReorder` | hand-optimized SQL |
   |---|---|---|---|
   | `customer⋈orders⋈lineitem` (filtered) | 549 ms [516, 570] | 462 ms [438, 
479] | 412 ms [390, 430] |
   | `region⋈nation⋈supplier⋈lineitem` (filtered) | 253 ms [244, 277] | **80 ms 
[80, 81]** | 207 ms [204, 211] |
   
   The second query's reordered plan (reduce the dimension chain first, single 
probe over the fact
   table) beats even the hand-optimized left-deep SQL by 2.6×. Results are 
identical across all
   variants; with the option disabled, plans are byte-identical to master 
(verified against the
   full plan-resource test suite).
   
   ## Testing
   
   - ~110 new unit tests across the stats store (incl. corruption recovery and 
concurrency
     semantics), table-type stat semantics, metadata provider/selectivity 
(incl. bound-conversion
     regression tests), cost ordering, and plan-level join-reorder tests 
(EXPLAIN diffs, hint veto,
     join-cap fallback, option plumbing).
   - Full `pinot-query-planner` module green (1339 tests), including the 574 
resource-based plan
     tests confirming default-off changes nothing.
   
   ## Notes for reviewers
   
   - Enabling `pinot.broker.stats.enabled` changes cardinality **estimates** 
visible to all MSE
     planner rules (documented on the config key); it never affects 
correctness. The join-reorder
     phase is additionally gated by its own query option.
   - New broker dependencies: `org.xerial:sqlite-jdbc`, 
`org.flywaydb:flyway-core` (8.x — the last
     line with built-in SQLite support). `LICENSE-binary` updated.
   - `StatsStore` is an interface; SQLite is the default implementation and the 
storage engine is
     swappable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to