yashmayya opened a new pull request, #18658:
URL: https://github.com/apache/pinot/pull/18658

   ## Summary
   
   Upgrades Apache Calcite from **1.40.0 to 1.42.0**. Pinot's master was never 
moved to 1.41, so this folds in both the 1.40→1.41 and 1.41→1.42 deltas in a 
single bump.
   
   The bulk of the upgrade is a faithful re-sync of Pinot's customized SQL 
parser grammar to upstream 1.42, plus a handful of targeted workarounds for 
behavioral changes Calcite introduced across these two releases. No public 
Pinot API or wire/segment format changes.
   
   ## Changes
   
   ### Dependency
   - `pom.xml`: `calcite.version` → `1.42.0` (covers `calcite-core` and 
`calcite-babel`).
   - Pinned `joou-java-6` to `0.9.5` in `dependencyManagement` to resolve a 
dependency-convergence conflict: `calcite-core` 1.42 pulls `joou-java-6` 0.9.5 
while the transitive `avatica-core` 1.28.0 still wants 0.9.4.
   - `LICENSE-binary`: bumped the Calcite/Avatica entries to 1.42.0 / 1.28.0, 
added `org.jooq:joou-java-6:0.9.5`, and dropped the now-unbundled 
`avatica-metrics` (Avatica 1.28 no longer pulls it).
   
   ### SQL parser codegen sync (`pinot-common/src/main/codegen`)
   - Re-synced `templates/Parser.jj`, `config.fmpp`, and `default_config.fmpp` 
to upstream Calcite 1.42, preserving every `PINOT CUSTOMIZATION` region.
   - The new babel feature flags introduced upstream — `includeStarExclude` 
(`SELECT * EXCLUDE/REPLACE`), `includeSelectBy` (`SELECT ... BY`), and 
`includeIntervalWithoutQualifier` — are intentionally kept **OFF**. The grammar 
is synced but inactive: the multi-stage engine has no downstream support for 
these features, so enabling them would parse syntax the planner/runtime can't 
execute. They are wired through `default_config.fmpp` so a future change can 
flip them on deliberately.
   - `UNSIGNED` is added as a non-reserved keyword (needed for the unsigned 
integer types below).
   
   ### Behavioral workarounds for 1.41/1.42 changes
   - **CALCITE-7189** (non-strict GROUP BY): 1.41+ BABEL enables MySQL-style 
non-strict GROUP BY (wrapping non-grouped columns in `ANY_VALUE()`), but the 
implementation NPEs when a window function is combined with GROUP BY (e.g. 
`SELECT MIN(col) OVER() FROM t GROUP BY col`). `Validator` now uses a 
`SqlDelegatingConformance` over BABEL that overrides `isNonStrictGroupBy()` to 
`false`. This is also the semantically correct behavior for Pinot, which 
requires all non-aggregated columns to appear in GROUP BY. The feature remains 
present (un-reverted) in 1.42, so the override is retained.
   - **CALCITE-7379** (decorrelation type assertion): the upstream fix does not 
fully cover the correlated-subquery shapes Pinot produces — a 
post-decorrelation `Litmus.THROW` type assertion still fires for a 
*nullability-only* divergence. `PinotRelDecorrelator` (new, see below) relaxes 
that one assertion: it logs a warning when the row types differ only in 
nullability and continues, but still fails fast on any structural type change.
   - **CALCITE-7351**: 
`RelDataTypeSystem#getMaxNumericScale`/`getMaxNumericPrecision` became `final`. 
`TypeSystem` drops the now-illegal overrides; the equivalent behavior is 
preserved via the type-specific `getMaxScale`/`getMaxPrecision(DECIMAL)` 
overrides Pinot already defines.
   - **Filtered MIN/MAX nullability**: 1.42 exposes 
`SqlOperatorBinding#hasEmptyGroup()`. `PinotMinMaxReturnTypeInference` now also 
treats a possibly-empty group as nullable (alongside the existing 
`getGroupCount() == 0` / `hasFilter()` checks), matching the runtime's 
null-on-empty behavior for `MIN(x) FILTER (WHERE ...)`.
   
   ### Unsigned integer types (CALCITE-1466)
   - BABEL now parses `UTINYINT/USMALLINT/UINTEGER/UBIGINT`. Pinot has no 
native unsigned storage, so these are accepted and treated as their signed 
equivalents everywhere a `SqlTypeName` switch handles integers:
     - `RelToPlanNodeConverter` / v2 `PRelToPlanNodeConverter`: map to the 
narrowest signed type that holds the full range — `UTINYINT`/`USMALLINT` → 
`INT`, `UINTEGER`/`UBIGINT` → `LONG` (a signed `INT` would wrap `UINTEGER` 
values above 2³¹).
     - `PinotEvaluateLiteralRule`: fold a constant unsigned cast into its 
signed-equivalent type — Calcite cannot build a `RexLiteral` of an unsigned 
type, so without this `CAST(5 AS INTEGER UNSIGNED)` failed constant-folding.
     - `TypeSystem.deriveSumType`: widen to (signed) `BIGINT` like the signed 
integer types, so `SUM` over a sub-64-bit unsigned column accumulates into 64 
bits instead of silently overflowing a 32-bit `INT`.
     - `ArithmeticFunctionUtils.normalizeNumericType`: normalize to the signed 
integral type so arithmetic over an unsigned operand stays integral rather than 
widening to `DOUBLE`.
   
   ### New class
   - `org.apache.pinot.calcite.sql2rel.PinotRelDecorrelator` — a minimal 
subclass of Calcite's `RelDecorrelator` that exists solely to relax the 
CALCITE-7379 assertion described above. It lives under the 
`org.apache.calcite.sql2rel` package because the relevant members (`CorelMap`, 
`decorrelate`, etc.) are package/protected-subclass visible in Calcite.
   
   ## Testing & validation
   - `pinot-query-planner` unit suite: **1262/1262 pass**.
   - `pinot-query-runtime` result-correctness vs H2 — 
`ResourceBasedQueriesTest` **3571 pass / 0 fail** (6 pre-existing skips), 
`QueryRunnerTest` **130/130**.
   - `OfflineClusterIntegrationTest` run locally; updated 
`testQueryWithRepeatedColumnsV2` to reflect that 1.42 now accepts repeated 
columns in GROUP BY but still rejects ambiguous repeated columns in ORDER BY.
   - New/updated regression tests pin every workaround: filtered MIN/MAX 
nullability, the CALCITE-7379 decorrelation path (the structural-vs-nullability 
divergence decision is extracted and unit-tested), the 
non-strict-GROUP-BY-with-window NPE, unsigned-type cast acceptance, 
`SUM`/arithmetic return types over unsigned operands, and single-stage unsigned 
casts.
   - The 8 `pinot-query-planner/src/test/resources/queries/*.json` EXPLAIN-plan 
snapshots are mechanically regenerated (label/whitespace deltas from upstream 
rule changes), not hand-edited.
   
   ## Behavior & compatibility notes
   - **Repeated GROUP BY key now accepted (MSE):** under the multi-stage 
engine, a query with a repeated grouping key (e.g. `SELECT x, COUNT(*) FROM t 
GROUP BY x, x`) previously failed validation and now succeeds — Calcite 1.42 
de-duplicates the repeated key. This is a benign relaxation (no 
previously-working query breaks), but during a mixed-version rolling upgrade 
the same query is rejected by a 1.40 broker and accepted by a 1.42 broker. 
Repeated columns in `ORDER BY` are still rejected as ambiguous (covered by the 
updated `testQueryWithRepeatedColumnsV2`).
   - **Plan shape: semi → inner join in a few IN/EXISTS shapes.** The 1.41 
subquery-decorrelation rework rewrites the outer semi-join to an inner join in 
a small number of nested IN/semi-join shapes (visible in the regenerated 
`JoinPlans.json`/`PhysicalOptimizerPlans.json`). This is a sound, 
plan-equivalent rewrite — Calcite only applies it where the join's right input 
is already distinct (e.g. fed by an aggregate) — and result-equivalence is 
covered by the H2-comparison suites (`ResourceBasedQueriesTest`) and the 
integration tests, which all pass.
   - **New parseable bitwise infix operators (`<<`, `&`, `^`).** The faithful 
parser sync makes these parseable for the first time (upstream-stock 1.42 
additions to `BinaryRowOperator`, mapping to 
`SqlStdOperatorTable.BIT_LEFT_SHIFT`/`BITAND_OPERATOR`/`BITXOR_OPERATOR`); 
previously they were parse errors. No operator wiring is added by this PR. 
End-to-end status differs by engine: in the **multi-stage engine** 
`PinotOperatorTable` is a curated allow-list that does not register them, so 
such expressions fail at operator resolution (validation) — same status as the 
other synced-but-inert 1.42 grammar. In the **single-stage engine** the 
canonical names of `&`/`^` (`bitand`/`bitxor`) coincide with Pinot's existing 
`bitAnd`/`bitXor` scalar functions, so `a & b` / `a ^ b` may resolve to those 
(a MySQL-style infix alias), while `<<` has no Pinot equivalent and errors. 
This v1/v2 divergence is an inherent consequence of faithfully syncing the 
upstream grammar; explicitly wiring up or rejec
 ting these operators is left as a separate, deliberate decision.
   - **Unsigned casts in the single-stage engine.** Because the single-stage 
parser also uses BABEL, `CAST(x AS INTEGER UNSIGNED)` (and the other unsigned 
types) is now parseable in v1 too. `DataTypeConversionFunctions.cast` maps the 
unsigned type-literal to its signed equivalent (UTINYINT/USMALLINT → INT, 
UINTEGER/UBIGINT → LONG), mirroring the multi-stage converter, so v1 and v2 
treat unsigned casts consistently (previously the v1 path would have errored at 
execution with "Unknown data type"). Covered by 
`DataTypeConversionFunctionsTest#testCastToUnsignedTypes`.
   - No config keys, SPI signatures, enum/DataType additions, JSON/Protobuf 
fields, or DataTable/segment-version changes — nothing else has mixed-version 
visibility.
   
   ## Notes for reviewers
   - The synced-but-disabled grammar (EXCLUDE/REPLACE/SELECT-BY/bare-INTERVAL) 
is deliberately inert — included to keep `Parser.jj` a faithful upstream sync 
rather than a divergent fork. Flipping the three fmpp flags is a separate, 
future decision.
   - The upstream colon-path field-access grammar 
(`AddOptionalColonPath`/`ColonBracketSegment`) was also synced and is likewise 
inert under Pinot — but it is gated by the upstream conformance method 
`SqlConformance.isColonFieldAccessAllowed()` (which returns `false` for BABEL), 
**not** by an fmpp flag. These productions are byte-for-byte from upstream 
Calcite 1.42.0 (verified against `calcite-1.42.0` `Parser.jj`), so they 
intentionally carry no `PINOT CUSTOMIZATION` markers.
   - All `PINOT CUSTOMIZATION` markers from the prior grammar are preserved.
   - Oversized integer literal overflow (e.g. a `UBIGINT` literal above 
`Long.MAX`) wraps modulo the target signed type, identical to the existing 
single-stage behavior documented in `RequestUtils.getLiteral`. The final stored 
value is the same regardless of where the narrowing happens, so the MSE literal 
path is left consistent with single-stage rather than diverging.
   - Follow-up (not in this PR): the unsigned→signed handling touches several 
type-dispatch `switch`es (the two converters, `DataTypeConversionFunctions`, 
`ArithmeticFunctionUtils.normalizeNumericType`, `TypeSystem.deriveSumType`). 
Each maps to a different target enum and uses `case` labels (which must be 
compile-time constants, so they can't delegate to a shared predicate), so the 
per-switch listing is largely inherent. The one genuinely-collapsible 
duplication is `RelToPlanNodeConverter.convertToColumnDataType` and the v2 
`PRelToPlanNodeConverter.convertToColumnDataType`, which are byte-for-byte 
identical `public static` copies — a pre-existing smell this diff merely 
extends. Collapsing those two (have v2 delegate to v1) is worth a dedicated 
refactor; left out here to keep the diff scoped to the version bump.
   - The filtered MIN/MAX fix (`hasEmptyGroup()`) is a **planning-time** 
type-nullability correction — it lets the query validate under 1.42's stricter 
nullability checks. Pinot's `DataSchema`/`ColumnDataType` erases nullability, 
so this does not change runtime values; the empty-filtered-group → NULL runtime 
semantics are pre-existing and unchanged. The fix is covered by a compile-time 
regression test (`testFilteredMinMaxAggregateNullability`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to