yashmayya opened a new pull request, #18658:
URL: https://github.com/apache/pinot/pull/18658
## Summary
Upgrades Apache Calcite from **1.40.0 to 1.42.0**. Pinot's master was never
moved to 1.41, so this folds in both the 1.40→1.41 and 1.41→1.42 deltas in a
single bump.
The bulk of the upgrade is a faithful re-sync of Pinot's customized SQL
parser grammar to upstream 1.42, plus a handful of targeted workarounds for
behavioral changes Calcite introduced across these two releases. No public
Pinot API or wire/segment format changes.
## Changes
### Dependency
- `pom.xml`: `calcite.version` → `1.42.0` (covers `calcite-core` and
`calcite-babel`).
- Pinned `joou-java-6` to `0.9.5` in `dependencyManagement` to resolve a
dependency-convergence conflict: `calcite-core` 1.42 pulls `joou-java-6` 0.9.5
while the transitive `avatica-core` 1.28.0 still wants 0.9.4.
- `LICENSE-binary`: bumped the Calcite/Avatica entries to 1.42.0 / 1.28.0,
added `org.jooq:joou-java-6:0.9.5`, and dropped the now-unbundled
`avatica-metrics` (Avatica 1.28 no longer pulls it).
### SQL parser codegen sync (`pinot-common/src/main/codegen`)
- Re-synced `templates/Parser.jj`, `config.fmpp`, and `default_config.fmpp`
to upstream Calcite 1.42, preserving every `PINOT CUSTOMIZATION` region.
- The new babel feature flags introduced upstream — `includeStarExclude`
(`SELECT * EXCLUDE/REPLACE`), `includeSelectBy` (`SELECT ... BY`), and
`includeIntervalWithoutQualifier` — are intentionally kept **OFF**. The grammar
is synced but inactive: the multi-stage engine has no downstream support for
these features, so enabling them would parse syntax the planner/runtime can't
execute. They are wired through `default_config.fmpp` so a future change can
flip them on deliberately.
- `UNSIGNED` is added as a non-reserved keyword (needed for the unsigned
integer types below).
### Behavioral workarounds for 1.41/1.42 changes
- **CALCITE-7189** (non-strict GROUP BY): 1.41+ BABEL enables MySQL-style
non-strict GROUP BY (wrapping non-grouped columns in `ANY_VALUE()`), but the
implementation NPEs when a window function is combined with GROUP BY (e.g.
`SELECT MIN(col) OVER() FROM t GROUP BY col`). `Validator` now uses a
`SqlDelegatingConformance` over BABEL that overrides `isNonStrictGroupBy()` to
`false`. This is also the semantically correct behavior for Pinot, which
requires all non-aggregated columns to appear in GROUP BY. The feature remains
present (un-reverted) in 1.42, so the override is retained.
- **CALCITE-7379** (decorrelation type assertion): the upstream fix does not
fully cover the correlated-subquery shapes Pinot produces — a
post-decorrelation `Litmus.THROW` type assertion still fires for a
*nullability-only* divergence. `PinotRelDecorrelator` (new, see below) relaxes
that one assertion: it logs a warning when the row types differ only in
nullability and continues, but still fails fast on any structural type change.
- **CALCITE-7351**:
`RelDataTypeSystem#getMaxNumericScale`/`getMaxNumericPrecision` became `final`.
`TypeSystem` drops the now-illegal overrides; the equivalent behavior is
preserved via the type-specific `getMaxScale`/`getMaxPrecision(DECIMAL)`
overrides Pinot already defines.
- **Filtered MIN/MAX nullability**: 1.42 exposes
`SqlOperatorBinding#hasEmptyGroup()`. `PinotMinMaxReturnTypeInference` now also
treats a possibly-empty group as nullable (alongside the existing
`getGroupCount() == 0` / `hasFilter()` checks), matching the runtime's
null-on-empty behavior for `MIN(x) FILTER (WHERE ...)`.
### Unsigned integer types (CALCITE-1466)
- BABEL now parses `UTINYINT/USMALLINT/UINTEGER/UBIGINT`. Pinot has no
native unsigned storage, so these are accepted and treated as their signed
equivalents everywhere a `SqlTypeName` switch handles integers:
- `RelToPlanNodeConverter` / v2 `PRelToPlanNodeConverter`: map to the
narrowest signed type that holds the full range — `UTINYINT`/`USMALLINT` →
`INT`, `UINTEGER`/`UBIGINT` → `LONG` (a signed `INT` would wrap `UINTEGER`
values above 2³¹).
- `PinotEvaluateLiteralRule`: fold a constant unsigned cast into its
signed-equivalent type — Calcite cannot build a `RexLiteral` of an unsigned
type, so without this `CAST(5 AS INTEGER UNSIGNED)` failed constant-folding.
- `TypeSystem.deriveSumType`: widen to (signed) `BIGINT` like the signed
integer types, so `SUM` over a sub-64-bit unsigned column accumulates into 64
bits instead of silently overflowing a 32-bit `INT`.
- `ArithmeticFunctionUtils.normalizeNumericType`: normalize to the signed
integral type so arithmetic over an unsigned operand stays integral rather than
widening to `DOUBLE`.
### New class
- `org.apache.pinot.calcite.sql2rel.PinotRelDecorrelator` — a minimal
subclass of Calcite's `RelDecorrelator` that exists solely to relax the
CALCITE-7379 assertion described above. It lives under the
`org.apache.calcite.sql2rel` package because the relevant members (`CorelMap`,
`decorrelate`, etc.) are package/protected-subclass visible in Calcite.
## Testing & validation
- `pinot-query-planner` unit suite: **1262/1262 pass**.
- `pinot-query-runtime` result-correctness vs H2 —
`ResourceBasedQueriesTest` **3571 pass / 0 fail** (6 pre-existing skips),
`QueryRunnerTest` **130/130**.
- `OfflineClusterIntegrationTest` run locally; updated
`testQueryWithRepeatedColumnsV2` to reflect that 1.42 now accepts repeated
columns in GROUP BY but still rejects ambiguous repeated columns in ORDER BY.
- New/updated regression tests pin every workaround: filtered MIN/MAX
nullability, the CALCITE-7379 decorrelation path (the structural-vs-nullability
divergence decision is extracted and unit-tested), the
non-strict-GROUP-BY-with-window NPE, unsigned-type cast acceptance,
`SUM`/arithmetic return types over unsigned operands, and single-stage unsigned
casts.
- The 8 `pinot-query-planner/src/test/resources/queries/*.json` EXPLAIN-plan
snapshots are mechanically regenerated (label/whitespace deltas from upstream
rule changes), not hand-edited.
## Behavior & compatibility notes
- **Repeated GROUP BY key now accepted (MSE):** under the multi-stage
engine, a query with a repeated grouping key (e.g. `SELECT x, COUNT(*) FROM t
GROUP BY x, x`) previously failed validation and now succeeds — Calcite 1.42
de-duplicates the repeated key. This is a benign relaxation (no
previously-working query breaks), but during a mixed-version rolling upgrade
the same query is rejected by a 1.40 broker and accepted by a 1.42 broker.
Repeated columns in `ORDER BY` are still rejected as ambiguous (covered by the
updated `testQueryWithRepeatedColumnsV2`).
- **Plan shape: semi → inner join in a few IN/EXISTS shapes.** The 1.41
subquery-decorrelation rework rewrites the outer semi-join to an inner join in
a small number of nested IN/semi-join shapes (visible in the regenerated
`JoinPlans.json`/`PhysicalOptimizerPlans.json`). This is a sound,
plan-equivalent rewrite — Calcite only applies it where the join's right input
is already distinct (e.g. fed by an aggregate) — and result-equivalence is
covered by the H2-comparison suites (`ResourceBasedQueriesTest`) and the
integration tests, which all pass.
- **New parseable bitwise infix operators (`<<`, `&`, `^`).** The faithful
parser sync makes these parseable for the first time (upstream-stock 1.42
additions to `BinaryRowOperator`, mapping to
`SqlStdOperatorTable.BIT_LEFT_SHIFT`/`BITAND_OPERATOR`/`BITXOR_OPERATOR`);
previously they were parse errors. No operator wiring is added by this PR.
End-to-end status differs by engine: in the **multi-stage engine**
`PinotOperatorTable` is a curated allow-list that does not register them, so
such expressions fail at operator resolution (validation) — same status as the
other synced-but-inert 1.42 grammar. In the **single-stage engine** the
canonical names of `&`/`^` (`bitand`/`bitxor`) coincide with Pinot's existing
`bitAnd`/`bitXor` scalar functions, so `a & b` / `a ^ b` may resolve to those
(a MySQL-style infix alias), while `<<` has no Pinot equivalent and errors.
This v1/v2 divergence is an inherent consequence of faithfully syncing the
upstream grammar; explicitly wiring up or rejec
ting these operators is left as a separate, deliberate decision.
- **Unsigned casts in the single-stage engine.** Because the single-stage
parser also uses BABEL, `CAST(x AS INTEGER UNSIGNED)` (and the other unsigned
types) is now parseable in v1 too. `DataTypeConversionFunctions.cast` maps the
unsigned type-literal to its signed equivalent (UTINYINT/USMALLINT → INT,
UINTEGER/UBIGINT → LONG), mirroring the multi-stage converter, so v1 and v2
treat unsigned casts consistently (previously the v1 path would have errored at
execution with "Unknown data type"). Covered by
`DataTypeConversionFunctionsTest#testCastToUnsignedTypes`.
- No config keys, SPI signatures, enum/DataType additions, JSON/Protobuf
fields, or DataTable/segment-version changes — nothing else has mixed-version
visibility.
## Notes for reviewers
- The synced-but-disabled grammar (EXCLUDE/REPLACE/SELECT-BY/bare-INTERVAL)
is deliberately inert — included to keep `Parser.jj` a faithful upstream sync
rather than a divergent fork. Flipping the three fmpp flags is a separate,
future decision.
- The upstream colon-path field-access grammar
(`AddOptionalColonPath`/`ColonBracketSegment`) was also synced and is likewise
inert under Pinot — but it is gated by the upstream conformance method
`SqlConformance.isColonFieldAccessAllowed()` (which returns `false` for BABEL),
**not** by an fmpp flag. These productions are byte-for-byte from upstream
Calcite 1.42.0 (verified against `calcite-1.42.0` `Parser.jj`), so they
intentionally carry no `PINOT CUSTOMIZATION` markers.
- All `PINOT CUSTOMIZATION` markers from the prior grammar are preserved.
- Oversized integer literal overflow (e.g. a `UBIGINT` literal above
`Long.MAX`) wraps modulo the target signed type, identical to the existing
single-stage behavior documented in `RequestUtils.getLiteral`. The final stored
value is the same regardless of where the narrowing happens, so the MSE literal
path is left consistent with single-stage rather than diverging.
- Follow-up (not in this PR): the unsigned→signed handling touches several
type-dispatch `switch`es (the two converters, `DataTypeConversionFunctions`,
`ArithmeticFunctionUtils.normalizeNumericType`, `TypeSystem.deriveSumType`).
Each maps to a different target enum and uses `case` labels (which must be
compile-time constants, so they can't delegate to a shared predicate), so the
per-switch listing is largely inherent. The one genuinely-collapsible
duplication is `RelToPlanNodeConverter.convertToColumnDataType` and the v2
`PRelToPlanNodeConverter.convertToColumnDataType`, which are byte-for-byte
identical `public static` copies — a pre-existing smell this diff merely
extends. Collapsing those two (have v2 delegate to v1) is worth a dedicated
refactor; left out here to keep the diff scoped to the version bump.
- The filtered MIN/MAX fix (`hasEmptyGroup()`) is a **planning-time**
type-nullability correction — it lets the query validate under 1.42's stricter
nullability checks. Pinot's `DataSchema`/`ColumnDataType` erases nullability,
so this does not change runtime values; the empty-filtered-group → NULL runtime
semantics are pre-existing and unchanged. The fix is covered by a compile-time
regression test (`testFilteredMinMaxAggregateNullability`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]