siddharthteotia opened a new issue, #18568:
URL: https://github.com/apache/pinot/issues/18568
## Summary
With `SET enableNullHandling = true`, queries that use
`jsonExtractScalar(...)` (no default value argument) over JSON columns where
some rows have no value at the queried path fail with
`IllegalArgumentException("Cannot resolve JSON path on some records. Consider
setting a default value.")`. The same shape with `jsonExtractIndex(...)`
likewise throws (`RuntimeException("Cannot resolve JSON path...")`).
Both `JsonExtractScalarTransformFunction` and
`JsonExtractIndexTransformFunction` have `_nullHandlingEnabled` plumbed through
`init()` but the six typed SV methods
(`transformTo{Int,Long,Float,Double,BigDecimal,String}ValuesSV`) ignore it and
throw unconditionally.
Meanwhile, `JsonIndexDistinctOperator` (which `SELECT DISTINCT
jsonExtractIndex(...)` routes through when a JSON index is present) does honor
null handling — it folds unresolved rows into a single SQL `NULL` group via
`handleMissingDocs`. This creates a real divergence: same logical query,
different outcomes based on whether the DISTINCT operator route is eligible.
## Fixture used throughout
A `clicks` table with a single JSON column:
```
row 0: {"country": "US", "clicks": 5}
row 1: {"country": "CA", "clicks": 3}
row 2: {"country": null, "clicks": null}
row 3: {}
```
JSON index is present on `json`.
## Today on master — `enableNullHandling = true`
| # | Query shape | Default value | `jsonExtractScalar(json, '\$.country',
'STRING'[, …])` | `jsonExtractIndex(json, '\$.country', 'STRING'[, …])` |
Consistent? |
|---|---|---|---|---|---|
| 1 | **Projection**<br>`SELECT <expr> FROM clicks` | none | throws
`IllegalArgumentException("Cannot resolve JSON path on some
records…")`<br>*scalar transform fn* | throws `RuntimeException("Cannot resolve
JSON path…")`<br>*index transform fn* | ✓ both throw |
| 2 | **DISTINCT**<br>`SELECT DISTINCT <expr> FROM clicks` | none | throws
(same scalar transform path as row 1) | returns `"US"`, `"CA"`, `NULL`<br>*via
`JsonIndexDistinctOperator.handleMissingDocs` → `addNull()`* | ✗
**inconsistent** |
| 3 | **GROUP BY + COUNT**<br>`SELECT <expr> AS c, COUNT(*) FROM clicks
GROUP BY c` | none | throws (same scalar transform path as row 1) | throws
(same index transform path as row 1) | ✓ both throw |
| 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`,
`"foobar"` | returns `"US"`, `"CA"`, `"foobar"`, `"foobar"` | ✓ |
| 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"` |
returns `"US"`, `"CA"`, `"foobar"`<br>*via
`JsonIndexDistinctOperator.handleMissingDocs` →
`addValueToDistinctTable("foobar", …)`* | ✓ |
| 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`,
`("foobar",2)` | returns `("CA",1)`, `("US",1)`, `("foobar",2)` | ✓ |
## After fix — `enableNullHandling = true`
Both `JsonExtractScalarTransformFunction` and
`JsonExtractIndexTransformFunction` patched to honor `_nullHandlingEnabled`
when no default literal is supplied.
| # | Query shape | Default value | `jsonExtractScalar(json, '\$.country',
'STRING'[, …])` | `jsonExtractIndex(json, '\$.country', 'STRING'[, …])` |
Consistent? |
|---|---|---|---|---|---|
| 1 | **Projection** | none | returns `"US"`, `"CA"`, `NULL`,
`NULL`<br>*scalar transform fn — new null-emit branch* | returns `"US"`,
`"CA"`, `NULL`, `NULL`<br>*index transform fn — new null-emit branch* | ✓ |
| 2 | **DISTINCT** | none | returns `"US"`, `"CA"`, `NULL`<br>*scalar SV
result (with null bitmap) deduped by generic distinct operator* | returns
`"US"`, `"CA"`, `NULL`<br>*via `JsonIndexDistinctOperator.handleMissingDocs` →
`addNull()` (unchanged)* | ✓ |
| 3 | **GROUP BY + COUNT** | none | returns `("CA",1)`, `("US",1)`,
`(NULL,2)`<br>*scalar SV result (with null bitmap), then standard GROUP BY* |
returns `("CA",1)`, `("US",1)`, `(NULL,2)`<br>*index transform fn SV result
(with null bitmap), then standard GROUP BY* | ✓ |
| 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`,
`"foobar"`<br>*unchanged — default-value branch* | returns `"US"`, `"CA"`,
`"foobar"`, `"foobar"`<br>*unchanged — default-value branch* | ✓ |
| 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`,
`"foobar"`<br>*unchanged* | returns `"US"`, `"CA"`, `"foobar"`<br>*unchanged* |
✓ |
| 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`,
`("foobar",2)`<br>*unchanged* | returns `("CA",1)`, `("US",1)`,
`("foobar",2)`<br>*unchanged* | ✓ |
**Change summary (NH = ON):** rows 1–3 in both function-name columns flip
from `throws` to a SQL `NULL` (or null group / null row). Rows 4–6 are
unchanged. Every cell is consistent across both function names post-fix.
## Today on master — `enableNullHandling = false`
| # | Query shape | Default value | `jsonExtractScalar(...)` |
`jsonExtractIndex(...)` | Consistent? |
|---|---|---|---|---|---|
| 1 | **Projection** | none | throws `IllegalArgumentException` | throws
`RuntimeException` (index transform fn) | ✓ both throw |
| 2 | **DISTINCT** | none | throws (scalar transform) | throws
`RuntimeException("Illegal Json Path: [...], for some docIds in segment
[...]")`<br>*via `JsonIndexDistinctOperator.handleMissingDocs` — NH-off,
no-default branch* | ✓ both throw |
| 3 | **GROUP BY** | none | throws (scalar transform) | throws (index
transform) | ✓ both throw |
| 4 | **Projection** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"`,
`"foobar"` | returns `"US"`, `"CA"`, `"foobar"`, `"foobar"` | ✓ |
| 5 | **DISTINCT** | `'foobar'` | returns `"US"`, `"CA"`, `"foobar"` |
returns `"US"`, `"CA"`, `"foobar"` | ✓ |
| 6 | **GROUP BY** | `'foobar'` | returns `("CA",1)`, `("US",1)`,
`("foobar",2)` | returns `("CA",1)`, `("US",1)`, `("foobar",2)` | ✓ |
With NH off, all cells are already consistent on master — the fix preserves
the legacy throw for the no-default case (gated on `_nullHandlingEnabled`).
## Proposed fix
In both `JsonExtractScalarTransformFunction` and
`JsonExtractIndexTransformFunction`:
1. When `_nullHandlingEnabled = true` and no default value was supplied:
emit SQL `NULL` (write the type's null placeholder + mark the row in
`getNullBitmap()`) for unresolved rows, instead of throwing.
2. When `_nullHandlingEnabled = false`: preserve the legacy throw.
3. MV transforms unchanged.
4. `JsonIndexDistinctOperator` already handles this correctly; no change
needed.
## Out of scope (separate follow-up)
- **4-arg with SQL `NULL` literal as default:**
`LiteralContext.getStringValue()` returns `""` for SQL `NULL`, causing
`JsonIndexDistinctOperator` to insert an empty string instead of a null group.
Separate, pre-existing inconsistency.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]