Jackie-Jiang opened a new pull request, #18496:
URL: https://github.com/apache/pinot/pull/18496
## Summary
When a forward-index update runs against a pre-1.6.0 segment, the
per-element length stats added in 1.6.0 (`LENGTH_OF_SHORTEST_ELEMENT`,
`LENGTH_OF_LONGEST_ELEMENT`, `MAX_ROW_LENGTH_IN_BYTES` for MV, `IS_ASCII` for
STRING) may be missing from `ColumnMetadata`. Today the per-op handlers in
`ForwardIndexHandler` cover for that with two ad-hoc scan paths, and the
`InvertedIndexAndDictionaryBasedForwardIndexCreator` rebuild path doesn't
backfill at all. After this PR the missing stats are gathered once during the
rebuild and persisted to `metadata.properties`, so future reloads see a
complete column metadata.
### `ForwardIndexHandler`
- `updateIndices` runs a per-column `backfillMissingLengthStatsForColumn`
pre-pass before dispatching ops. The helper opens the forward-index reader,
scans the column once, and persists the gathered stats. No-op for fixed-width
columns, columns whose metadata already carries the stats, or columns whose
forward index is absent on disk.
- The reader-driven scan in `rewriteForwardIndexForCompressionChange` and
the metadata-driven scan in `rewriteDictToRawForwardIndex` are removed. Per-op
methods construct their `IndexCreationContext` from `ColumnMetadata` directly —
the pre-pass guarantees the lengths are populated.
- The `IndexCreationContext.Builder` constructor already reads stats from
`ColumnMetadata`, so the explicit `withLengthOfLongestElement` /
`withMaxRowLengthInBytes` overrides on the per-op builders go away.
### `InvertedIndexAndDictionaryBasedForwardIndexCreator`
- Constructor now takes `FieldIndexConfigs` directly. The three derived
parameters (`dictionaryEnabled`, `dictionaryBasedForwardIndex`,
`forwardIndexConfig`) are gone — they were confusing because the two booleans
referred to *post-rebuild* state.
- `_dictionaryPresent` renamed to `_keepDictionary` with a clearer javadoc
that names the post-rebuild semantic ("kept vs. dropped").
- Both SV and MV paths now track per-element length stats from the
dictionary inline when the source segment is missing them, so the backfill
happens during the existing main loop with no second pass over the column. The
values flow into both the `IndexCreationContext.Builder` (so the new
forward-index creator sees them) and the persisted `metadataProperties` map.
- For MV var-length, `MAX_ROW_LENGTH_IN_BYTES` is always recomputed and
persisted — in-row duplicates can drop during the rebuild and change the
per-row max.
### `SegmentMetadataUtils.updateMetadataProperties`
- A `null` value now removes the property via `clearProperty` rather than
writing the literal string. Callers can drop a key by mapping it to `null`.
### `BaseIndexHandler.createForwardIndexIfNeeded`
- Simplified to pass `_fieldIndexConfigs.get(columnName)` directly to the
inv→fwd creator constructor instead of three derived locals.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]