Jackie-Jiang opened a new pull request, #18588: URL: https://github.com/apache/pinot/pull/18588
## Summary

Refines the two index-based `DISTINCT` operators added in #17872 / #17820 and rewrites their tests to drive the full broker → operator path.

### `JsonIndexDistinctOperator`

- Argument validation moves into the constructor, mirroring `JsonExtractIndexTransformFunction.init`. The operator accepts 3-/4-/5-arg `jsonExtractIndex` calls (path, type, optional default, optional `JSON_MATCH` filter expression) and MV `_ARRAY` types (`INT_ARRAY`, `STRING_ARRAY`, etc.).
- `canUseJsonIndexDistinct` is simplified to a function-name check; the planner routes every `jsonExtractIndex` call through the operator and lets the constructor surface invalid arguments as `IllegalArgumentException`.
- The runtime path intersects per-value doc ids from the JSON index with the `WHERE`-clause filter through a single `remainingDocs` bitmap, so cross-path `JSON_MATCH`, base-column filters, and `IS NULL` filters all use the same code path.
- `numDocsScanned` is populated from the filtered doc set (or total docs when the filter matches everything), so execution statistics line up with the scan/projection path.
- New query option `jsonIndexDistinctSkipMissingPath`: when set, the operator skips parsing the 4-arg default, skips `remainingDocs` tracking, and suppresses the "Illegal Json Path" throw for the 3-arg form. Useful when the caller knows every doc has the path (or doesn't care about misses).

### `InvertedIndexDistinctOperator`

- Caches `_totalDocs` in the constructor instead of recomputing per call.
- DESC-sorted path short-circuits with `intersects` (boolean) rather than `getLongCardinality`, which is orders of magnitude cheaper on dense bitmaps.
- Drops redundant `advanceIfNeeded(startDocId)` on the ASC sorted path and the redundant inner `filterIter.hasNext()` check.
- Reports a correct `numDocsScanned` on the sorted / inverted paths (previously zero).
- Inlines the `FilterPreparation` helper and renames `_numEntriesExamined` → `_numEntriesExaminedPostFilter` so the stats name matches its meaning.

### Tests

- Both operators get a new queries-based test (`JsonIndexDistinctOperatorQueriesTest`, `InvertedIndexDistinctOperatorQueriesTest`) that drives `SELECT DISTINCT` through `BaseQueriesTest`, asserts on result tables, explain strings, and execution statistics (`numDocsScanned`, `numEntriesScannedInFilter`, `numEntriesScannedPostFilter`, `numTotalDocs`).
- The older mock-based unit tests are removed; the queries tests cover the same behaviors against real segments.
- All `OPTION(...)` syntax in the suites is converted to standard `SET a=b;` prefixes; repeated query strings are extracted into shared constants.
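
### Illustrative sketch (not the PR's code)

The `remainingDocs` intersection and the `intersects` short-circuit described in the summary can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the operators' actual implementation: `java.util.BitSet` stands in for the Roaring bitmaps Pinot uses, the class `IndexDistinctSketch` and all of its method names are hypothetical, and the reading that `remainingDocs` tracks filtered docs not yet claimed by any indexed value (so a default can be emitted for them) is an interpretation of the description rather than something the PR text states.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; BitSet is a stand-in for a Roaring bitmap.
class IndexDistinctSketch {

  /**
   * Returns the distinct values whose doc ids overlap the filtered docs.
   * ASSUMPTION: remainingDocs starts as the filtered doc set and shrinks as
   * values claim docs; if any filtered doc is left unclaimed and a default
   * value was supplied, the default joins the result.
   */
  public static Set<String> distinctFromIndex(
      Map<String, BitSet> valueToDocs, BitSet filteredDocs, String defaultValue) {
    BitSet remainingDocs = (BitSet) filteredDocs.clone();
    Set<String> result = new LinkedHashSet<>();
    for (Map.Entry<String, BitSet> entry : valueToDocs.entrySet()) {
      // Boolean overlap test: no intersection bitmap is materialized.
      if (entry.getValue().intersects(filteredDocs)) {
        result.add(entry.getKey());
        remainingDocs.andNot(entry.getValue());
      }
    }
    if (defaultValue != null && !remainingDocs.isEmpty()) {
      result.add(defaultValue);
    }
    return result;
  }

  /** Overlap check that builds the full intersection and counts its bits. */
  public static boolean overlapsViaCardinality(BitSet a, BitSet b) {
    BitSet copy = (BitSet) a.clone();
    copy.and(b);
    return copy.cardinality() > 0;
  }

  /** Overlap check that can stop at the first shared bit. */
  public static boolean overlapsViaIntersects(BitSet a, BitSet b) {
    return a.intersects(b);
  }

  public static void main(String[] args) {
    Map<String, BitSet> index = new LinkedHashMap<>();
    BitSet docsForA = new BitSet();
    docsForA.set(0);
    docsForA.set(1);
    BitSet docsForB = new BitSet();
    docsForB.set(2);
    index.put("a", docsForA);
    index.put("b", docsForB);

    // Filter matches docs 1 and 3; doc 3 has no indexed value.
    BitSet filtered = new BitSet();
    filtered.set(1);
    filtered.set(3);

    System.out.println(distinctFromIndex(index, filtered, "dflt")); // [a, dflt]
    System.out.println(distinctFromIndex(index, filtered, null));   // [a]
  }
}
```

Both overlap helpers return the same boolean; the difference is that the cardinality version materializes and counts the whole intersection, while `intersects` only needs to find one shared doc id. Passing `null` as the default loosely mirrors skipping the missing-path handling, though the real `jsonIndexDistinctSkipMissingPath` option may behave differently in detail.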
