siddharthteotia opened a new pull request, #18561: URL: https://github.com/apache/pinot/pull/18561
**Draft: landed for design feedback. Open items called out below.**

## Summary

Pushes `GROUP BY jsonExtractIndex(col, '$.path', 'TYPE') + COUNT(*)` into a dictionary scan over the JSON index instead of the forward-index + Jackson parse path. Work scales with the number of distinct values at the path (`D`), not with the number of matched documents (`M`). A runtime selectivity gate routes the query back to the standard `GroupByOperator` when `D > k × M` (where `k` is `SELECTIVITY_THRESHOLD`), so the new path is only chosen when it actually wins.

## What runs today

Given a query like:

```sql
SELECT jsonExtractIndex(payload, '$.country', 'STRING') AS country, COUNT(*)
FROM events
WHERE JSON_MATCH(payload, '"$.event_type" = ''click''')
GROUP BY country
```

`GroupByOperator` iterates the WHERE bitmap and, for each matched doc, reads the raw JSON from the forward index, parses it with Jackson, extracts the path, and hashes into the group map. For 5M matched docs and 200 countries, that's 5M parses for what the JSON index already knows.

## What this PR does

1. **New operator** `JsonIndexGroupByOperator` (`pinot-core/.../operator/query/`). For each entry in the dictionary range covering the path, intersects the posting list with the WHERE bitmap via `RoaringBitmap.andCardinality` and emits `(value, count)`. Zero forward-index reads, zero JSON parses.
2. **Shared parsing helper** `JsonExtractIndexUtils` extracted from the existing `JsonIndexDistinctOperator` so both index-aware operators can share parsing + same-path JSON_MATCH push-down logic. DISTINCT operator behavior is unchanged.
3. **Same-path JSON_MATCH push-down.** A WHERE predicate on the same path as the GROUP BY key gets pushed into the index lookup. Cross-column / cross-path filters are applied as a residual bitmap intersection.
4. **IS_NULL safety.** A same-path JSON_MATCH that could match missing-path docs is NOT forwarded into the index lookup, so correctness no longer depends on implementation-specific "returns empty map" behavior of the reader SPI.
5. **Selectivity gate** in `canUse(...)`. Compares path cardinality (`D`) to matched-doc count (`M`); routes to `JsonIndexGroupByOperator` only when `D ≤ SELECTIVITY_THRESHOLD × M`. New SPI method `JsonIndexReader.getDistinctValueCountForPath(path)` provides the cheap `D` estimate:
   - `ImmutableJsonIndexReader` answers in O(log N) via the dictionary range.
   - `MutableJsonIndexImpl` answers via the `TreeMap` sub-range.
   - The default implementation delegates to materializing the value set, for third-party readers.
6. **`GroupByPlanNode` refactored** to build the filter operator once and reuse it for either path.
7. **JMH benchmark** `BenchmarkJsonIndexGroupByCount` sweeps `(pathCardinality × matchedFraction)` to empirically settle `SELECTIVITY_THRESHOLD`. Current value (`2.0`) is a placeholder pending the benchmark numbers.
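
For readers skimming the operator description, the per-value counting idea can be sketched roughly as follows. This is an illustration only, not the code in this PR: `java.util.BitSet` stands in for `RoaringBitmap` so the example runs on the JDK alone, the class and method names (`JsonIndexGroupBySketch`, `groupByCount`, `bits`) are hypothetical, and skipping values with a zero count is an assumption about how empty groups are treated.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: BitSet stands in for RoaringBitmap, and all
// names here are hypothetical rather than taken from the PR.
class JsonIndexGroupBySketch {

    // Stand-in for RoaringBitmap.andCardinality(a, b): size of the intersection.
    static int andCardinality(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone();
        copy.and(b);
        return copy.cardinality();
    }

    // One intersection per distinct value (D iterations), independent of how
    // many documents matched the WHERE clause.
    static Map<String, Integer> groupByCount(TreeMap<String, BitSet> postings, BitSet matched) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : postings.entrySet()) {
            int count = andCardinality(entry.getValue(), matched);
            if (count > 0) { // assumption: values with no matching doc produce no group
                counts.put(entry.getKey(), count);
            }
        }
        return counts;
    }

    // Helper to build a bitmap from doc ids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int docId : docIds) {
            set.set(docId);
        }
        return set;
    }

    public static void main(String[] args) {
        // Posting lists for the values found at one JSON path.
        TreeMap<String, BitSet> postings = new TreeMap<>();
        postings.put("DE", bits(0, 3));
        postings.put("IN", bits(1, 4, 5));
        postings.put("US", bits(2, 6));

        // Doc ids selected by the WHERE clause.
        BitSet matched = bits(0, 1, 4, 6);

        System.out.println(groupByCount(postings, matched)); // prints {DE=1, IN=2, US=1}
    }
}
```

The loop body never touches a document payload, which is the reason the cost tracks `D` rather than `M`.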

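
The selectivity gate and the sub-range style of `D` estimate can be sketched in the same spirit. Everything below is an assumption made for illustration: the key layout (`path`, a `\u0000` separator, then the value), the class name `SelectivityGateSketch`, and the helper names are hypothetical and may not match the actual readers. Only the comparison `D ≤ SELECTIVITY_THRESHOLD × M` and the placeholder value `2.0` come from the description above.

```java
import java.util.TreeMap;

// Illustrative sketch only: the key layout and all names are assumptions.
class SelectivityGateSketch {

    // Placeholder value quoted in the PR description.
    static final double SELECTIVITY_THRESHOLD = 2.0;

    // Assumed separator between path and value inside an index key.
    static final char SEPARATOR = '\u0000';

    // Counts keys of the form path + SEPARATOR + value by taking the sorted
    // sub-range that starts at path + SEPARATOR and ends before the next prefix.
    static <V> int distinctValueCountForPath(TreeMap<String, V> index, String path) {
        String from = path + SEPARATOR;
        String to = path + (char) (SEPARATOR + 1);
        return index.subMap(from, true, to, false).size();
    }

    // Gate: take the index-driven path only when D <= threshold * M.
    static boolean useIndexPath(long distinctValues, long matchedDocs) {
        return distinctValues <= SELECTIVITY_THRESHOLD * matchedDocs;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("$.country" + SEPARATOR + "DE", 0);
        index.put("$.country" + SEPARATOR + "IN", 1);
        index.put("$.country" + SEPARATOR + "US", 2);
        index.put("$.event_type" + SEPARATOR + "click", 3);

        int d = distinctValueCountForPath(index, "$.country");
        System.out.println(d);                          // prints 3
        System.out.println(useIndexPath(d, 5_000_000)); // prints true
        System.out.println(useIndexPath(d, 1));         // prints false
    }
}
```

With few distinct values and many matched docs the gate picks the index path; when almost nothing matched, scanning the handful of matched docs is cheaper than walking every value, so the gate falls back.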