siddharthteotia opened a new pull request, #18561:
URL: https://github.com/apache/pinot/pull/18561

   **Draft — landed for design feedback. Open items called out below.**
   
   ## Summary
   
   Pushes `GROUP BY jsonExtractIndex(col, '$.path', 'TYPE') + COUNT(*)` into a 
dictionary scan over the JSON index instead of the forward-index + Jackson 
parse path. Work scales with the number of distinct values at the path (`D`), 
not with the number of matched documents (`M`). A runtime selectivity gate 
routes the query back to the standard `GroupByOperator` when `D > k × M` 
(`k` is `SELECTIVITY_THRESHOLD`, see item 5 below), so the new path is chosen 
only when it is expected to win.
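   
   As a rough illustration of the gate arithmetic, here is a minimal sketch. The class and method names are hypothetical and are not taken from the PR; the threshold is the placeholder `2.0` mentioned in item 7, and the real check lives in `canUse(...)`.
   
   ```java
   // Hypothetical sketch of the selectivity gate; not the PR's canUse(...) code.
   public final class SelectivityGateSketch {
     // Placeholder value from the PR description, pending benchmark numbers.
     public static final double SELECTIVITY_THRESHOLD = 2.0;
   
     /**
      * Returns true when the index scan (work proportional to D) is expected
      * to beat the per-document path (work proportional to M).
      */
     public static boolean useIndexPath(long distinctValuesAtPath, long matchedDocs) {
       return distinctValuesAtPath <= SELECTIVITY_THRESHOLD * matchedDocs;
     }
   
     public static void main(String[] args) {
       // 200 countries, 5M matched docs: 200 <= 2.0 * 5,000,000
       System.out.println(useIndexPath(200, 5_000_000));   // prints "true"
       // 1M distinct values, 1K matched docs: 1,000,000 > 2.0 * 1,000
       System.out.println(useIndexPath(1_000_000, 1_000)); // prints "false"
     }
   }
   ```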
   
   ## What runs today
   
   Given a query like:
   
   ```sql
   SELECT jsonExtractIndex(payload, '$.country', 'STRING') AS country, COUNT(*)
   FROM events
   WHERE JSON_MATCH(payload, '"$.event_type" = ''click''')
   GROUP BY country
   ```
   
   `GroupByOperator` iterates the WHERE bitmap and, for each matched doc, reads 
the raw JSON from the forward index, parses it with Jackson, extracts the path, 
and hashes into the group map. For 5M matched docs and 200 countries, that is 5M 
parses to recover 200 distinct values that the JSON index already stores.
   
   ## What this PR does
   
   1. **New operator** `JsonIndexGroupByOperator` 
(`pinot-core/.../operator/query/`). For each entry in the dictionary range 
covering the path, intersects the posting list with the WHERE bitmap via 
`RoaringBitmap.andCardinality` and emits `(value, count)`. Zero forward-index 
reads, zero JSON parses.
   2. **Shared parsing helper** `JsonExtractIndexUtils` extracted from the 
existing `JsonIndexDistinctOperator` so both index-aware operators can share 
parsing + same-path JSON_MATCH push-down logic. DISTINCT operator behavior is 
unchanged.
   3. **Same-path JSON_MATCH push-down.** A WHERE predicate on the same path as 
the GROUP BY key gets pushed into the index lookup. Cross-column / cross-path 
filters are applied as a residual bitmap intersection.
   4. **IS_NULL safety.** A same-path JSON_MATCH that could match missing-path 
docs is NOT forwarded into the index lookup, so correctness no longer depends 
on implementation-specific "returns empty map" behavior of the reader SPI.
   5. **Selectivity gate** in `canUse(...)`. Compares path cardinality (`D`) to 
matched-doc count (`M`); routes to `JsonIndexGroupByOperator` only when `D ≤ 
SELECTIVITY_THRESHOLD × M`. New SPI method 
`JsonIndexReader.getDistinctValueCountForPath(path)` provides the cheap `D` 
estimate (`ImmutableJsonIndexReader` answers in O(log N) via the dictionary 
range; `MutableJsonIndexImpl` answers via the `TreeMap` sub-range; default 
delegates to materializing the value set for third-party readers).
   6. **`GroupByPlanNode` refactored** to build the filter operator once and 
reuse it for either path.
   7. **JMH benchmark** `BenchmarkJsonIndexGroupByCount` sweeps 
`(pathCardinality × matchedFraction)` to empirically settle 
`SELECTIVITY_THRESHOLD`. Current value (`2.0`) is a placeholder pending the 
benchmark numbers.
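   
   The dictionary scan in item 1 and the sub-range count in item 5 can be sketched as follows. This is a dependency-free illustration, not the PR's code: `java.util.BitSet` stands in for `RoaringBitmap` (clone + `and` + `cardinality` plays the role of `RoaringBitmap.andCardinality`), a `TreeMap` stands in for the sorted dictionary, and the `path + '\0' + value` key layout, the class name, and all method names are assumptions made for the example. Skipping groups with a zero count is also an assumption about the operator's output.
   
   ```java
   import java.util.BitSet;
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.SortedMap;
   import java.util.TreeMap;
   
   // Hypothetical sketch; BitSet and TreeMap stand in for RoaringBitmap and the JSON index dictionary.
   public final class JsonIndexGroupBySketch {
     // Assumed key layout: path + SEP + value, so one path forms a contiguous sorted range.
     static final char SEP = '\0';
   
     /** Sub-range of the sorted dictionary that holds every value recorded at the path. */
     static SortedMap<String, BitSet> rangeForPath(TreeMap<String, BitSet> index, String path) {
       return index.subMap(path + SEP, path + (char) (SEP + 1));
     }
   
     /** D: number of distinct values at the path, read from the sub-range. */
     public static int distinctValueCountForPath(TreeMap<String, BitSet> index, String path) {
       return rangeForPath(index, path).size();
     }
   
     /** One intersection per dictionary entry; no document is read or parsed. */
     public static Map<String, Integer> groupByCount(
         TreeMap<String, BitSet> index, String path, BitSet filter) {
       Map<String, Integer> counts = new LinkedHashMap<>();
       int prefixLength = path.length() + 1; // path plus separator
       for (Map.Entry<String, BitSet> entry : rangeForPath(index, path).entrySet()) {
         BitSet matched = (BitSet) entry.getValue().clone();
         matched.and(filter); // posting list AND filter bitmap
         int count = matched.cardinality();
         if (count > 0) { // assumption: empty groups are not emitted
           counts.put(entry.getKey().substring(prefixLength), count);
         }
       }
       return counts;
     }
   
     /** Builds a posting list from doc ids. */
     public static BitSet docs(int... ids) {
       BitSet bits = new BitSet();
       for (int id : ids) {
         bits.set(id);
       }
       return bits;
     }
   
     public static void main(String[] args) {
       TreeMap<String, BitSet> index = new TreeMap<>();
       index.put("$.country" + SEP + "US", docs(0, 1, 2));
       index.put("$.country" + SEP + "IN", docs(3, 4));
       index.put("$.country" + SEP + "DE", docs(5));
       index.put("$.event_type" + SEP + "click", docs(0, 1, 3));
   
       BitSet filter = index.get("$.event_type" + SEP + "click"); // the WHERE bitmap
       System.out.println(groupByCount(index, "$.country", filter)); // prints "{IN=1, US=2}"
       System.out.println(distinctValueCountForPath(index, "$.country")); // prints "3"
     }
   }
   ```
   
   The loop runs once per distinct value at the path, which is why the work tracks `D` rather than `M`.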
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]