arunkumarucet opened a new pull request, #18891:
URL: https://github.com/apache/pinot/pull/18891
## Summary
Profiling a JSON analytics workload (a `JSON_EXTRACT_INDEX(col, '$.path',
...)` group-by and `JSON_MATCH(col, ...)` filters over a JSON-indexed column)
surfaced three hot spots in `ImmutableJsonIndexReader`, all in the per-query
value / doc-id materialization. This PR optimizes them with no behavior change.
### Changes
1. **`getValuesSV`** (~16% of query CPU): previously, for every value-block
it allocated a `RoaringBitmap` mask via `bitmapOf`, ran a `RoaringBitmap.and`
per distinct value, and populated an `Int2ObjectOpenHashMap` (find/get/insert).
The non-flattened path — the only path used by `jsonExtractIndex` SV — now
scatters values with a bounded `PeekableIntIterator`:
- **Dense, gap-free block** (the common full-scan case): the result
position is the doc-id offset, so values are written directly with no map, no
mask, and no per-value `and`.
- **Sparse block** (e.g. after a selective filter): a primitive
`Int2IntOpenHashMap` maps each doc id to its position once, then each value's
posting list is range-scanned within `[lo, hi]`.
2. **`convertFlattenedDocIdsToDocIds`** and the `JSON_MATCH`
**`getMatchingDocIds`** path both built result bitmaps with a per-element
`bitmap.add(getDocId(f))`, causing `RoaringArray.setContainerAtIndex` churn
(~10% of query CPU). The flattened → real doc-id mapping is monotonically
non-decreasing (the index flattens documents in doc-id order), so the mapped
doc ids are produced sorted and are now appended through an ordered
`RoaringBitmapWriter`, avoiding the per-element binary search and container
reallocation.
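
For reviewers who want the shape of both changes without reading the diff, here is a minimal, dependency-free sketch. It is not the code in this PR: the class and method names are hypothetical, sorted `int[]` arrays stand in for Roaring posting lists, a binary search plus bounded scan stands in for the `PeekableIntIterator`, `java.util.HashMap` stands in for `Int2IntOpenHashMap`, and an append-only array stands in for `RoaringBitmapWriter`. It assumes the doc ids of a block are sorted and distinct.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and data layout are assumptions, not Pinot's API.
final class JsonIndexScatterSketch {

  // Scatter each value into the result slot of every block doc id found in its posting list.
  // docIds: sorted, distinct doc ids of the block; postings[v]: sorted doc ids holding values[v].
  static String[] scatterValues(int[] docIds, String[] values, int[][] postings) {
    int n = docIds.length;
    String[] result = new String[n];
    if (n == 0) {
      return result;
    }
    int lo = docIds[0];
    int hi = docIds[n - 1];
    // Sorted and distinct ids with hi - lo == n - 1 means the block has no gaps.
    boolean dense = hi - lo == n - 1;

    Map<Integer, Integer> positions = null;
    if (!dense) {
      // Sparse block: map each doc id to its result position once, up front.
      positions = new HashMap<>(n * 2);
      for (int i = 0; i < n; i++) {
        positions.put(docIds[i], i);
      }
    }

    for (int v = 0; v < values.length; v++) {
      int[] posting = postings[v];
      // Skip to the first posting >= lo, then scan only while <= hi.
      int start = Arrays.binarySearch(posting, lo);
      if (start < 0) {
        start = -start - 1;
      }
      for (int j = start; j < posting.length && posting[j] <= hi; j++) {
        int docId = posting[j];
        if (dense) {
          // Dense block: the result position is simply the offset from lo.
          result[docId - lo] = values[v];
        } else {
          Integer pos = positions.get(docId);
          if (pos != null) {
            result[pos] = values[v];
          }
        }
      }
    }
    return result;
  }

  // Map sorted flattened doc ids to real doc ids. Because flattenedToDocId is
  // non-decreasing, the output is already sorted and duplicates are adjacent,
  // so an append-only write with a check against the last element suffices.
  static int[] mapFlattenedToDocIds(int[] sortedFlattenedIds, int[] flattenedToDocId) {
    int[] out = new int[sortedFlattenedIds.length];
    int size = 0;
    for (int f : sortedFlattenedIds) {
      int docId = flattenedToDocId[f];
      if (size == 0 || out[size - 1] != docId) {
        out[size++] = docId;
      }
    }
    return Arrays.copyOf(out, size);
  }

  public static void main(String[] args) {
    int[][] postings = {{3, 10, 12, 40}, {11, 13, 14}};
    String[] values = {"a", "b"};
    System.out.println(Arrays.toString(scatterValues(new int[] {10, 11, 12, 13}, values, postings)));
    System.out.println(Arrays.toString(mapFlattenedToDocIds(new int[] {1, 2, 4, 5}, new int[] {0, 0, 1, 1, 1, 2})));
  }
}
```

In the sketch, the dense branch never touches a map, and the ordered append never searches for an insertion point; those are the two costs the changes above aim to remove.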
### Results
On a 1M-row segment a `JSON_EXTRACT_INDEX` group-by query drops from ~34ms
to ~16ms (~2x), and `setContainerAtIndex` disappears from the profile. Output
is identical.
### Testing
`JsonIndexTest`, `JsonExtractIndexTransformFunctionTest`, and
`JsonIndexDistinctOperatorQueriesTest` all pass (62 tests). `spotless:apply`
and `checkstyle:check` clean.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]