Copilot commented on code in PR #18680:
URL: https://github.com/apache/pinot/pull/18680#discussion_r3358914413
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java:
##########
@@ -165,6 +171,287 @@ private boolean isExclusive(Predicate.Type predicateType)
{
return predicateType == Predicate.Type.IS_NULL;
}
+ private boolean isDocIdMappingIdentity() {
+ if (_numFlattenedDocs != _numDocs) {
+ return false;
+ }
+ for (int docId = 0; docId < _numDocs; docId++) {
+ if (getDocId(docId) != docId) {
+ return false;
Review Comment:
`isDocIdMappingIdentity()` does an O(numDocs) full scan of the doc-id
mapping during reader construction. For large segments (and many segments per
server) this can noticeably increase segment-load time and fault in the mapping
buffer up-front, potentially offsetting the query-time savings.
Consider making the identity check cheaper (e.g., lazy/first-use, or
persisting an explicit identity flag in the index header at creation time) so
segment open doesn’t always pay the worst-case cost when the mapping is
identity.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java:
##########
@@ -165,6 +171,287 @@ private boolean isExclusive(Predicate.Type predicateType)
{
return predicateType == Predicate.Type.IS_NULL;
}
+ private boolean isDocIdMappingIdentity() {
+ if (_numFlattenedDocs != _numDocs) {
+ return false;
+ }
+ for (int docId = 0; docId < _numDocs; docId++) {
+ if (getDocId(docId) != docId) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ private MutableRoaringBitmap toDocIdBitmap(ImmutableRoaringBitmap
flattenedDocIds) {
+ if (_docIdMappingIdentity) {
+ return flattenedDocIds.toMutableRoaringBitmap();
+ }
Review Comment:
The new identity fast-path (`_docIdMappingIdentity` + `toDocIdBitmap()`
returning the flattened bitmap directly) is only exercised when a segment has
exactly one flattened record per document. The existing `JsonIndexTest` records
all include arrays (multiple flattened records), so this branch (and the
doc-id-index evaluation path gated by `_docIdMappingIdentity`) is likely
untested.
Please add a unit test that builds an immutable JSON index from JSON
documents without arrays and asserts correctness for a few predicate types
(e.g. EQ/IN/RANGE/REGEXP/IS_NULL).
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java:
##########
@@ -142,9 +155,21 @@ private void addFlattenedRecords(List<Map<String, String>>
records) {
_bytesSize += Utf8.encodedLength(keyValue);
return new RoaringBitmap();
}).add(_nextFlattenedDocId);
+ if (isDirectDocIdIndexEligibleKey(key)) {
+ if (directDocIdValues == null) {
+ directDocIdValues = new HashSet<>();
+ }
+ directDocIdValues.add(key);
+ directDocIdValues.add(keyValue);
+ }
}
_nextFlattenedDocId++;
}
+ if (!_docIdMappingIdentity && directDocIdValues != null) {
+ for (String value : directDocIdValues) {
+ _docIdPostingListMap.computeIfAbsent(value, k -> new
MutableRoaringBitmap()).add(_nextDocId);
+ }
+ }
Review Comment:
`_docIdPostingListMap` introduces additional MutableRoaringBitmaps (and
doc-id postings) but `_bytesSize`/`canAddMore()`/memory metric reporting still
only account for the string sizes added to `_postingListMap`. This can cause
the mutable JSON index to exceed `maxBytesSize` and under-report memory usage
once the doc-id posting list map is initialized/populated.
Consider updating the memory accounting/limit checks to include the extra
structures, or track a separate counter for the doc-id posting-list overhead
and report/enforce based on the combined total.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java:
##########
@@ -545,6 +822,9 @@ private LazyBitmap
getMatchingFlattenedDocIdsForKeyValue(Predicate predicate, St
}
public void convertFlattenedDocIdsToDocIds(Map<String, RoaringBitmap>
valueToFlattenedDocIds) {
+ if (_docIdMappingIdentity) {
+ return;
+ }
_readLock.lock();
Review Comment:
The `_docIdMappingIdentity` fast-path is checked before acquiring
`_readLock`. Because `_docIdMappingIdentity` is mutated under `_writeLock` and
is not `volatile`, a reader thread can observe a stale `true` here and
incorrectly skip conversion after the index transitions to non-identity (arrays
present), returning flattened doc-ids instead of segment doc-ids.
Move the identity check under the read lock (or make the flag `volatile` and
keep the existing lock discipline) to avoid a data race.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]