Re: [PR] Optimize JSON index doc id mapping [pinot]

via GitHub Thu, 04 Jun 2026 14:09:30 -0700


Copilot commented on code in PR #18680:
URL: https://github.com/apache/pinot/pull/18680#discussion_r3358914413



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java:
##########
@@ -165,6 +171,287 @@ private boolean isExclusive(Predicate.Type predicateType) 
{
     return predicateType == Predicate.Type.IS_NULL;
   }
 
+  private boolean isDocIdMappingIdentity() {
+    if (_numFlattenedDocs != _numDocs) {
+      return false;
+    }
+    for (int docId = 0; docId < _numDocs; docId++) {
+      if (getDocId(docId) != docId) {
+        return false;

Review Comment:
   `isDocIdMappingIdentity()` does an O(numDocs) full scan of the doc-id 
mapping during reader construction. For large segments (and many segments per 
server) this can noticeably increase segment-load time and fault in the mapping 
buffer up-front, potentially offsetting the query-time savings.
   
   Consider making the identity check cheaper (e.g., lazy/first-use, or 
persisting an explicit identity flag in the index header at creation time) so 
segment open doesn’t always pay the worst-case cost when the mapping is 
identity.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java:
##########
@@ -165,6 +171,287 @@ private boolean isExclusive(Predicate.Type predicateType) 
{
     return predicateType == Predicate.Type.IS_NULL;
   }
 
+  private boolean isDocIdMappingIdentity() {
+    if (_numFlattenedDocs != _numDocs) {
+      return false;
+    }
+    for (int docId = 0; docId < _numDocs; docId++) {
+      if (getDocId(docId) != docId) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private MutableRoaringBitmap toDocIdBitmap(ImmutableRoaringBitmap 
flattenedDocIds) {
+    if (_docIdMappingIdentity) {
+      return flattenedDocIds.toMutableRoaringBitmap();
+    }

Review Comment:
   The new identity fast-path (`_docIdMappingIdentity` + `toDocIdBitmap()` 
returning the flattened bitmap directly) is only exercised when a segment has 
exactly one flattened record per document. The existing `JsonIndexTest` records 
all include arrays (multiple flattened records), so this branch (and the 
doc-id-index evaluation path gated by `_docIdMappingIdentity`) is likely 
untested.
   
   Please add a unit test that builds an immutable JSON index from JSON 
documents without arrays and asserts correctness for a few predicate types 
(e.g. EQ/IN/RANGE/REGEXP/IS_NULL).



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java:
##########
@@ -142,9 +155,21 @@ private void addFlattenedRecords(List<Map<String, String>> 
records) {
           _bytesSize += Utf8.encodedLength(keyValue);
           return new RoaringBitmap();
         }).add(_nextFlattenedDocId);
+        if (isDirectDocIdIndexEligibleKey(key)) {
+          if (directDocIdValues == null) {
+            directDocIdValues = new HashSet<>();
+          }
+          directDocIdValues.add(key);
+          directDocIdValues.add(keyValue);
+        }
       }
       _nextFlattenedDocId++;
     }
+    if (!_docIdMappingIdentity && directDocIdValues != null) {
+      for (String value : directDocIdValues) {
+        _docIdPostingListMap.computeIfAbsent(value, k -> new 
MutableRoaringBitmap()).add(_nextDocId);
+      }
+    }

Review Comment:
   `_docIdPostingListMap` introduces additional MutableRoaringBitmaps (and 
doc-id postings) but `_bytesSize`/`canAddMore()`/memory metric reporting still 
only account for the string sizes added to `_postingListMap`. This can cause 
the mutable JSON index to exceed `maxBytesSize` and under-report memory usage 
once the doc-id posting list map is initialized/populated.
   
   Consider updating the memory accounting/limit checks to include the extra 
structures, or track a separate counter for the doc-id posting-list overhead 
and report/enforce based on the combined total.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java:
##########
@@ -545,6 +822,9 @@ private LazyBitmap 
getMatchingFlattenedDocIdsForKeyValue(Predicate predicate, St
   }
 
   public void convertFlattenedDocIdsToDocIds(Map<String, RoaringBitmap> 
valueToFlattenedDocIds) {
+    if (_docIdMappingIdentity) {
+      return;
+    }
     _readLock.lock();

Review Comment:
   The `_docIdMappingIdentity` fast-path is checked before acquiring 
`_readLock`. Because `_docIdMappingIdentity` is mutated under `_writeLock` and 
is not `volatile`, a reader thread can observe a stale `true` here and 
incorrectly skip conversion after the index transitions to non-identity (arrays 
present), returning flattened doc-ids instead of segment doc-ids.
   
   Move the identity check under the read lock (or make the flag `volatile` and 
keep the existing lock discipline) to avoid a data race.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Optimize JSON index doc id mapping [pinot]

Reply via email to