Copilot commented on code in PR #13506:
URL: https://github.com/apache/lucene/pull/13506#discussion_r3270753953


##########
lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriter.java:
##########
@@ -59,23 +59,29 @@ private void applyDeletes(SegmentWriteState state, Fields 
fields) throws IOExcep
       FrozenBufferedUpdates.TermDocsIterator iterator =
           new FrozenBufferedUpdates.TermDocsIterator(fields, true);
 
+      // Since we have already applied delete-by-docID, liveDocs maybe not 
null.
+      if (state.liveDocs == null) {
+        state.liveDocs = new FixedBitSet(state.segmentInfo.maxDoc());
+        state.liveDocs.set(0, state.segmentInfo.maxDoc());
+      }

Review Comment:
   The new code eagerly allocates and fully initializes `state.liveDocs` 
whenever `deleteTerms` is non-empty, even if none of the delete terms match any 
postings (e.g., docIdUpto==0 or term not present). This can be significantly 
more expensive than the previous per-doc null check for segments with many docs 
but no effective deletions. Consider lazily allocating liveDocs only after 
confirming at least one doc will be cleared (eg, by peeking the first postings 
doc for a term and only then initializing liveDocs once per term/once overall).



##########
lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriter.java:
##########
@@ -59,23 +59,29 @@ private void applyDeletes(SegmentWriteState state, Fields 
fields) throws IOExcep
       FrozenBufferedUpdates.TermDocsIterator iterator =
           new FrozenBufferedUpdates.TermDocsIterator(fields, true);
 
+      // Since we have already applied delete-by-docID, liveDocs maybe not 
null.
+      if (state.liveDocs == null) {
+        state.liveDocs = new FixedBitSet(state.segmentInfo.maxDoc());
+        state.liveDocs.set(0, state.segmentInfo.maxDoc());
+      }
+
       segDeletes.forEachOrdered(
           (term, docId) -> {
             DocIdSetIterator postings = iterator.nextTerm(term.field(), 
term.bytes());
             if (postings != null) {
               assert docId < PostingsEnum.NO_MORE_DOCS;
               int doc;
               while ((doc = postings.nextDoc()) < docId) {
-                if (state.liveDocs == null) {
-                  state.liveDocs = new FixedBitSet(state.segmentInfo.maxDoc());
-                  state.liveDocs.set(0, state.segmentInfo.maxDoc());
-                }
                 if (state.liveDocs.getAndClear(doc)) {
                   state.delCountOnFlush++;
                 }
               }
             }
           });
+      // Set to null, if there is no delete.

Review Comment:
   The comment “Set to null, if there is no delete.” is 
misleading/grammatically incorrect (it’s checking `delCountOnFlush`, i.e., no 
hard deletes applied during flush). Consider rewording to “Set to null if no 
hard deletes were applied during flush.”
   



##########
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##########
@@ -208,6 +209,56 @@ static void addDoc(IndexWriter writer) throws IOException {
     writer.addDocument(doc);
   }
 
+  public void testLiveDocs() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriter writer =
+        new IndexWriter(
+            dir,
+            new IndexWriterConfig(new MockAnalyzer(random()))
+                .setMergePolicy(NoMergePolicy.INSTANCE));
+
+    Document doc;
+    doc = new Document();
+    doc.add(new KeywordField("_id", "1", Field.Store.NO));
+    doc.add(new KeywordField("version", "1", Field.Store.NO));
+    writer.addDocument(doc);
+
+    doc = new Document();
+    doc.add(new KeywordField("_id", "1", Field.Store.NO));
+    doc.add(new KeywordField("version", "2", Field.Store.NO));
+    writer.updateDocument(new Term("_id", "1"), doc);
+
+    doc = new Document();
+    doc.add(new KeywordField("foo", "1", Field.Store.NO));
+    doc.add(new KeywordField("version", "1", Field.Store.NO));
+    writer.addDocument(doc);
+
+    doc = new Document();
+    doc.add(new KeywordField("foo", "1", Field.Store.NO));
+    doc.add(new KeywordField("version", "2", Field.Store.NO));
+    writer.updateDocument(new Term("foo", "1"), doc);
+    writer.flush(); // delCountOnFlush > 0, liveDocs != null.
+
+    doc = new Document();
+    doc.add(new KeywordField("_id", "2", Field.Store.NO));
+    doc.add(new KeywordField("version", "1", Field.Store.NO));
+    writer.updateDocument(new Term("_id", "3"), doc);
+
+    doc = new Document();
+    doc.add(new KeywordField("foo", "2", Field.Store.NO));
+    doc.add(new KeywordField("version", "1", Field.Store.NO));
+    writer.updateDocument(new Term("foo", "3"), doc);
+    writer.flush(); // delCountOnFlush is 0, liveDocs is null.
+
+    DirectoryReader reader = writer.getReader(true, false);
+
+    assertEquals(2, ((StandardDirectoryReader) 
reader).getSegmentInfos().info(0).getDelCount());
+    assertEquals(0, ((StandardDirectoryReader) 
reader).getSegmentInfos().info(1).getDelCount());

Review Comment:
   This test assumes there are exactly 2 segments and directly indexes 
`info(0)` / `info(1)`, which will throw an `IndexOutOfBoundsException` if the 
flush behavior changes (making failures harder to diagnose). Consider asserting 
the segment count first (or iterating over `getSegmentInfos()` and asserting 
the expected delCounts) to produce clearer failures.
   



##########
lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriter.java:
##########
@@ -59,23 +59,29 @@ private void applyDeletes(SegmentWriteState state, Fields 
fields) throws IOExcep
       FrozenBufferedUpdates.TermDocsIterator iterator =
           new FrozenBufferedUpdates.TermDocsIterator(fields, true);
 
+      // Since we have already applied delete-by-docID, liveDocs maybe not 
null.

Review Comment:
   Minor grammar: “liveDocs maybe not null” should be “liveDocs may not be 
null” (or similar).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to