mikemccand commented on code in PR #15803:
URL: https://github.com/apache/lucene/pull/15803#discussion_r2911462243
##########
lucene/core/src/java/org/apache/lucene/index/ReaderUtil.java:
##########
@@ -89,4 +89,57 @@ public static int subIndex(int n, List<LeafReaderContext> leaves) {
}
return hi;
}
+
+ /**
+ * Partitions sorted global doc IDs by leaf.
+ *
+ * @param sortedDocIds global doc IDs, must be sorted ascending
+ * @param leaves the index reader's leaves
+   * @return array indexed by leaf ord, containing global doc IDs for that leaf (empty if no hits)
Review Comment:
I wonder why the caller must sort? Maybe we could add a sugar method taking
`ScoreDoc[]` that extracts the `int[]` docids, sorts them, and then calls
the "real" method?
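E.g., a rough sketch of the sugar I mean -- untested, and using a minimal stand-in `record` for Lucene's `ScoreDoc` just so the snippet is self-contained (the real method would take `org.apache.lucene.search.ScoreDoc[]`):

```java
import java.util.Arrays;

// Minimal stand-in for org.apache.lucene.search.ScoreDoc; only the doc field matters here.
record ScoreDoc(int doc) {}

class PartitionSugarSketch {
  // Sugar: pull the global docids out of the hits and sort them ascending,
  // producing exactly the input the "real" partitionByLeaf expects.
  static int[] extractAndSortDocIds(ScoreDoc[] hits) {
    int[] docIds = new int[hits.length];
    for (int i = 0; i < hits.length; i++) {
      docIds[i] = hits[i].doc();
    }
    Arrays.sort(docIds);
    return docIds;
  }
}
```

The sugar method would then just forward the sorted array to `partitionByLeaf(int[], List<LeafReaderContext>)`.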
##########
lucene/core/src/java/org/apache/lucene/index/ReaderUtil.java:
##########
@@ -89,4 +89,57 @@ public static int subIndex(int n, List<LeafReaderContext> leaves) {
}
return hi;
}
+
+ /**
+ * Partitions sorted global doc IDs by leaf.
+ *
+ * @param sortedDocIds global doc IDs, must be sorted ascending
+ * @param leaves the index reader's leaves
+   * @return array indexed by leaf ord, containing global doc IDs for that leaf (empty if no hits)
+ */
+  public static int[][] partitionByLeaf(int[] sortedDocIds, List<LeafReaderContext> leaves) {
+ assert isSorted(sortedDocIds) : "sortedDocIds must be sorted ascending";
+ int numLeaves = leaves.size();
+ int[][] result = new int[numLeaves][];
+ if (sortedDocIds.length == 0) {
+ for (int i = 0; i < numLeaves; i++) {
+ result[i] = new int[0];
+ }
+ return result;
+ }
+ // Pass 1: Count and allocate
+ int hitsCount = 0;
+ int leafIdx = 0;
+ LeafReaderContext leaf = leaves.get(0);
+ int leafEnd = leaf.docBase + leaf.reader().maxDoc();
+ for (int docId : sortedDocIds) {
+ while (docId >= leafEnd) {
+ result[leafIdx] = new int[hitsCount];
+ hitsCount = 0;
+ leafIdx++;
+ leaf = leaves.get(leafIdx);
+ leafEnd = leaf.docBase + leaf.reader().maxDoc();
+ }
+ hitsCount++;
+ }
+ result[leafIdx] = new int[hitsCount];
+ // Fill remaining empty leaves
+ for (int i = leafIdx + 1; i < numLeaves; i++) {
+ result[i] = new int[0];
+ }
+ // Pass 2: Copy docIds to result arrays
+ int srcPos = 0;
+ for (int[] leafDocs : result) {
+ System.arraycopy(sortedDocIds, srcPos, leafDocs, 0, leafDocs.length);
+ srcPos += leafDocs.length;
+ }
+ return result;
+ }
+
+ private static boolean isSorted(int[] a) {
+ for (int i = 1; i < a.length; i++) {
+ if (a[i] < a[i - 1]) return false;
Review Comment:
Maybe `throw new AssertionError(...)` and include specifics (`docid X
appears after docid Y`)?
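i.e. something like this (untested sketch):

```java
class SortedCheckSketch {
  // Instead of silently returning false, fail loudly with the offending
  // pair so the assertion message pinpoints exactly where the input broke.
  static boolean isSorted(int[] a) {
    for (int i = 1; i < a.length; i++) {
      if (a[i] < a[i - 1]) {
        throw new AssertionError(
            "docIds not sorted: docid " + a[i] + " at index " + i
                + " appears after docid " + a[i - 1]);
      }
    }
    return true;
  }
}
```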
##########
lucene/core/src/test/org/apache/lucene/index/TestReaderUtil.java:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestReaderUtil extends LuceneTestCase {
+
+ public void testPartitionByLeafEmptyDocIds() throws IOException {
Review Comment:
Maybe add some more "shrink wrapped" / "unit-y" tests that directly test at
the `int[]`/`int[][]` layer instead of working through Lucene's indexing APIs?
Also, maybe add a randomized test to stress test ... a random number of
segments, a random `maxDoc` for each segment, a random number of hits for the
query, sometimes massive, perhaps matching the entire index, etc.
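Something like this rough, untested sketch -- it works at the `int[]` layer by modeling leaves as an array of exclusive `leafEnds` bounds rather than real `LeafReaderContext`s, and carries its own `partition` model so the snippet is self-contained:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

class PartitionStressSketch {

  // int[]-level model of partitionByLeaf: leafEnds[i] is the exclusive global
  // docId bound of leaf i (its docBase + maxDoc).
  static int[][] partition(int[] sortedDocIds, int[] leafEnds) {
    int[][] result = new int[leafEnds.length][];
    int start = 0;
    int i = 0;
    for (int leafIdx = 0; leafIdx < leafEnds.length; leafIdx++) {
      while (i < sortedDocIds.length && sortedDocIds[i] < leafEnds[leafIdx]) {
        i++;
      }
      result[leafIdx] = Arrays.copyOfRange(sortedDocIds, start, i);
      start = i;
    }
    return result;
  }

  static void stressOnce(Random random) {
    // Random number of segments, each with a random maxDoc.
    int numLeaves = 1 + random.nextInt(20);
    int[] leafEnds = new int[numLeaves];
    int total = 0;
    for (int i = 0; i < numLeaves; i++) {
      total += 1 + random.nextInt(1000);
      leafEnds[i] = total;
    }
    // Random number of hits; sometimes the entire index matches.
    int numHits = random.nextInt(5) == 0 ? total : random.nextInt(total + 1);
    TreeSet<Integer> hits = new TreeSet<>();
    if (numHits == total) {
      for (int d = 0; d < total; d++) hits.add(d);
    } else {
      while (hits.size() < numHits) hits.add(random.nextInt(total));
    }
    int[] sortedDocIds = hits.stream().mapToInt(Integer::intValue).toArray();

    int[][] result = partition(sortedDocIds, leafEnds);

    // Every hit must land in its leaf's bounds, and flattening the partitions
    // must round-trip back to the original sorted input.
    List<Integer> flattened = new ArrayList<>();
    for (int leafIdx = 0; leafIdx < numLeaves; leafIdx++) {
      int base = leafIdx == 0 ? 0 : leafEnds[leafIdx - 1];
      for (int docId : result[leafIdx]) {
        if (docId < base || docId >= leafEnds[leafIdx]) {
          throw new AssertionError("docid " + docId + " outside leaf " + leafIdx);
        }
        flattened.add(docId);
      }
    }
    int[] roundTripped = flattened.stream().mapToInt(Integer::intValue).toArray();
    if (!Arrays.equals(sortedDocIds, roundTripped)) {
      throw new AssertionError("partition lost or reordered hits");
    }
  }
}
```

In the real test you'd use `random()` from `LuceneTestCase` so failures are reproducible from the seed.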
##########
lucene/core/src/test/org/apache/lucene/index/TestReaderUtil.java:
##########
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.tests.util.LuceneTestCase;
+
+public class TestReaderUtil extends LuceneTestCase {
+
+ public void testPartitionByLeafEmptyDocIds() throws IOException {
+ try (Directory dir = newDirectory();
+ IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
+ writer.addDocument(new Document());
+ writer.addDocument(new Document());
+ try (DirectoryReader reader = DirectoryReader.open(writer)) {
+ List<LeafReaderContext> leaves = reader.leaves();
+ int[][] result = ReaderUtil.partitionByLeaf(new int[0], leaves);
+ assertEquals(leaves.size(), result.length);
+ for (int[] leaf : result) {
+ assertEquals(0, leaf.length);
+ }
+ }
+ }
+ }
+
+ public void testPartitionByLeafSingleSegment() throws IOException {
+ try (Directory dir = newDirectory();
+ IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
+ for (int i = 0; i < 10; i++) {
+ writer.addDocument(new Document());
+ }
+ try (DirectoryReader reader = DirectoryReader.open(writer)) {
+ List<LeafReaderContext> leaves = reader.leaves();
+ assertEquals(1, leaves.size());
+
+ int[] docIds = {0, 3, 5, 9};
+ int[][] result = ReaderUtil.partitionByLeaf(docIds, leaves);
+
+ assertEquals(1, result.length);
+ assertArrayEquals(docIds, result[0]);
+ }
+ }
+ }
+
+ public void testPartitionByLeafMultipleSegments() throws IOException {
+ try (Directory dir = newDirectory();
+ IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
+ for (int i = 0; i < 10; i++) {
+ writer.addDocument(new Document());
+ }
+ writer.commit();
+
+ // Create second segment
+ for (int i = 0; i < 10; i++) {
+ writer.addDocument(new Document());
+ }
+ writer.commit();
+
+ try (DirectoryReader reader = DirectoryReader.open(writer)) {
+ List<LeafReaderContext> leaves = reader.leaves();
+ assertEquals(2, leaves.size());
+
+ // Hits in both segments
+ int[] docIds = {2, 9, 10, 18};
+ int[][] result = ReaderUtil.partitionByLeaf(docIds, leaves);
+
+ assertEquals(2, result.length);
+ // First segment: docs 0-9
+ assertArrayEquals(new int[] {2, 9}, result[0]);
+ // Second segment: docs 10-19
+ assertArrayEquals(new int[] {10, 18}, result[1]);
+ }
+ }
+ }
+
+ public void testPartitionByLeafSkipsEmptySegments() throws IOException {
Review Comment:
Paranoia: I don't think Lucene will do this (anymore -- long ago this
monster lurked -- we would discover in-the-wild Lucene indices with them), but
maybe add an explicit test for segments with 0 docs?
Lucene tries hard to never produce a 0-doc segment -- once all of a
segment's docs are deleted (at flush, at merge, or at the periodic
apply-deletes), it drops the segment entirely.
##########
lucene/core/src/java/org/apache/lucene/index/ReaderUtil.java:
##########
@@ -89,4 +89,57 @@ public static int subIndex(int n, List<LeafReaderContext> leaves) {
}
return hi;
}
+
+ /**
+ * Partitions sorted global doc IDs by leaf.
+ *
+ * @param sortedDocIds global doc IDs, must be sorted ascending
+ * @param leaves the index reader's leaves
+   * @return array indexed by leaf ord, containing global doc IDs for that leaf (empty if no hits)
+ */
+  public static int[][] partitionByLeaf(int[] sortedDocIds, List<LeafReaderContext> leaves) {
+ assert isSorted(sortedDocIds) : "sortedDocIds must be sorted ascending";
+ int numLeaves = leaves.size();
+ int[][] result = new int[numLeaves][];
+ if (sortedDocIds.length == 0) {
+ for (int i = 0; i < numLeaves; i++) {
+ result[i] = new int[0];
+ }
+ return result;
+ }
+ // Pass 1: Count and allocate
+ int hitsCount = 0;
+ int leafIdx = 0;
+ LeafReaderContext leaf = leaves.get(0);
+ int leafEnd = leaf.docBase + leaf.reader().maxDoc();
+ for (int docId : sortedDocIds) {
+ while (docId >= leafEnd) {
+ result[leafIdx] = new int[hitsCount];
+ hitsCount = 0;
+ leafIdx++;
+ leaf = leaves.get(leafIdx);
+ leafEnd = leaf.docBase + leaf.reader().maxDoc();
+ }
+ hitsCount++;
+ }
+ result[leafIdx] = new int[hitsCount];
+ // Fill remaining empty leaves
+ for (int i = leafIdx + 1; i < numLeaves; i++) {
+ result[i] = new int[0];
+ }
+ // Pass 2: Copy docIds to result arrays
+ int srcPos = 0;
+ for (int[] leafDocs : result) {
Review Comment:
You could maybe do this all in a single pass? When you detect the transition
to the next reader, you know how many hits the prior reader had?
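i.e. roughly this (untested sketch at the `int[]` layer, modeling leaves as an array of exclusive `leafEnds` bounds instead of `LeafReaderContext`s):

```java
import java.util.Arrays;

class SinglePassSketch {
  // Single-pass variant: leafEnds[i] is the exclusive global docId bound of
  // leaf i (its docBase + maxDoc). On each transition past a leaf's bound we
  // already know how many hits that leaf had (i - start), so we can allocate
  // and copy its slice immediately instead of making a second counting pass.
  static int[][] partitionByLeaf(int[] sortedDocIds, int[] leafEnds) {
    int numLeaves = leafEnds.length;
    int[][] result = new int[numLeaves][];
    int leafIdx = 0;
    int start = 0; // where the current leaf's hits begin in sortedDocIds
    for (int i = 0; i <= sortedDocIds.length; i++) {
      // Flush the current leaf (and any hit-less leaves after it) once we see
      // a docId past its bound, or once we run out of hits entirely.
      while (leafIdx < numLeaves
          && (i == sortedDocIds.length || sortedDocIds[i] >= leafEnds[leafIdx])) {
        result[leafIdx] = Arrays.copyOfRange(sortedDocIds, start, i);
        start = i;
        leafIdx++;
      }
    }
    return result;
  }
}
```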
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]