mikemccand commented on code in PR #14494:
URL: https://github.com/apache/lucene/pull/14494#discussion_r2073325953


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java:
##########
@@ -201,17 +208,53 @@ void save(DataOutput meta, IndexOutput index) throws 
IOException {
     if (status != Status.BUILDING) {
       throw new IllegalStateException("only unsaved trie can be saved, got: " 
+ status);
     }
+    int[] labelMap = saveLabelDictionary(meta);
     meta.writeVLong(index.getFilePointer());
-    saveNodes(index);
+    saveNodes(index, labelMap);
     meta.writeVLong(root.fp);
     index.writeLong(0L); // additional 8 bytes for over-reading
     meta.writeVLong(index.getFilePointer());
     status = Status.SAVED;
   }
 
-  void saveNodes(IndexOutput index) throws IOException {
+  /**
+   * Save label dictionary and return a label map that narrow labels' value to 
a constant value
+   * range starts from 0.
+   */
+  private int[] saveLabelDictionary(DataOutput out) throws IOException {
+    int[] labels = new int[labelsSeen.cardinality()];
+    int label = -1;
+    for (int i = 0; i < labels.length; i++) {
+      label = labels[i] = labelsSeen.nextSetBit(label + 1);
+    }
+    assert label == labelsSeen.length() - 1
+        || labelsSeen.nextSetBit(label + 1) == DocIdSetIterator.NO_MORE_DOCS;
+    if (labels.length == 0 || labels[labels.length - 1] - labels[0] + 1 == 
labels.length) {
+      out.writeVInt(0);

Review Comment:
   Maybe add a comment?
   
   I think what this is doing is writing a `0` byte if there were no labels, 
or, if the labels were already fully dense/compact to begin with?  And 
something else (above) will be able to tell the difference?
   
   Edit: OK I see that `0` at read-time means "don't map", i.e. use the int 
label directly.  And this works for the "no labels at all" case because who 
cares what the mapping is if you will never use it (sort of like if a tree 
falls in the woods and nobody hears, did it make any sound?).



##########
lucene/core/src/test/org/apache/lucene/codecs/lucene103/blocktree/TestTrie.java:
##########
@@ -71,6 +74,46 @@ public void testOneByteTerms() throws Exception {
     }
   }
 
+  public void testContinuousValues() throws Exception {

Review Comment:
   Could we also explicitly test both ends (0, 255) directly?  Maybe 2-3 labels 
starting with 0, and ending with 255?  Trying to tease out any scary ob1-Kenobi 
bugs!



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java:
##########
@@ -201,17 +208,53 @@ void save(DataOutput meta, IndexOutput index) throws 
IOException {
     if (status != Status.BUILDING) {
       throw new IllegalStateException("only unsaved trie can be saved, got: " 
+ status);
     }
+    int[] labelMap = saveLabelDictionary(meta);
     meta.writeVLong(index.getFilePointer());
-    saveNodes(index);
+    saveNodes(index, labelMap);
     meta.writeVLong(root.fp);
     index.writeLong(0L); // additional 8 bytes for over-reading
     meta.writeVLong(index.getFilePointer());
     status = Status.SAVED;
   }
 
-  void saveNodes(IndexOutput index) throws IOException {
+  /**
+   * Save label dictionary and return a label map that narrow labels' value to 
a constant value

Review Comment:
   `constant` -> `compact`?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java:
##########
@@ -74,14 +77,39 @@ IndexInput floorData(TrieReader r) throws IOException {
   final RandomAccessInput access;
   final IndexInput input;
   final Node root;
+  final int[] labelMap;

Review Comment:
   Could you add a comment explaining a bit what's happening here?  Explain 
that it is global across the entire Trie, and it remaps the written labels to 
compact ordinals for more efficient storage w/ negligible impact on terms 
lookup performance?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java:
##########
@@ -201,17 +208,53 @@ void save(DataOutput meta, IndexOutput index) throws 
IOException {
     if (status != Status.BUILDING) {
       throw new IllegalStateException("only unsaved trie can be saved, got: " 
+ status);
     }
+    int[] labelMap = saveLabelDictionary(meta);
     meta.writeVLong(index.getFilePointer());
-    saveNodes(index);
+    saveNodes(index, labelMap);
     meta.writeVLong(root.fp);
     index.writeLong(0L); // additional 8 bytes for over-reading
     meta.writeVLong(index.getFilePointer());
     status = Status.SAVED;
   }
 
-  void saveNodes(IndexOutput index) throws IOException {
+  /**
+   * Save label dictionary and return a label map that narrow labels' value to 
a constant value
+   * range starts from 0.
+   */
+  private int[] saveLabelDictionary(DataOutput out) throws IOException {
+    int[] labels = new int[labelsSeen.cardinality()];
+    int label = -1;
+    for (int i = 0; i < labels.length; i++) {
+      label = labels[i] = labelsSeen.nextSetBit(label + 1);
+    }
+    assert label == labelsSeen.length() - 1
+        || labelsSeen.nextSetBit(label + 1) == DocIdSetIterator.NO_MORE_DOCS;
+    if (labels.length == 0 || labels[labels.length - 1] - labels[0] + 1 == 
labels.length) {
+      out.writeVInt(0);
+      return null;
+    }
+    int[] map = new int[BYTE_RANGE];
+    out.writeVInt(labels.length);
+    for (int i = 0; i < labels.length; i++) {
+      int l = labels[i];
+      out.writeByte((byte) l);
+      map[l] = i;
+    }
+    return map;
+  }
+
+  private void saveNodes(IndexOutput index, int[] labelMap) throws IOException 
{
     final long startFP = index.getFilePointer();
-    Deque<Node> stack = new ArrayDeque<>();
+    Deque<Node> stack =
+        new ArrayDeque<>() {
+          @Override
+          public void push(Node node) {
+            if (labelMap != null) {
+              node.label = labelMap[node.label];

Review Comment:
   Maybe `assert node.label != -1`?  This label must've been defined?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java:
##########
@@ -74,14 +77,39 @@ IndexInput floorData(TrieReader r) throws IOException {
   final RandomAccessInput access;
   final IndexInput input;
   final Node root;
+  final int[] labelMap;
 
-  TrieReader(IndexInput input, long rootFP) throws IOException {
+  static IOSupplier<TrieReader> readerSupplier(DataInput metaIn, IndexInput 
indexIn)
+      throws IOException {
+    int[] labelMap = TrieReader.labelMap(metaIn);
+    long start = metaIn.readVLong();
+    long rootFP = metaIn.readVLong();
+    long end = metaIn.readVLong();
+    return () -> new TrieReader(indexIn.slice("outputs", start, end - start), 
rootFP, labelMap);
+  }
+
+  private TrieReader(IndexInput input, long rootFP, int[] labelMap) throws 
IOException {
     this.access = input.randomAccessSlice(0, input.length());
+    this.labelMap = labelMap;
     this.input = input;
     this.root = new Node();
     load(root, rootFP);
   }
 
+  private static int[] labelMap(DataInput in) throws IOException {
+    int cnt = in.readVInt();
+    if (cnt == 0) {
+      return null;
+    } else {
+      int[] labelMap = new int[TrieBuilder.BYTE_RANGE];

Review Comment:
   Could we make this `byte[]` instad?  This added heap could add up for many 
shards X many indices for multitenant (OpenSearch, Elastcisearch, Solr, ...) 
cases, and with many separate indexed fields?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java:
##########
@@ -63,7 +67,7 @@ private enum Status {
   private static class Node {
 
     // The utf8 digit that leads to this Node, 0 for root node
-    private final int label;
+    private int label;

Review Comment:
   Hmm why did we need to un-final?  Don't we remap on write, not update each 
`Node`'s label in place?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java:
##########
@@ -190,9 +218,19 @@ Node lookupChild(int targetLabel, Node parent, Node child) 
throws IOException {
       return null;
     }
 
+    int lookUpLabel;
+    if (labelMap == null) {
+      lookUpLabel = targetLabel;
+    } else {
+      lookUpLabel = labelMap[targetLabel];
+      if (lookUpLabel == -1) {

Review Comment:
   Hmm is this paranoia?  Why would we write a label ord that wasn't mapped to 
a valid label?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to