Re: [PR] Split taxonomy arrays across chunks [lucene]

via GitHub Sun, 07 Jan 2024 12:39:29 -0800


msfroh commented on code in PR #12995:
URL: https://github.com/apache/lucene/pull/12995#discussion_r1444069001



##########
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##########
@@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, 
TaxonomyIndexArrays copyFrom) thr
     // it may be caused if e.g. the taxonomy segments were merged, and so an 
updated
     // NRT reader was obtained, even though nothing was changed. this is not 
very likely
     // to happen.
-    int[] copyParents = copyFrom.parents();
-    this.parents = new int[reader.maxDoc()];
-    System.arraycopy(copyParents, 0, parents, 0, copyParents.length);
-    initParents(reader, copyParents.length);
-
+    int[][] parentArray = allocateChunkedArray(reader.maxDoc());
+    copyChunkedArray(copyFrom.parents.values, parentArray);
+    initParents(parentArray, reader, copyFrom.parents.length());
+    parents = new ChunkedArray(parentArray);
     if (copyFrom.initializedChildren) {
       initChildrenSiblings(copyFrom);
     }
   }
 
+  private static int[][] allocateChunkedArray(int size) {
+    int chunkCount = size / CHUNK_SIZE + 1;
+    int lastChunkSize = size % CHUNK_SIZE;
+    int[][] array = new int[chunkCount][];
+    if (array.length > 0) {
+      for (int i = 0; i < chunkCount - 1; i++) {
+        array[i] = new int[CHUNK_SIZE];
+      }
+      array[chunkCount - 1] = new int[lastChunkSize];
+    }
+    return array;
+  }
+
+  private static void copyChunkedArray(int[][] oldArray, int[][] newArray) {
+    // Copy all but the last (maybe partial) chunk from the old array
+    if (oldArray.length > 1) {
+      System.arraycopy(oldArray, 0, newArray, 0, oldArray.length - 1);
+    }
+    int[] lastCopyChunk = oldArray[oldArray.length - 1];
+    System.arraycopy(lastCopyChunk, 0, newArray[oldArray.length - 1], 0, 
lastCopyChunk.length);

Review Comment:
   The last chunk is the only one that changes -- the other chunks can be 
reused.
   
   Suppose the old array had size 20,000. It would look like:
   
   ```
   0 -> 8192 element chunk (c0)
   1 -> 8192 element chunk (c1)
   2 -> 3616 element chunk (c2)
   ```
   
   Suppose  I want to add 10 more elements. In the old way, that would mean 
allocating a whole new 20,010 element array and copying over the 20,000 
elements from the original array. In the new, chunked world, we allocate a new 
3-element `int[][]`, and populate it like:
   
   ```
   0 -> c0 (reusing the old array)
   1 -> c1 (reusing the old array)
   2 -> c2' (new 3626-element array).
   ```
   
   Now I copy over the 3616 elements from `c2` to `c2'`. The total "temporary" 
extra space is 3 references for the new top-level array, plus 3616 integers 
(since we're holding onto `c2` until we've copied it). The old model needed to 
allocate a temporary 20,000. elements.
   
   In general, where the old model required `n` temporary extra `int`s (so `4n` 
bytes plus a few bytes of array overhead), the new model needs `ceil(n/8192)` 
temporary pointers (the new `int[][]` array) plus up to 8192 extra `int`s for 
the last chunk.
   
   In terms of minimizing waste, assuming references are 32-bit, I suppose the 
ideal chunk size is the square root of `n` (if references are 64-bit, we would 
be better off with bigger chunks). If we want to more efficiently handle more 
than 67 million ordinals, I guess we could set chunk size to something bigger 
than 8192. I just copied the value from `IntBlockPool`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Split taxonomy arrays across chunks [lucene]

Reply via email to