rishabhmaurya commented on code in PR #13123:
URL: https://github.com/apache/lucene/pull/13123#discussion_r1513889998


##########
lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:
##########
@@ -341,116 +344,94 @@ protected void compute() {
      */
     private boolean shuffle(
         ForwardIndex forwardIndex,
-        IntsRef left,
-        IntsRef right,
+        IntsRef docIDs,
+        int midPoint,
         int[] leftDocFreqs,
         int[] rightDocFreqs,
-        float[] gains,
+        float[] biases,
         int iter)
         throws IOException {
-      assert left.ints == right.ints;
-      assert left.offset + left.length == right.offset;
 
-      // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-      // postings to recompute gains, and the total number of postings is usually one order of
+      // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+      // all postings to recompute biases, and the total number of postings is usually one order of
       // magnitude or more larger than the number of docs. So we try to parallelize it.
-      ComputeGainsTask leftGainsTask =
-          new ComputeGainsTask(
-              left.ints,
-              gains,
-              left.offset,
-              left.offset + left.length,
+      new ComputeBiasTask(
+              docIDs.ints,
+              biases,
+              docIDs.offset,
+              docIDs.offset + docIDs.length,
               leftDocFreqs,
               rightDocFreqs,
               threadLocal,
-              depth);
-      ComputeGainsTask rightGainsTask =
-          new ComputeGainsTask(
-              right.ints,
-              gains,
-              right.offset,
-              right.offset + right.length,
-              rightDocFreqs,
-              leftDocFreqs,
-              threadLocal,
-              depth);
-      if (shouldFork(docIDs.length, docIDs.ints.length)) {
-        invokeAll(leftGainsTask, rightGainsTask);
-      } else {
-        leftGainsTask.compute();
-        rightGainsTask.compute();
+              depth)
+          .compute();
+
+      float maxLeftBias = Float.NEGATIVE_INFINITY;
+      for (int i = docIDs.offset; i < midPoint; ++i) {
+        maxLeftBias = Math.max(maxLeftBias, biases[i]);
+      }
+      float minRightBias = Float.POSITIVE_INFINITY;
+      for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+        minRightBias = Math.min(minRightBias, biases[i]);
+      }
+      float gain = maxLeftBias - minRightBias;
+      // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+      // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+      // that is most attracted to the right and the doc from the right side that is most attracted
+      // to the left against `iter` rather than zero.
+      if (gain <= iter) {
+        return false;
       }
 
-      class ByDescendingGainSorter extends IntroSorter {
+      new IntroSelector() {
 
         int pivotDoc;
-        float pivotGain;
+        float pivotBias;
 
         @Override
         protected void setPivot(int i) {
-          pivotDoc = left.ints[i];
-          pivotGain = gains[i];
+          pivotDoc = docIDs.ints[i];
+          pivotBias = biases[i];
         }
 
         @Override
         protected int comparePivot(int j) {
-          // Compare in reverse order to get a descending sort
-          int cmp = Float.compare(gains[j], pivotGain);
+          int cmp = Float.compare(pivotBias, biases[j]);

Review Comment:
   @jpountz To be precise, this is what I tried in order to avoid swaps, as per the paper:
   ```java
    int cmp = Float.compare(pivotBias - iter, biases[j]);
       if (cmp > 0) {
         return cmp;
       }
       cmp = Float.compare(pivotBias, biases[j]);
       if (cmp >= 0) {
      // Tie break on the doc ID to preserve doc ID ordering as much as possible
         cmp = pivotDoc - docIDs.ints[j];
       }
       return cmp;
   ```
   It won't select the precise k-th largest, but it will be close to it, with the tolerance governed by `iter`.
   
   I used the [http_logs](https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/http_logs) workload, and I see results similar to those described in the paper with respect to the time taken to perform reordering.
   
   For the benchmark setup, I set up just a single shard and force-merge into one segment after indexing; the corpus used is `logs-231998`.
   
   Here are the results:
   
   ### With this change
   ```shell
   Final index size: 686.3mb
   curl localhost:9200/_cat/indices
   green open logs-231998 T-HpeIWARuSZFGa74PWOig 1 0 11961342 0 686.3mb 686.3mb
   ```
   Some additional metrics I'm publishing [here](https://github.com/rishabhmaurya/lucene/commit/fda8f19ac691af832e4bd7b4722324a4f14971bf#diff-43ddf233a75c4de53d99ec83e5b52a5eff0c39e8d601ae6816db49c7c14f785aR1154); you can see that the **ratio of swaps to docs is 8**
   ```
   Reorder stats
   totalDocs: 11961342; numTerms: 964; totalPostings: 88726436; totalSwapsPerformed: 96188516; postingsDocRatio: 7; swapsDocsRatio: 8
   12816206 9489609 12087839 12496439 9235544 9035264 8148933 6569703 5403994 3803447 2775037 1757719 1117468 668304 374775 216838 120914 70483 0 0 0 0 0 0 0
   ```
   | Metric | Task | Value | Unit |
   | ---: | ---: | ---: | ---: |
   | Cumulative merge time of primary shards |  | 9.23488 | min |
   | Cumulative merge count of primary shards |  | 20 |  |
   
   ### Without this change
   ```shell
   Final index size: 840.2mb
   curl localhost:9200/_cat/indices
   green open logs-231998 wZ8LNHeDQaCiLgOkJTjKxg 1 0 11961342 0 840.2mb 840.2mb
   ```
   The **ratio of swaps to docs is 47**
   ```
   Reorder stats
   totalDocs: 11961342; numTerms: 964; totalPostings: 88726436; totalSwapsPerformed: 571132294; postingsDocRatio: 7; swapsDocsRatio: 47
   47254463 55941681 48415352 38245831 39694052 36558063 37907697 33234795 32316838 30361191 27886028 26467715 24669276 22745648 21136997 17510104 16757532 14029031 0 0 0 0 0 0 0
   ```
   | Metric | Task | Value | Unit |
   | ---: | ---: | ---: | ---: |
   | Cumulative merge time of primary shards |  | 15.5215 | min |
   | Cumulative merge count of primary shards |  | 20 |  |
   
   So the overall time to perform the merge dropped from 15.5 min to 9.23 min, and the ratio of swaps performed to docs also dropped significantly.
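   As a quick sanity check (my own arithmetic, not from the PR), the reported `swapsDocsRatio` values follow directly from the raw counters in the reorder stats above:

```java
public class SwapRatioCheck {
  public static void main(String[] args) {
    long totalDocs = 11_961_342L;         // totalDocs, identical in both runs
    long swapsWithChange = 96_188_516L;   // totalSwapsPerformed with this change
    long swapsWithout = 571_132_294L;     // totalSwapsPerformed without it
    // Integer division reproduces the reported ratios.
    System.out.println(swapsWithChange / totalDocs); // 8
    System.out.println(swapsWithout / totalDocs);    // 47
  }
}
```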
   
   ### Without reordering
   
   ```
   Final index size: 923.8mb
   curl localhost:9200/_cat/indices
   green open logs-231998 O0bppXZUREadOqVz80qJBw 1 0 11961342 0 923.8mb 
   ```
   
   | Metric | Task | Value | Unit |
   | ---: | ---: | ---: | ---: |
   | Cumulative merge time of primary shards |  | 1.13032 | min |
   | Cumulative merge count of primary shards |  | 21 |  |



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

