jpountz commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r752994952



##########
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }
 
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
       Thanks for running these experiments, the performance does look much 
better now indeed! I would feel good about merging something that has these 
performance characteristics.
   
   > I assume by "highest weight" here, you meant term that has lower doc 
frequencies, as opposed to field weight?
   
   I meant field weight. Since we're using impacts from the field that has the 
highest weight and approximating impacts from other fields, my assumption was 
that it would help get better max score approximations.
   
   That said, I expect that it would often have the same result as picking the 
field that has the lowest doc freq, since fields that have higher weight tend to 
have fewer terms per doc, so doc freqs are generally lower?
   
   > result = Math.max(result, impacts.getDocIdUpTo(level));
   
   When it was `Math.min`, it hurt us because we would have to recompute score 
boundaries often. But with `Math.max` we are forcing all fields but one to 
return impacts on the next level. My gut feeling is that returning the 
`getDocIdUpTo` of the field that has the highest weight (or lowest doc freq as 
you suggested) would perform better.
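   To make the trade-off concrete, here is a small standalone sketch (names are 
illustrative, not Lucene's API) contrasting the two strategies discussed above: 
taking the minimum `getDocIdUpTo` across all fields, which yields tight block 
boundaries but forces frequent recomputation of score bounds, versus following a 
single lead field chosen by highest weight and using its boundary.

```java
import java.util.Map;

// Hypothetical sketch of the two boundary-selection strategies; the field
// names, weights, and per-field docIdUpTo values are made-up inputs.
class LeadBoundarySketch {

  // Strategy 1: minimum boundary across all fields (tightest boundary, but
  // score bounds must be recomputed whenever any field crosses a block).
  static int minBoundary(Map<String, Integer> docIdUpToPerField) {
    int result = Integer.MAX_VALUE;
    for (int upTo : docIdUpToPerField.values()) {
      result = Math.min(result, upTo);
    }
    return result;
  }

  // Strategy 2: follow the field with the highest weight (or, alternatively,
  // the lowest doc freq) and return its boundary directly.
  static int leadFieldBoundary(
      Map<String, Integer> docIdUpToPerField, Map<String, Float> fieldWeights) {
    String lead = null;
    for (Map.Entry<String, Float> e : fieldWeights.entrySet()) {
      if (lead == null || e.getValue() > fieldWeights.get(lead)) {
        lead = e.getKey();
      }
    }
    return docIdUpToPerField.get(lead);
  }

  public static void main(String[] args) {
    // "title" is the heavily weighted field whose skip entries cover 1024
    // docs; "body" has smaller blocks of 128 docs.
    Map<String, Integer> upTo = Map.of("title", 1024, "body", 128);
    Map<String, Float> weights = Map.of("title", 4.0f, "body", 1.0f);
    System.out.println(minBoundary(upTo)); // 128: body forces a tight boundary
    System.out.println(leadFieldBoundary(upTo, weights)); // 1024: title leads
  }
}
```

   With the lead-field strategy the merged impacts advance in the lead field's 
larger blocks, so the other fields' impacts only need to be approximated up to 
that boundary rather than recomputed at every smaller block.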
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
