[GitHub] [solr] magibney commented on a diff in pull request #1184: SOLR-16555: SolrIndexSearcher - FilterCache intersections/andNot should not clone bitsets repeatedly

GitBox Mon, 21 Nov 2022 20:30:14 -0800


magibney commented on code in PR #1184:
URL: https://github.com/apache/solr/pull/1184#discussion_r1028792927



##########
solr/core/src/java/org/apache/solr/search/BitDocSet.java:
##########
@@ -194,19 +194,31 @@ public void addAllTo(FixedBitSet target) {
 
   @Override
   public DocSet andNot(DocSet other) {
-    FixedBitSet newbits = bits.clone();
+    FixedBitSet newbits = getFixedBitSetClone();
+    andNot(newbits, other);
+    return new BitDocSet(newbits);
+  }
+
+  /**
+   * Helper method for andNot that takes FixedBitSet and DocSet. This modifies 
the provided
+   * FixedBitSet to remove all bits contained in the DocSet argument -- 
equivalent to calling
+   * a.andNot(b), but modifies the state of the FixedBitSet instead of 
returning a new FixedBitSet.
+   *
+   * @param bits FixedBitSet to operate on
+   * @param other The DocSet to compare to
+   */
+  protected static void andNot(FixedBitSet bits, DocSet other) {
     if (other instanceof BitDocSet) {
-      newbits.andNot(((BitDocSet) other).bits);
+      bits.andNot(((BitDocSet) other).bits);
     } else {
       DocIterator iter = other.iterator();
       while (iter.hasNext()) {
         int doc = iter.nextDoc();
-        if (doc < newbits.length()) {
-          newbits.clear(doc);
+        if (doc < bits.length()) {
+          bits.clear(doc);

Review Comment:
   it occurs to me: now that this method no longer returns the FixedBitSet, it 
could easily return `boolean`, something along the lines of `true: may have 
been modified`, `false: definitely not modified`, detecting the latter for the 
non-BitDocSet case by calling `bits.getAndClear(int)`. Not sure it'd be worth 
it, but thought it might be worth mentioning ... (could use `false` return 
value to avoid resetting size, which I _think_ actually might make.a difference 
for some use cases: e.g., a cached positive filter and small (non-BitDocSet) 
negative filters that don't actually exclude anything). 



##########
solr/core/src/java/org/apache/solr/search/MutableBitDocSet.java:
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.search;
+
+import org.apache.lucene.util.FixedBitSet;
+
+/**
+ * A {@link BitDocSet} based implementation that mutates the underlying bits 
for andNot and
+ * intersection. This allows for computing the combinations of sets without 
duplicating the
+ * underlying array. This MutableBitDocSet should not be cached because it can 
be modified.
+ *
+ * @since solr 9.2
+ */
+class MutableBitDocSet extends BitDocSet {
+  private MutableBitDocSet(FixedBitSet bits) {
+    super(bits);
+  }
+
+  /**
+   * Returns a mutable BitDocSet that is a copy of the provided BitDocSet.
+   *
+   * @param bitDocSet a BitDocSet
+   * @return copy of bitDocSet that is now mutable
+   */
+  public static MutableBitDocSet fromBitDocSet(BitDocSet bitDocSet) {
+    return new MutableBitDocSet(bitDocSet.getFixedBitSetClone());

Review Comment:
   It will often (indeed, usually?) be the case that we have `size` already 
calculated on the input `bitDocSet`. It could be worth passing this along using 
the `BitDocSet(FixedBitSet bits, int size)` ctor? This would only have a 
practical benefit in the case where you have one positive filter and a bunch of 
small negative filters that don't actually modify the set (in conjunction with 
the suggestion to make `andNot(FixedBitSet bits, DocSet other)` return 
boolean). But it costs nothing to preserve this information if it already 
exists.



##########
solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java:
##########
@@ -1190,38 +1192,52 @@ public ProcessedFilter getProcessedFilter(DocSet 
setFilter, List<Query> queries)
       }
 
       Query posQuery = QueryUtils.getAbs(q);
-      sets[end] = getPositiveDocSet(posQuery);
+      DocSet docSet = getPositiveDocSet(posQuery);
       // Negative query if absolute value different from original
       if (Objects.equals(q, posQuery)) {
-        neg[end] = false;
-        // keep track of the smallest positive set.
-        // This optimization is only worth it if size() is cached, which it 
would
-        // be if we don't do any set operations.
-        int sz = sets[end].size();
-        if (sz < smallestCount) {
-          smallestCount = sz;
-          smallestIndex = end;
-          answer = sets[end];
+        // keep track of the smallest positive set; use "answer" for this.
+        if (answer == null) {
+          answer = docSet;
+          continue;
         }
+        // note: assume that size() is cached.  It generally comes from the 
cache, so should be.
+        if (docSet.size() < answer.size()) {
+          // swap answer & docSet so that answer is smallest
+          DocSet tmp = answer;
+          answer = docSet;
+          docSet = tmp;
+        }
+        neg[end] = false;
       } else {
         neg[end] = true;
       }
+      sets[end++] = docSet;
+    } // end of queries
 
-      end++;
-    }
+    if (end > 0) {
+      // Are all of our normal cached filters negative?
+      if (answer == null) {
+        answer = getLiveDocSet();
+      }
 
-    // Are all of our normal cached filters negative?
-    if (end > 0 && answer == null) {
-      answer = getLiveDocSet();
-    }
+      // This optimizes for the case where we have more than 2 filters and 
instead
+      // of copying the bitsets we make one mutable bitset. We should only do 
this
+      // for BitDocSet since it clones the backing bitset for andNot and 
intersection.
+      if (end > 1 && answer instanceof BitDocSet) {
+        answer = MutableBitDocSet.fromBitDocSet((BitDocSet) answer);
+      }
 
-    // do negative queries first to shrink set size
-    for (int i = 0; i < end; i++) {
-      if (neg[i]) answer = answer.andNot(sets[i]);
-    }
+      // do negative queries first to shrink set size
+      for (int i = 0; i < end; i++) {
+        if (neg[i]) answer = answer.andNot(sets[i]);
+      }
+
+      for (int i = 0; i < end; i++) {
+        if (!neg[i]) answer = answer.intersection(sets[i]);
+      }
 
-    for (int i = 0; i < end; i++) {
-      if (!neg[i] && i != smallestIndex) answer = answer.intersection(sets[i]);
+      // Make sure to keep answer as an immutable DocSet if we made it mutable
+      answer = MutableBitDocSet.unwrapIfMutable(answer);
     }
 
     // ignore "answer" if it simply matches all docs

Review Comment:
   Out of scope of this PR, but while looking at this code: will 
`answer.size()` always be called downstream of this? If so, then it's fine to 
force size calculation here; if _not_, I think there should be better ways to 
check this condition than naively calling `answer.size()` after all 
intersections are done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [solr] magibney commented on a diff in pull request #1184: SOLR-16555: SolrIndexSearcher - FilterCache intersections/andNot should not clone bitsets repeatedly

Reply via email to