Ignacio Vera created LUCENE-8885:
------------------------------------

             Summary: Optimise BKD reader by exploiting cardinality information 
stored on leaves
                 Key: LUCENE-8885
                 URL: https://issues.apache.org/jira/browse/LUCENE-8885
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Ignacio Vera


In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
duplicated points. In such case the points are stored together with the 
cardinality. We still call the IntersectVisitor once per document therefore we 
are checking many times the same point agains the query. The idea is to check 
the point once and then add all the documents.

The API of the IntersectVisitor does not allow that, and therefore to exploit 
that property we need to either change the API or extend it. Here are the 
possibilities I can think of:

1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
by the following method:
{code:java}
     /** Called for leaf cells that intersects the leaf to test if the point   
matches to the query
     * In case it matches, the implementor must call {@link 
IntersectVisitor#visit(int)} with the
     * documents associated with this point are visited */
    boolean matches(byte[] packedValue) throws IOException;
{code}
This will allow the BKD reader to check if a point matches the query and if 
true then Coll the method IntersectVisitor#visit(int) for all documents 
associated with that point.
The drawback of this approach is backwards compatibility and the need to update 
all classes implement this interface.


2) Extends the API by adding a new default method in the IntersectVisitor 
interface:
{code:java}
 /** Called for documents in a leaf cell that crosses the query.  The consumer
     *  should scrutinize the packedValue to decide whether to accept it.  If 
accepted it should
     *  consider only the {@code numberDocs} documents starting at {@code 
offset} In the 1D case,
     *  values are visited in increasing order, and in the case of ties, in 
increasing
     *  docID order. */
    default void visit(int[] docID, int offset, int numberDocs, byte[] 
packedValue) throws IOException {
      for ( int i =offset; i < offset + numberDocs; i++) {
        visit(docID[i], packedValue);
      }
    }
{code}

The merit of this approach is that is backwards compatible and it is up to the 
implementors to override this method and get the benefits for this 
optimisation.The biggest downside is that it assumes that the codec has doc IDs 
available in an int[] slice as opposed to streaming them from disk directly to 
the IntersectVisitor for instance as [~jpountz] noted.


Maybe there are more options I did not think about so looking forward to 
hearing opining if we should do this change at all and if so, how to approach 
it. My +1 goes to 1).




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to