[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873860#comment-16873860
 ] 

Ignacio Vera commented on LUCENE-8885:
--------------------------------------

I have opened a PR with [~jpountz] suggestion. 

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-8885
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8885
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>      /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>      * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>      * documents associated with this point are visited */
>     boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>      *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>      *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>      *  values are visited in increasing order, and in the case of ties, in 
> increasing
>      *  docID order. */
>     default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>       for ( int i =offset; i < offset + numberDocs; i++) {
>         visit(docID[i], packedValue);
>       }
>     }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to