[jira] [Commented] (LUCENE-8885) Optimise BKD reader by exploiting cardinality information stored on leaves

2019-06-30 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875943#comment-16875943
 ] 

ASF subversion and git services commented on LUCENE-8885:
-

Commit f5991d96bab4297b2c975dc2fb1a8124b91f34b6 in lucene-solr's branch 
refs/heads/branch_8x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f5991d9 ]

LUCENE-8885: Optimise BKD reader by exploiting cardinality information stored 
on leaves (#746)

The commit adds the method InstersectVisitor#visit(DocIdSetIterator, byte[]).

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>  /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>  * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>  * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>  *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>  *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>  *  values are visited in increasing order, and in the case of ties, in 
> increasing
>  *  docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>   for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
>   }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8885) Optimise BKD reader by exploiting cardinality information stored on leaves

2019-06-30 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875936#comment-16875936
 ] 

ASF subversion and git services commented on LUCENE-8885:
-

Commit db68634c67731d908d69d44232730ed92424e828 in lucene-solr's branch 
refs/heads/master from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=db68634 ]

LUCENE-8885: Optimise BKD reader by exploiting cardinality information stored 
on leaves (#746)

The commit adds the method InstersectVisitor#visit(DocIdSetIterator, byte[]).

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>  /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>  * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>  * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>  *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>  *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>  *  values are visited in increasing order, and in the case of ties, in 
> increasing
>  *  docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>   for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
>   }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8885) Optimise BKD reader by exploiting cardinality information stored on leaves

2019-06-26 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873860#comment-16873860
 ] 

Ignacio Vera commented on LUCENE-8885:
--

I have opened a PR with [~jpountz] suggestion. 

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>  /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>  * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>  * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>  *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>  *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>  *  values are visited in increasing order, and in the case of ties, in 
> increasing
>  *  docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>   for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
>   }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8885) Optimise BKD reader by exploiting cardinality information stored on leaves

2019-06-26 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873395#comment-16873395
 ] 

Ignacio Vera commented on LUCENE-8885:
--

With 1 you would have to track the last visited point and associate that with 
the corresponding call to visit(int docID) which is not nice and relies on the 
behaviour of the reader.

2 is more practical and ;less intrusive and using a DocIdSetIterator will make 
the API cleaner ++

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>  /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>  * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>  * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>  *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>  *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>  *  values are visited in increasing order, and in the case of ties, in 
> increasing
>  *  docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>   for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
>   }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8885) Optimise BKD reader by exploiting cardinality information stored on leaves

2019-06-26 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873323#comment-16873323
 ] 

Adrien Grand commented on LUCENE-8885:
--

I don't think backward compatibility should be a concern for such low-level 
APIs. I like 1 but then we wouldn't have an API that we could use for merges 
anymore since we wouldn't have a reliable way to know which byte[] maps to 
which docID(s)? Maybe 2 is more practical, and we could make it look a bit 
nicer by replacing the int[] with a DocIdSetIterator, ie. {{void 
visit(DocIdSetIterator docs, byte[] packedValue) throws IOException}}?

> Optimise BKD reader by exploiting cardinality information stored on leaves
> --
>
> Key: LUCENE-8885
> URL: https://issues.apache.org/jira/browse/LUCENE-8885
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>
> In LUCENE-8688 it was introduce a new storing strategy for leaves contains 
> duplicated points. In such case the points are stored together with the 
> cardinality. We still call the IntersectVisitor once per document therefore 
> we are checking many times the same point agains the query. The idea is to 
> check the point once and then add all the documents.
> The API of the IntersectVisitor does not allow that, and therefore to exploit 
> that property we need to either change the API or extend it. Here are the 
> possibilities I can think of:
> 1) Modify the API by replacing the method IntersectVisitor#visit(byte[], int) 
> by the following method:
> {code:java}
>  /** Called for leaf cells that intersects the leaf to test if the point  
>  matches to the query
>  * In case it matches, the implementor must call {@link 
> IntersectVisitor#visit(int)} with the
>  * documents associated with this point are visited */
> boolean matches(byte[] packedValue) throws IOException;
> {code}
> This will allow the BKD reader to check if a point matches the query and if 
> true then Coll the method IntersectVisitor#visit(int) for all documents 
> associated with that point.
> The drawback of this approach is backwards compatibility and the need to 
> update all classes implement this interface.
> 2) Extends the API by adding a new default method in the IntersectVisitor 
> interface:
> {code:java}
>  /** Called for documents in a leaf cell that crosses the query.  The consumer
>  *  should scrutinize the packedValue to decide whether to accept it.  If 
> accepted it should
>  *  consider only the {@code numberDocs} documents starting at {@code 
> offset} In the 1D case,
>  *  values are visited in increasing order, and in the case of ties, in 
> increasing
>  *  docID order. */
> default void visit(int[] docID, int offset, int numberDocs, byte[] 
> packedValue) throws IOException {
>   for ( int i =offset; i < offset + numberDocs; i++) {
> visit(docID[i], packedValue);
>   }
> }
> {code}
> The merit of this approach is that is backwards compatible and it is up to 
> the implementors to override this method and get the benefits for this 
> optimisation.The biggest downside is that it assumes that the codec has doc 
> IDs available in an int[] slice as opposed to streaming them from disk 
> directly to the IntersectVisitor for instance as [~jpountz] noted.
> Maybe there are more options I did not think about so looking forward to 
> hearing opining if we should do this change at all and if so, how to approach 
> it. My +1 goes to 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org