[jira] [Comment Edited] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

Feng Guo (Jira) Tue, 16 Nov 2021 05:41:34 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444527#comment-17444527
 ]


Feng Guo edited comment on LUCENE-10233 at 11/16/21, 1:40 PM:
--------------------------------------------------------------

[~jpountz] Sadly, I found that the implementation of {{SparseFixedBitSet}} is 
much more complicated than I thought, and it seems that the {{or}} operation 
between a {{FixedBitSet}} and a {{SparseFixedBitSet}} cannot simply be done 
with {{|}}. I need to spend some more time working out how to implement this 
algorithm, and I'm not sure it can be as efficient as before.

Given these new challenges, I made some new 
[changes|https://github.com/apache/lucene/pull/438/commits/292e4ed43119832d626506336ed61152f733a431]
 to my original approach: in the original method, I now add extra 0 words to 
the bitset. I also created a new expert interface to get the bitset regardless 
of the docBase. Generally, users can simply use the old interface to get a 
bitset, because the content represented by that bitset is consistent with the 
idSetIterator. Does this approach address your concerns?
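
To make the padding idea concrete, here is a minimal sketch in plain Java using raw {{long[]}} words. It is not the code from the PR: the class and method names are hypothetical, and it assumes docBase is a multiple of 64 so that padding only needs whole zero words.

```java
/** Hypothetical sketch; none of these names come from Lucene or the PR. */
final class DocBasePaddingSketch {

  /** Packs doc ids into 64-bit words where bit 0 of word 0 stands for docBase. */
  static long[] relativeWords(int[] docIds, int docBase) {
    int maxRelative = 0;
    for (int docId : docIds) {
      maxRelative = Math.max(maxRelative, docId - docBase);
    }
    long[] words = new long[(maxRelative >> 6) + 1];
    for (int docId : docIds) {
      int relative = docId - docBase;
      // Java uses only the low 6 bits of the shift distance for long shifts.
      words[relative >> 6] |= 1L << relative;
    }
    return words;
  }

  /** Prepends docBase / 64 zero words so that bit i stands for doc id i. */
  static long[] padToAbsolute(long[] relative, int docBase) {
    if ((docBase & 63) != 0) {
      // Assumption of this sketch: docBase is word-aligned.
      throw new IllegalArgumentException("docBase must be a multiple of 64");
    }
    long[] padded = new long[(docBase >> 6) + relative.length];
    System.arraycopy(relative, 0, padded, docBase >> 6, relative.length);
    return padded;
  }

  /** ORs source into dest word by word; dest must be at least as long as source. */
  static void orInto(long[] dest, long[] source) {
    for (int i = 0; i < source.length; i++) {
      dest[i] |= source[i];
    }
  }
}
```

With the padded words, a caller can OR them into a result indexed by doc id without knowing the docBase; the unpadded words correspond to the "regardless of the docBase" view, where the caller has to apply the offset itself.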


was (Author: gf2121):
[~jpountz] Sadly, I found that the implementation of {{SparseFixedBitSet}} is 
much more complicated than I thought, and it seems that the {{or}} operation 
between a {{FixedBitSet}} and a {{SparseFixedBitSet}} cannot simply be done 
with {{|}}. I need to spend some more time working out how to implement this 
algorithm, and I'm not sure it can be as efficient as before.

Given these new challenges, I made some new 
[changes|https://github.com/apache/lucene/pull/438/commits/292e4ed43119832d626506336ed61152f733a431]
 to my original approach: I took docBase into account in the original method 
by adding extra 0 words for it. I also created a new expert interface to get 
the bitset regardless of the docBase. Generally, users can simply use the old 
interface to get a bitset, because the content represented by that bitset is 
consistent with the idSetIterator. Does this approach address your concerns?

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10233
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low-cardinality point cases, id blocks will usually store doc ids that 
> all have the same point value, and {{intersect}} will take the {{addAll}} 
> path. If we store the ids as a bitset and give the IntersectVisitor a bulk 
> visiting ability, we can speed up addAll, because we can just execute the 
> 'or' logic between the result and the block ids.
> The optimization is triggered only when all of the following conditions are 
> met:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc ids
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.
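
The three conditions and the bulk 'or' can be sketched as follows. This is only an illustration with {{java.util.BitSet}}, not the Lucene implementation: the class and method names are hypothetical, and the sketch assumes the doc ids of a leaf block arrive sorted in ascending order.

```java
import java.util.BitSet;

/** Hypothetical sketch; not the Lucene implementation. */
final class LeafBitSetSketch {

  /** Checks the three conditions from the description on a sorted, non-empty id block. */
  static boolean canStoreAsBitSet(int[] sortedDocIds, int leafCardinality) {
    // Condition 1: every point in the leaf has the same value.
    if (leafCardinality != 1 || sortedDocIds.length == 0) {
      return false;
    }
    // Condition 3: no duplicates; with sorted input this means strictly increasing.
    for (int i = 1; i < sortedDocIds.length; i++) {
      if (sortedDocIds[i] <= sortedDocIds[i - 1]) {
        return false;
      }
    }
    // Condition 2: max(docId) - min(docId) <= 16 * pointCount.
    long range = (long) sortedDocIds[sortedDocIds.length - 1] - sortedDocIds[0];
    return range <= 16L * sortedDocIds.length;
  }

  /** Encodes the block as a bitset indexed by doc id. */
  static BitSet toBitSet(int[] docIds) {
    BitSet bits = new BitSet();
    for (int docId : docIds) {
      bits.set(docId);
    }
    return bits;
  }

  /** Bulk addAll: one word-wise OR instead of visiting each doc id. */
  static void addAll(BitSet result, BitSet blockIds) {
    result.or(blockIds);
  }
}
```

A caller would test {{canStoreAsBitSet}} when writing a leaf and fall back to the regular doc id encoding whenever it returns false.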



