[ 
https://issues.apache.org/jira/browse/LUCENE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-6024:
---------------------------------
    Attachment: LUCENE-6024.patch

Here is a patch
 - BitSet.and and BitSet.andNot have been improved to perform a leap-frog 
intersection of the two bit sets. I think this is a better default impl as it 
performs faster if any of the 2 sets is sparse.
 - SparseFixedBitSet.or adds specialization for two common cases: union with a 
rather dense set (used in practice when the cost is greater than maxDoc / 4096) 
and union with another SparseFixedBitSet
 - SparseFixedBitSet.and adds a minor specialization for the union with another 
SparseFixedBitSet
 - it also fixes a bug that SparseFixedBitSet.clear didn't update the count of 
non-zero longs (which is used to compute the approximate cardinality).

I also changed a bit the API/impl to:
 - not exhaust the iterator eg. in the FixedBitSet specialization. Not all 
these bulk methods require to completely exhaust the iterator (eg. 
intersection), so I rather documented that the state of the iterator is 
undefined after these methods have been called
 - require an unpositionned iterator: are there really use-cases for 
intersection/union with an iteration that was already half consumed? Adding 
this additional requirement makes things a bit simpler since you don't need to 
check if the current doc is -1 for certain specialization or if the iterator is 
not already on NO_MORE_DOCS

The main benefits are for SparseFixedBitSet's build time from another 
DocIdSetIterator since it uses BitSet.or. Here are the numbers reported by 
DocIdSetBenchmark (can be found in luceneutil) which measures how many 
instances can be built in one second based on the density of the set (maxDoc is 
hard-coded to 2^24). For reference, the SparseFixedBitSet is built from a 
RoaringDocIdSet (since it is our fastest DocIdSet iteration-wise).

|| Set density || Without the patch || With the patch ||
| 10e-5 | 174335 | 162070 |
| 10e-4 | 28253 | 26357 |
| 10e-3 | 2569 | 4148 |
| 0.01 | 303 | 520 |
| 0.1 | 39 | 56 |
| 0.5 | 10 | 13 |
| 0.9 | 7 | 9 |
| 0.99 | 7 | 9 |
| 1 | 7 | 9 |

> Improve oal.util.BitSet's bulk and/or/and_not
> ---------------------------------------------
>
>                 Key: LUCENE-6024
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6024
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-6024.patch
>
>
> LUCENE-6021 introduced oal.util.BitSet with default impls taken from 
> FixedBitSet. However, these default impls could be more efficient (and eg. 
> perform an actual leap frog for AND and AND_NOT).
> Additionally, SparseFixedBitSet could benefit from some specialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to