[
https://issues.apache.org/jira/browse/LUCENE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6024:
---------------------------------
Attachment: LUCENE-6024.patch
Here is a patch
- BitSet.and and BitSet.andNot have been improved to perform a leap-frog
intersection of the two bit sets. I think this is a better default impl as it
performs faster if any of the 2 sets is sparse.
- SparseFixedBitSet.or adds specialization for two common cases: union with a
rather dense set (used in practice when the cost is greater than maxDoc / 4096)
and union with another SparseFixedBitSet
- SparseFixedBitSet.and adds a minor specialization for the union with another
SparseFixedBitSet
- it also fixes a bug that SparseFixedBitSet.clear didn't update the count of
non-zero longs (which is used to compute the approximate cardinality).
I also changed a bit the API/impl to:
- not exhaust the iterator eg. in the FixedBitSet specialization. Not all
these bulk methods require to completely exhaust the iterator (eg.
intersection), so I rather documented that the state of the iterator is
undefined after these methods have been called
- require an unpositionned iterator: are there really use-cases for
intersection/union with an iteration that was already half consumed? Adding
this additional requirement makes things a bit simpler since you don't need to
check if the current doc is -1 for certain specialization or if the iterator is
not already on NO_MORE_DOCS
The main benefits are for SparseFixedBitSet's build time from another
DocIdSetIterator since it uses BitSet.or. Here are the numbers reported by
DocIdSetBenchmark (can be found in luceneutil) which measures how many
instances can be built in one second based on the density of the set (maxDoc is
hard-coded to 2^24). For reference, the SparseFixedBitSet is built from a
RoaringDocIdSet (since it is our fastest DocIdSet iteration-wise).
|| Set density || Without the patch || With the patch ||
| 10e-5 | 174335 | 162070 |
| 10e-4 | 28253 | 26357 |
| 10e-3 | 2569 | 4148 |
| 0.01 | 303 | 520 |
| 0.1 | 39 | 56 |
| 0.5 | 10 | 13 |
| 0.9 | 7 | 9 |
| 0.99 | 7 | 9 |
| 1 | 7 | 9 |
> Improve oal.util.BitSet's bulk and/or/and_not
> ---------------------------------------------
>
> Key: LUCENE-6024
> URL: https://issues.apache.org/jira/browse/LUCENE-6024
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-6024.patch
>
>
> LUCENE-6021 introduced oal.util.BitSet with default impls taken from
> FixedBitSet. However, these default impls could be more efficient (and eg.
> perform an actual leap frog for AND and AND_NOT).
> Additionally, SparseFixedBitSet could benefit from some specialization.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]