[jira] [Commented] (LUCENE-5081) Compress doc ID sets

Adrien Grand (JIRA) Wed, 10 Jul 2013 10:37:53 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704798#comment-13704798 ]


Adrien Grand commented on LUCENE-5081:
--------------------------------------

New patch:
 - renamed implementation to WAH8DocIdSet
 - added an index so that advance() runs in logarithmic time. It works pretty 
much like the old terms index impl: it stores the position and the doc ID 
encoded at every n-th sequence, then uses binary search to find an entry that 
is before the target and close to it,
 - even with the index, WAH8DocIdSet is never larger than FixedBitSet by more 
than 2% (even when the index interval is 8, which is the lowest accepted value 
in the current impl),
 - factored some code out of BitVector and OpenBitSetIterator into BitUtil.
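
To illustrate the indexing idea described above, here is a minimal sketch. It 
is not the code from the patch: the class name SampledDocIdIndex and its fields 
are hypothetical, and a plain sorted int[] stands in for the encoded sequences. 
It only shows the sampling-plus-binary-search pattern: binary search over the 
sampled entries finds a start position at or before the target, then a short 
linear scan finishes the job.

```java
/** Hypothetical sketch of a sampled index; not the WAH8DocIdSet implementation. */
final class SampledDocIdIndex {
    private final int[] docs;      // sorted doc IDs, standing in for the encoded sequences
    private final int[] indexDocs; // doc ID found at every interval-th position
    private final int[] indexPos;  // position in docs of each sampled doc ID

    SampledDocIdIndex(int[] sortedDocs, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.docs = sortedDocs.clone();
        int n = (docs.length + interval - 1) / interval;
        this.indexDocs = new int[n];
        this.indexPos = new int[n];
        for (int i = 0; i < n; i++) {
            indexPos[i] = i * interval;
            indexDocs[i] = docs[i * interval];
        }
    }

    /** Returns the smallest doc ID >= target, or Integer.MAX_VALUE if there is none. */
    int advance(int target) {
        // Binary search for the last sampled entry whose doc ID is <= target.
        int lo = 0, hi = indexDocs.length - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexDocs[mid] <= target) {
                start = indexPos[mid];
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan from the sampled position; the answer, if it exists,
        // is at most one interval away.
        for (int i = start; i < docs.length; i++) {
            if (docs[i] >= target) {
                return docs[i];
            }
        }
        return Integer.MAX_VALUE;
    }
}
```

A smaller interval means a shorter scan after the binary search but more index 
entries to store, which is the trade-off behind the index interval mentioned in 
the next item's size bound.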

I haven't wired this set implementation anywhere yet, but I think never being 
more than 2% larger than FixedBitSet and being able to advance in logarithmic 
time are nice properties, so I'm pretty sure some people will be interested in 
using it for their caches. I'm waiting for the other implementations to get 
in/improve (e.g. once EliasFanoDocIdSet has an index) before writing more 
detailed benchmarks that compare the speed and memory efficiency of the impls 
we have for our caches (Elias-Fano, WAH8, FixedBitSet so far, maybe something 
based on PFOR-delta soon too).

Please let me know if you would like to review this patch. Otherwise I will 
commit it soon.
                
> Compress doc ID sets
> --------------------
>
>                 Key: LUCENE-5081
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5081
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-5081.patch
>
>
> Our filters use bit sets a lot to store document IDs. However, it is likely 
> that most of them are sparse hence easily compressible. Having efficient 
> compressed sets would allow for caching more data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
