Paul,
the example I was trying to make my point with was extreme, but realistic.
Imagine 100 million docs, sorted on the field user_rights, where a term
user_rights:XX selects 40 million of them (user rights...). Because the index
is sorted on that field, the matching docs form one contiguous interval, so to
encode it you need a format with just two integers (for more such intervals
you would need slightly more, but still much less than for OpenBitSet, VInts,
PFor...). Strictly speaking this term is dense, but it is highly compressible
and could be inlined with the pulsing trick...
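
To make the arithmetic concrete, here is a minimal sketch in plain Java of the
interval idea. It is not Lucene code: the class IntervalPostings and all of its
methods are hypothetical names made up for illustration, and the position of
the 40 million matching docs inside the index is an assumption. The VInt figure
is only a lower bound (a VInt takes at least one byte per delta).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a sorted doc-id set stored as inclusive [start, end] intervals. */
public class IntervalPostings {
    private final List<int[]> intervals = new ArrayList<>();

    /** Appends an inclusive range; ranges must arrive in increasing, non-overlapping order. */
    public void addRange(int start, int endInclusive) {
        if (start > endInclusive) {
            throw new IllegalArgumentException("empty range");
        }
        if (!intervals.isEmpty()) {
            int[] last = intervals.get(intervals.size() - 1);
            if (start <= last[1]) {
                throw new IllegalArgumentException("ranges must be increasing");
            }
            if (start == last[1] + 1) { // adjacent: extend the previous interval
                last[1] = endInclusive;
                return;
            }
        }
        intervals.add(new int[] {start, endInclusive});
    }

    /** Appends a single doc id, extending the last interval when ids are consecutive. */
    public void add(int docId) {
        addRange(docId, docId);
    }

    /** Binary search over the sorted, non-overlapping intervals. */
    public boolean contains(int docId) {
        int lo = 0, hi = intervals.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int[] iv = intervals.get(mid);
            if (docId < iv[0]) {
                hi = mid - 1;
            } else if (docId > iv[1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    /** Two 4-byte integers per interval. */
    public long sizeInBytes() {
        return 8L * intervals.size();
    }

    /** One bit per document in the index, as a plain bit set would need. */
    public static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Lower bound for VInt deltas: at least one byte per matching doc. */
    public static long minVIntBytes(long matchingDocs) {
        return matchingDocs;
    }

    public static void main(String[] args) {
        IntervalPostings postings = new IntervalPostings();
        // Assumed layout: the 40 million matching docs sit in one contiguous run.
        postings.addRange(30_000_000, 69_999_999);
        System.out.println("intervals: " + postings.sizeInBytes() + " bytes");
        System.out.println("bit set:   " + bitSetBytes(100_000_000L) + " bytes");
        System.out.println("VInts:  >= " + minVIntBytes(40_000_000L) + " bytes");
    }
}
```

Under those assumptions the single interval costs 8 bytes, against 12.5 million
bytes for a bit set over 100 million docs and at least 40 million bytes for
VInt deltas. Updates that break up the run add one interval (8 bytes) each.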

cheers, eks  




>
>From: Paul Elschot <paul.elsc...@xs4all.nl>
>To: java-dev@lucene.apache.org
>Sent: Tuesday, 6 October, 2009 23:33:03
>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>
>Eks,
>
>
>> 
>>>     [ 
>>> https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742
>>>  ] 
>>> 
>>> Eks Dev commented on LUCENE-1410:
>>> ---------------------------------
>>> 
>>> Mike, 
>>> That is definitely the way to go, distribution dependent encoding, where 
>>> every Term gets individual treatment.
>>> 
>>> Take for an example simple, but not all that rare case where Index gets 
>>> sorted on some of the indexed fields (we use it really extensively, e.g. 
>>> presorted doc collection on user_rights/zip/city, all indexed). There you 
>>> get perfectly "compressible"  postings by simply managing intervals of set 
>>> bits. Updates distort this picture, but we rebuild index periodically and 
>>> all gets good again.  At the moment we load them into RAM as Filters in 
>>> IntervalSets. if that would be possible in lucene, we wouldn't bother with 
>>> Filters (VInt decoding on such super dense fields was killing us, even in 
>>> RAMDirectory) ... 
>
>
>You could try switching the Filter to OpenBitSet when that takes fewer bytes 
>than SortedVIntList.
>
>
>Regards,
>Paul Elschot
>
>
>


      
