If you drive this example further, in combination with flex indexing 
permitting a per-term postings format, I could imagine some nice tools for 
optimizeHard(): normal index construction works with the defaults, as planned 
for the solid mixed-performance case, and at the end you run optimizeHard(), 
where postings get re-sorted on such fields (basically enabling RLE encoding to 
work) and at the same time all other terms get their optimal postings encoding 
format. Perfect for read-only indexes where you want to maximize performance 
and reduce index size.
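
To make the interval idea concrete, here is a minimal sketch in Java of a doc-id set stored as runs of consecutive ids. It is only an illustration, not Lucene code: the class IntervalDocIdSet and all of its methods are hypothetical names, and the byte counts assume two plain 4-byte ints per run and one bit per document for a bit set.

```java
import java.util.Arrays;

/** Hypothetical sketch: doc ids stored as sorted, non-overlapping [start, end] runs. */
final class IntervalDocIdSet {
    private final int[] starts; // first doc id of each run
    private final int[] ends;   // last doc id of each run (inclusive)

    private IntervalDocIdSet(int[] starts, int[] ends) {
        this.starts = starts;
        this.ends = ends;
    }

    /** Builds runs from strictly increasing doc ids, as postings deliver them. */
    static IntervalDocIdSet fromSortedDocIds(int[] docIds) {
        int[] s = new int[docIds.length];
        int[] e = new int[docIds.length];
        int n = 0;
        for (int doc : docIds) {
            if (n > 0 && doc == e[n - 1] + 1) {
                e[n - 1] = doc; // extend the current run
            } else {
                s[n] = doc;     // open a new run
                e[n] = doc;
                n++;
            }
        }
        return new IntervalDocIdSet(Arrays.copyOf(s, n), Arrays.copyOf(e, n));
    }

    /** True if doc falls inside one of the runs (binary search over run starts). */
    boolean contains(int doc) {
        int i = Arrays.binarySearch(starts, doc);
        if (i >= 0) {
            return true;            // doc is exactly the start of a run
        }
        int insertion = -i - 1;     // index of the first run starting after doc
        return insertion > 0 && doc <= ends[insertion - 1];
    }

    int runCount() {
        return starts.length;
    }

    /** Assumed cost: two 4-byte ints per run. */
    long sizeInBytes() {
        return 8L * starts.length;
    }

    /** Assumed cost of a plain bit set over maxDoc documents, rounded up to 64-bit words. */
    static long bitSetBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64) * 8;
    }
}
```

Under these assumptions a single run covering 40 million docs costs 8 bytes, while a bit set over 100 million docs costs about 12.5 MB regardless of how the set bits are distributed. The gain shrinks as updates fragment the runs, which is why a periodic rebuild or re-sort matters here.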


>
>From: eks dev <eks...@yahoo.co.uk>
>To: java-dev@lucene.apache.org
>Sent: Tuesday, 6 October, 2009 23:59:12
>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>
>
>Paul,
>the example I used to make my point was extreme, but realistic. 
>Imagine 100 million docs, sorted on field user_rights; a term user_rights:XX 
>selects 40 million of them (user rights...). To encode this, you need a format 
>with just two integers (for more such intervals you would need slightly more, 
>but nevertheless much less than for OpenBitSet, VInts, PFor...). Strictly 
>speaking this term is dense, but highly compressible, and could be inlined with 
>the pulsing trick...
>
>cheers, eks  
>
>
>
>
>>
>>From: Paul Elschot <paul.elsc...@xs4all.nl>
>>To: java-dev@lucene.apache.org
>>Sent: Tuesday, 6 October, 2009 23:33:03
>>Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
>>
>>Eks,
>>
>>
>>> 
>>>>>     [ 
>>>>> https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742
>>>>>  ] 
>>>>> 
>>>>> Eks Dev commented on LUCENE-1410:
>>>>> ---------------------------------
>>>>> 
>>>>> Mike, 
>>>>> That is definitely the way to go, distribution dependent encoding, where 
>>>>> every Term gets individual treatment.
>>>>> 
>>>>> Take, for example, a simple but not all that rare case where the index gets 
>>>>> sorted on some of the indexed fields (we use it really extensively, e.g. 
>>>>> a presorted doc collection on user_rights/zip/city, all indexed). There you 
>>>>> get perfectly "compressible" postings by simply managing intervals of 
>>>>> set bits. Updates distort this picture, but we rebuild the index periodically 
>>>>> and all is good again. At the moment we load them into RAM as Filters 
>>>>> in IntervalSets. If that were possible in Lucene, we wouldn't bother 
>>>>> with Filters (VInt decoding on such super dense fields was killing us, 
>>>>> even in RAMDirectory) ... 
>>
>>
>>You could try switching the Filter to OpenBitSet when that takes fewer bytes 
>>than SortedVIntList.
>>
>>
>>Regards,
>>Paul Elschot
>>
>>
>>
>


      
