On Tuesday 06 October 2009 23:59:12 eks dev wrote:
> Paul,
> the point I was trying to make with this example was extreme, but realistic.
> Imagine 100Mio docs, sorted on field user_rights, a term user_rights:XX
> selects 40Mio of them (user rights...). To encode this, you need format with
> two integers (for more of such intervals you would need slightly more, but
> nevertheless, much less than for OpenBitSet, VInts, PFor... ). Strictly
> speaking this term is dense, but highly compressible and could be inlined
> with pulsing trick...

Well, I've been considering adding compressed consecutive ranges to
SortedVIntList, but I did not get further than considering it.
This sounds like the perfect use case for that.

Regards,
Paul Elschot

>
> cheers, eks
>
> >From: Paul Elschot <paul.elsc...@xs4all.nl>
> >To: java-dev@lucene.apache.org
> >Sent: Tuesday, 6 October, 2009 23:33:03
> >Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
> >
> >Eks,
> >
> >>> [
> >>> https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742
> >>> ]
> >>>
> >>> Eks Dev commented on LUCENE-1410:
> >>> ---------------------------------
> >>>
> >>> Mike,
> >>> That is definitely the way to go, distribution dependent encoding, where
> >>> every Term gets individual treatment.
> >>>
> >>> Take for an example simple, but not all that rare case where Index gets
> >>> sorted on some of the indexed fields (we use it really extensively, e.g.
> >>> presorted doc collection on user_rights/zip/city, all indexed). There you
> >>> get perfectly "compressible" postings by simply managing intervals of
> >>> set bits. Updates distort this picture, but we rebuild index periodically
> >>> and all gets good again. At the moment we load them into RAM as Filters
> >>> in IntervalSets. if that would be possible in lucene, we wouldn't bother
> >>> with Filters (VInt decoding on such super dense fields was killing us,
> >>> even in RAMDirectory) ...
> >
> >You could try switching the Filter to OpenBitSet when that takes fewer bytes
> >than SortedVIntList.
> >
> >Regards,
> >Paul Elschot
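
The interval encoding eks describes (a dense run of doc IDs stored as just
two integers) can be sketched roughly as below. This is only an
illustration, not SortedVIntList or any existing Lucene class: the name
IntervalDocIdSet and all of its methods are hypothetical, and it assumes
strictly increasing doc IDs below Integer.MAX_VALUE.

```java
import java.util.Arrays;

/** Hypothetical sketch: doc IDs stored as sorted, disjoint [start, end) intervals. */
final class IntervalDocIdSet {
    // bounds[2*i] is the inclusive start of interval i,
    // bounds[2*i + 1] is its exclusive end; the array is strictly increasing.
    private final int[] bounds;

    private IntervalDocIdSet(int[] bounds) {
        this.bounds = bounds;
    }

    /** A single run of consecutive doc IDs, e.g. one term on a sorted index. */
    static IntervalDocIdSet ofRange(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range");
        }
        return new IntervalDocIdSet(start == end ? new int[0] : new int[] {start, end});
    }

    /** Builds the set from strictly increasing doc IDs, merging consecutive runs. */
    static IntervalDocIdSet fromSortedDocIds(int[] docIds) {
        int[] tmp = new int[2 * docIds.length]; // worst case: one interval per doc
        int n = 0;
        int i = 0;
        while (i < docIds.length) {
            int start = docIds[i];
            int end = start + 1;
            i++;
            // Extend the current interval while the next doc ID is adjacent.
            while (i < docIds.length && docIds[i] == end) {
                end++;
                i++;
            }
            tmp[n++] = start;
            tmp[n++] = end;
        }
        return new IntervalDocIdSet(Arrays.copyOf(tmp, n));
    }

    /** Membership test by binary search over the interval boundaries. */
    boolean contains(int docId) {
        int pos = Arrays.binarySearch(bounds, docId);
        if (pos >= 0) {
            // Exact hit: an even index is an inclusive start, an odd one an exclusive end.
            return (pos & 1) == 0;
        }
        // An odd insertion point means docId falls between a start and its end.
        int insertion = -pos - 1;
        return (insertion & 1) == 1;
    }

    int intervalCount() {
        return bounds.length / 2;
    }

    /** Number of ints needed to represent the set in this layout. */
    int storedInts() {
        return bounds.length;
    }

    public static void main(String[] args) {
        // 40Mio consecutive docs out of 100Mio, as in the example above.
        IntervalDocIdSet rights = IntervalDocIdSet.ofRange(30_000_000, 70_000_000);
        System.out.println("ints stored: " + rights.storedInts());
        System.out.println("contains 50000000: " + rights.contains(50_000_000));
    }
}
```

In this layout the 40Mio-doc term costs two ints, whereas a plain bit set
over 100Mio docs needs one bit per doc (about 12.5 MB) regardless of how the
set bits are distributed. Updates that fragment the runs would increase the
interval count, which matches the remark about rebuilding the index
periodically.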