"What's the point of using a sorted interval list for a category?"

Just terminology first to avoid misunderstanding :), category is "category 
field" that can take N valus

Now, the case I am facing goes as follows:

I have category field in 50Mio collection which has more or less uniform 
distribution of values, 
imagine 100 values (terms) each with 0.5Mio documents. when I sort index on it, 
bit vectors look like
00000000......1111111111....00000000

So, if I cache this in any other representation (Filter usage, no scoring 
contribution as typical for categories) this adds up in memory rather quickly. 
With interval list for Filter/Matcher I need 2 Ints to store cache entry per 
value/term and have really fast next()-SkipTo(). In total 200 Ints for sorted 
category field with 100 values....

Not an unusual case I guess, a lot of people can sort their indexes, I guess at 
least.

Of course, it is a bit simplified, as we should not forget dynamic nature of 
the index. In real life, you can afford to sort your index once per month or 
once per week, but small percentage of updates do not introduce significant 
problems for interval lists.

Does this make sense to you, or I misread your aproach?


What woud be sexy to do, is to use such representation for postings storage, so 
each postings list would have type field that determines optimal bit vector 
representation for the term (high level, did not look into it yet in details). 
So instead of VInts, we could have interval lists for such cases. This would be 
 equivalent to RLE  encoding... which can be done fast I guess. This field type 
could be determined in some kind of *oprional* post-optimize proces that scanns 
postings and tries to find optimal "compresion", only VInt, VInt with rle for 
"predominantly sorted" postings....  ANd this is not 
limited to rle only, I could imagine some smart people to come up with even 
smarter things...


It looks like good idea to me, but I would not be surprised if somebody proves 
that I missed the point totaly :)



 







---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to