Yeah, good hint. We actually made such measurements on our TreeIntegerSet 
implementation, and the results are astonishing (I remember 6 MB against 2 KB 
of memory consumption for "predominantly sorted bit vectors" like zip codes, 
and conjunction/disjunction an order of magnitude faster, as it walks a shallow 
tree in that case). If you have any possibility to sort your indexes, do so; 
I guess even Lucene's on-disk representation benefits from this (skips are 
faster, bit vectors on disk compress/decompress better?). 
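
For anybody who wants to try this on synthetic data first, here is a minimal sketch that builds a uniformly random bit set and a clustered one, using Random.nextGaussian() as suggested in the quoted message below. It uses plain java.util.BitSet as a stand-in (not TreeIntegerSet or OpenBitSet), and the class and method names are hypothetical, made up for illustration. Counting the non-empty 64-bit words is only a rough proxy for how well a set might compress, not a measurement of any real implementation.

```java
import java.util.BitSet;
import java.util.Random;

/** Generates synthetic bit sets for benchmarking; all names here are hypothetical. */
class ClusteredBits {

    /** Sets up to {@code count} bits uniformly at random in [0, size). */
    static BitSet uniform(int size, int count, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(size);
        for (int i = 0; i < count; i++) {
            bits.set(rnd.nextInt(size)); // duplicates simply hit an already set bit
        }
        return bits;
    }

    /** Sets up to {@code count} bits clustered around {@code center}. */
    static BitSet gaussian(int size, int count, double center, double spread, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(size);
        for (int i = 0; i < count; i++) {
            long pos = Math.round(center + rnd.nextGaussian() * spread);
            if (pos >= 0 && pos < size) { // drop samples that fall outside the set
                bits.set((int) pos);
            }
        }
        return bits;
    }

    /** Counts the 64-bit words that contain at least one set bit. */
    static int nonEmptyWords(BitSet bits) {
        int words = 0;
        for (long w : bits.toLongArray()) {
            if (w != 0) {
                words++;
            }
        }
        return words;
    }
}
```

Calling ClusteredBits.gaussian(1000000, 10000, 500000, 2000, 42L) and ClusteredBits.uniform(1000000, 10000, 42L) gives two sets with the same number of attempted bits but very different layouts, which can then be fed to whatever set implementation is under test.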
 
We even made a small visualizer for bit vectors that generates an image of 
the HitCollector results for any specified query (a gray image where every 
pixel represents 8-32 successive bits from the bit vector; higher density => 
darker color). I like to see the enemy first.  
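
To give an idea of what such a visualizer does, here is a rough sketch, not our actual tool. It assumes the hits have already been collected into a java.util.BitSet (for example by a HitCollector whose collect(int doc, float score) sets bit doc); the class name, method name and parameters are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.BitSet;

/** Renders bit density as a gray image; the names here are hypothetical. */
class BitDensityImage {

    /**
     * Each pixel covers bitsPerPixel successive bits of the first size bits.
     * The more bits are set in that range, the darker the pixel.
     */
    static BufferedImage render(BitSet bits, int size, int bitsPerPixel, int width) {
        int pixels = (size + bitsPerPixel - 1) / bitsPerPixel;  // round up
        int height = Math.max(1, (pixels + width - 1) / width); // rows needed
        BufferedImage img =
            new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        for (int p = 0; p < width * height; p++) {
            int gray = 255; // padding pixels past the last bit stay white
            if (p < pixels) {
                int from = p * bitsPerPixel;
                int to = Math.min(size, from + bitsPerPixel);
                int set = bits.get(from, to).cardinality();
                gray = 255 - (255 * set) / (to - from); // 0 = all bits set
            }
            // Write the gray level directly into the single raster band.
            img.getRaster().setSample(p % width, p / width, 0, gray);
        }
        return img;
    }
}
```

The result can be written out with javax.imageio.ImageIO.write(img, "png", new File("hits.png")). Picking bitsPerPixel between 8 and 32 keeps the image small enough to look at for a few million documents.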
 
While we are already in this area, just a curiosity: a friend of mine has one 
head-spinning idea, to utilize graphics card hardware for super fast bit vector 
operations. These things are really optimized for basic bit operations 
today. I am just curious to see what he comes up with. 
 
I hope I will have some time next week or so to polish some tests for 
OpenBitSet a bit and drop them somewhere on Jira, in case anybody is interested 
in playing with them.

A bit off topic: is anybody working on a ChainedFilter version that uses 
docNrSkipper? As I recall, you wrote the BitSet version :)
 
----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; eks dev <[EMAIL PROTECTED]>
Sent: Tuesday, 16 May, 2006 8:13:53 PM
Subject: Re: OpenBitSet


: I also measured at different densities, and it looks about the same.
: When I find a few spare minutes I will make a PerfTest that generates
: gnuplot diagrams. It would be interesting to see how all key methods behave
: as a function of density/size.

I was thinking the same thing ... i just haven't had time to play with it.

It might also be useful to check how the distribution of the set bits
affects things -- I suspect that for some "Filters" there is some amount of
clustering, as many people index their documents in a particular order and
then filter on ranges of that order (i.e. index documents as they are
created, and then filter on create date) ... using
Random.nextGaussian() to pick which bits to set might be interesting.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
