It is faster than BitSet, even against Mustang. The numbers are a bit less than on Yonik’s HW, but quite convincing. I did small test on my XP Notebook (Pentium M 1.6GHz). Only “union” test is some 20% slower on 8Mio size with 80k bits set. I did not dig deeper. As much as it is worth, my proposal would be to hide all usages of BitSet (there is also one BitVector around?) inside Lucene behind appropriate interfaces in order to leverage pearls like this one. Actually it should not be too complicated, the only ugly back compatibility issue I could identify is in Filter, but for this Yonik, Hoss and Paul already made rather acceptable extend/deprecate plans. Maybe separate package for various BitSet / IntegerSet implementation would not be such a bad idea as there is no single best implementation? Just let me remind on what we have around: BitSet (OpenBitSet as faster variant of the more or less the same) HashIntegerSet IntArraySotredIntList VIntSortedIntList TreeBasedIntSet (I am trying to make my own implementation inspired by http://www.iis.uni-stuttgart.de/intset/ ) More? I could easily imagine to put all these together behind simple interfaces and to provide some Factories and a few utility methods with a bit of intelligence (for example, CheekyBitSet that collects document ids in some small int[] and if it fits, keep them there, if not switch automatically to some compact representation)… Implementations above cover practically all important cases for caching/Filtering and bring some extra speed/memory for other uses: OpenBitSet (referent, “general workhorse”): Fast random access, does not grow, high memory demands for sparse BitsSets, reasonably fast Iterator over set bits HashIntegerSet: Slower random access, very fast Iterator possible, low memory demands on very sparse bitsets. IntArraySotredIntList: Slow random access, extremely fast Iterator over set bits, low memory demands on sparse bit sets, fast populating if adding elements that are sorted… VIntSortedIntList: Almost the same as IntArraySotredIntList, but random access almost not possible, but Memory demands very low, very useful for moderately sparse bit sets. TreeBasedIntSet: The best general performance for sparse bit sets, grows dynamically, exploits runs of clear/set bits to reduce memory demands. Extremly fast Iterator over set bits for low to medium sparse bit sets, acceptable random access. Considering Zipf distribution curse in almost all cases I have ever seen (except for “pure” filtering fields like language, categories…), having new BitSet(MAX_DOCUMENT) looks like pure luxury (e.g. in ChainedFilter). As you all know Big majority of bit vectors usually met in practice are very to modestly sparse due to Zipf… Did I say it before, this code Yonik wrote is way above my head, I just tried to summarize a few thoughts from “library user” perspective as it seams to me that caching and this bit twiddling is next major performance/scalability milestone for Lucene.
----- Original Message ---- From: Yonik Seeley <[EMAIL PROTECTED]> To: solr-dev@lucene.apache.org Cc: java-dev@lucene.apache.org Sent: Friday, 12 May, 2006 10:29:24 PM Subject: Re: OpenBitSet Code is here for those interested: http://issues.apache.org/jira/browse/SOLR-15 -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]