It is faster than BitSet, even against Mustang. The numbers are a bit less 
than on Yonik’s HW, but quite convincing. I did small test on my XP Notebook 
(Pentium M 1.6GHz). Only “union” test is some 20% slower on 8Mio size with 80k 
bits set. I did not dig deeper. 
   
  As much as it is worth, my proposal would be to hide all usages of BitSet 
(there is also one BitVector around?) inside Lucene behind appropriate 
interfaces in order to leverage pearls like this one. Actually it should not be 
too complicated, the only ugly back compatibility issue I could identify is in 
Filter, but for this Yonik, Hoss and Paul already made rather acceptable 
extend/deprecate plans.
   
  Maybe separate package for various BitSet / IntegerSet implementation would 
not be such a bad idea as there is no single best implementation? Just let me 
remind on what we have around:
   
  BitSet (OpenBitSet as faster variant of the more or less the same)
  HashIntegerSet 
  IntArraySotredIntList
  VIntSortedIntList
  TreeBasedIntSet (I am trying to make my own implementation inspired by 
http://www.iis.uni-stuttgart.de/intset/ )
  More?
   
  I could easily imagine to put all these together behind simple interfaces and 
to provide some Factories and a few utility methods with a bit of intelligence 
(for example, CheekyBitSet that collects document ids in some small int[] and 
if it fits, keep them there, if not switch automatically to some compact 
representation)…
   
  Implementations above cover practically all important cases for 
caching/Filtering and bring some extra speed/memory for other uses: 
  OpenBitSet (referent, “general workhorse”): 
  Fast random access, does not grow, high memory demands for sparse BitsSets, 
reasonably fast Iterator over set bits
  HashIntegerSet:
       Slower random access, very fast Iterator possible, low memory demands on 
very sparse bitsets.
  IntArraySotredIntList:
       Slow random access, extremely fast Iterator over set bits, low memory 
demands on sparse bit sets, fast populating if adding elements that are sorted… 
  VIntSortedIntList:
    Almost the same as IntArraySotredIntList, but random access almost not 
possible, but Memory demands very low, very useful for moderately sparse bit 
sets. 
  TreeBasedIntSet:
     The best general performance for sparse bit sets, grows dynamically, 
exploits runs of clear/set bits to reduce memory demands. Extremly fast 
Iterator over set  bits  for low to medium sparse bit sets, acceptable random 
access. 
   
  Considering Zipf distribution curse in almost all cases I have ever seen 
(except for “pure” filtering fields like language, categories…), having new 
BitSet(MAX_DOCUMENT) looks like pure luxury (e.g. in ChainedFilter). As you all 
know Big majority of bit vectors usually met in practice are very to modestly 
sparse due to Zipf… 
   
  Did I say it before, this code Yonik wrote is way above my head, I just tried 
to summarize a few thoughts from “library user” perspective as it seams to me 
that caching and this bit twiddling is next major performance/scalability 
milestone for Lucene.  
  

----- Original Message ----
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-dev@lucene.apache.org
Cc: java-dev@lucene.apache.org
Sent: Friday, 12 May, 2006 10:29:24 PM
Subject: Re: OpenBitSet

Code is here for those interested:
http://issues.apache.org/jira/browse/SOLR-15

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to