I'm looking for some advice on data structures, which may result in 
moving/consolidating some existing ones used across Lucene/Solr.

I'm working on a Filter implementation in the spatial module that needs to 
collect a set of docids.  I could use new OpenBitSet(maxDoc) (or FixedBitSet) 
of course, but the algorithm I have is recursive, and at each level down it 
could potentially need to create another one of these.  That uses a lot of 
memory and isn't GC friendly, and it's also wasteful because very few 
documents are expected to match this filter.  So it appears that a hash-based 
int set would be the right data structure.  I'm pretty sure the docids are 
always looked up in sequential order, which suggests an implementation based 
on a sorted array of integers.  There's some cool code in Solr, 
SortedIntDocSet, that should mostly work well.  Remember that Lucene spatial 
is a Lucene module and only depends on lucene-core and the queries module.  
Well, it depends on Spatial4j too, but that's not relevant.
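To put rough numbers on the waste (the maxDoc and match-count figures below are 
invented purely for illustration): a FixedBitSet sized to maxDoc costs about 
maxDoc/8 bytes no matter how few docs match, and the recursion could allocate 
one per level, whereas a small int set only grows with the actual matches.

import org.apache.lucene.util.FixedBitSet;

public class DocIdSetCostSketch {
  public static void main(String[] args) {
    int maxDoc = 10_000_000;      // assumed index size, for illustration only
    int expectedMatches = 200;    // assumption: very few docs match the filter

    // One bitset spans the whole doc-id space: ~maxDoc/8 bytes (~1.25 MB here),
    // and a recursive algorithm could allocate one of these per level.
    FixedBitSet perLevel = new FixedBitSet(maxDoc);
    perLevel.set(42);
    long bitsetBytes = perLevel.getBits().length * 8L;

    // A hash or sorted int set only needs roughly 4 bytes per matching doc
    // (plus some load-factor slack), i.e. around a kilobyte here.
    long intSetBytes = expectedMatches * 4L;

    System.out.println("FixedBitSet per level: ~" + bitsetBytes + " bytes");
    System.out.println("int set of matches:    ~" + intSetBytes + " bytes");
  }
}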


  *   SentinelIntSet (lucene-core):  Looks nice, but there isn't a set 
iterator.  I need to merge these sets, so I need to iterate over the values 
(see the sketch after this list).  The underlying storage is a single int 
array.  By default the hash is the int itself, which I like since it's fast 
and I'm not concerned with collisions.
  *   IntHashSet (facet module): This implementation has the value iterator I 
need.  The underlying storage is three int arrays, making it considerably 
bulkier than SentinelIntSet's single array.  That's excessive, I think.  By 
default the hash appears to be "key & hashFactor", which should be fast.
  *   SortedIntDocSet (Solr):  Sophisticated algorithms on the sorted array 
that assume sequential id lookups.  Nice!  But it's intertwined with other 
Solr classes, and it carries more code weight than the other two classes, much 
of it for stuff I don't think I need.
  *   IntOpenHashSet (HPPC, from Carrot Search Labs):  Looks good.  Backed by 
one int array and one boolean array.  I like this implementation better than 
IntHashSet and SentinelIntSet.  However, it hashes the id with MurmurHash3, 
which I think is overkill, and the API doesn't offer a subclassing opportunity 
for me to substitute something else.
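Regarding the missing iterator on SentinelIntSet: if I'm reading the class 
right, its backing keys array and emptyVal are public, so merging can be done 
by scanning the array and skipping sentinel slots.  A minimal sketch (the 
addAll helper is mine, not part of the class; it assumes docids are 
non-negative so -1 is a safe emptyVal):

import org.apache.lucene.util.SentinelIntSet;

public class SentinelIntSetMergeSketch {
  /** Copies every value of {@code from} into {@code into} by scanning the
   *  backing array and skipping sentinel slots -- a stand-in for the
   *  missing value iterator. */
  static void addAll(SentinelIntSet from, SentinelIntSet into) {
    for (int slot = 0; slot < from.keys.length; slot++) {
      int v = from.keys[slot];
      if (v != from.emptyVal) {
        into.put(v);
      }
    }
  }

  public static void main(String[] args) {
    // Docids are >= 0, so -1 works as the sentinel / emptyVal.
    SentinelIntSet a = new SentinelIntSet(16, -1);
    SentinelIntSet b = new SentinelIntSet(16, -1);
    a.put(3); a.put(7);
    b.put(7); b.put(11);
    addAll(b, a);                  // merge b into a
    System.out.println(a.size());  // 3 distinct docids: 3, 7, 11
  }
}

Of course it would be cleaner to add a proper iteration or visitor method to 
SentinelIntSet itself rather than poke at its internals like this.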

The simplest solution is to migrate IntHashSet from the facet module to Lucene 
core.  I'm tempted to create a new class that borrows heavily from 
SortedIntDocSet (rough sketch below), but that's more work and would demand 
benchmarking to see if it's worth the extra development time, which I don't 
have right now.
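For the record, the SortedIntDocSet-ish idea I have in mind is roughly this 
(a from-scratch sketch of the general technique, not Solr's actual code): keep 
the matching docids in a sorted int array and, since lookups arrive in 
non-decreasing docid order, advance a cursor instead of binary-searching each 
time.

/** Rough sketch of the sorted-array idea: exists() is called with
 *  non-decreasing docids, so a single advancing cursor suffices.
 *  Not Solr's SortedIntDocSet -- just the general technique. */
public class SortedDocIdCursor {
  private final int[] sortedDocs;  // matching docids, ascending, no duplicates
  private int pos = 0;             // cursor into sortedDocs

  public SortedDocIdCursor(int[] sortedDocs) {
    this.sortedDocs = sortedDocs;
  }

  /** Must be called with docids in non-decreasing order. */
  public boolean exists(int docId) {
    while (pos < sortedDocs.length && sortedDocs[pos] < docId) {
      pos++;                       // skip past smaller ids; never rewinds
    }
    return pos < sortedDocs.length && sortedDocs[pos] == docId;
  }

  public static void main(String[] args) {
    SortedDocIdCursor cursor = new SortedDocIdCursor(new int[] {2, 5, 9});
    System.out.println(cursor.exists(2));  // true
    System.out.println(cursor.exists(3));  // false
    System.out.println(cursor.exists(9));  // true
  }
}

Merging two such sets would then just be a linear merge of two sorted arrays, 
so the whole thing stays a single O(n) pass.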

Thoughts?

~ David
