I'm looking for some advice on data structures, perhaps resulting in moving or consolidating some existing ones used across Lucene/Solr.
I'm working on a Filter implementation in the spatial module that needs to collect a set of docids. I could of course use a new OpenBitSet(maxDoc) (or FixedBitSet), but the algorithm I have is recursive, and each level down could potentially need to create another one of these. That uses a lot of memory and isn't GC-friendly. It's also wasteful, because very few documents are expected to match this filter. So a hash-based int set appears to be the right data structure. I'm pretty sure the docids are always looked up in sequential order, which suggests an implementation based on a sorted array of integers. There's some cool code in Solr, SortedIntDocSet, that should mostly work well. Remember that Lucene spatial is a Lucene module and depends only on lucene-core and the queries module. (It depends on Spatial4j too, but that's not relevant here.)

* SentinelIntSet (lucene-core): Looks nice, but there is no set iterator. I need to merge these sets, so I need to iterate values. The underlying storage is one int array. By default the hash is the int itself, which I like because it's fast and I'm not concerned with collisions.

* IntHashSet (facet module): Has the value iterator I need. The underlying storage is three int arrays, making it considerably bulkier than SentinelIntSet's one. That's excessive, I think. By default the hash appears to be "key & hashFactor", which should be fast.

* SortedIntDocSet (Solr): Sophisticated algorithms on a sorted array, assuming sequential id lookups. Nice! But it's intertwined with other Solr classes, and it carries more code weight than the other two, much of it for things I don't think I need.

* IntOpenHashSet (HPPC, from CarrotSearch Labs): Looks good. Backed by one int array and one boolean array. I like this implementation better than IntHashSet and SentinelIntSet. However, it hashes the id with MurmurHash3, which I think is overkill, and the API offers no subclassing opportunity for me to substitute something else.

The simplest solution is to migrate IntHashSet from the facet module to Lucene core. I'm tempted instead to create a new class that borrows heavily from SortedIntDocSet, but that's more work and would demand benchmarking to see whether it's worth the extra development time, which I don't have right now.

Thoughts?

~ David
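P.S. To make the SentinelIntSet point concrete, here is a toy sketch of a one-array, sentinel-based int hash set in that spirit, plus the value iteration that class lacks (the names TinyIntSet, put, contains, and forEach are mine, not Lucene's; linear probing, identity hash, and no table growth, so the caller must size the table larger than the element count):

```java
// Toy one-array int set: empty slots hold a sentinel value that is
// guaranteed never to be stored (e.g. -1 works for docids).
class TinyIntSet {
    private final int[] slots;
    private final int sentinel;

    TinyIntSet(int capacity, int sentinel) {
        this.slots = new int[capacity];
        this.sentinel = sentinel;
        java.util.Arrays.fill(slots, sentinel);
    }

    void put(int value) {
        int i = Math.floorMod(value, slots.length); // identity hash, wrapped to table size
        while (slots[i] != sentinel && slots[i] != value) {
            i = (i + 1) % slots.length; // linear probe on collision
        }
        slots[i] = value;
    }

    boolean contains(int value) {
        int i = Math.floorMod(value, slots.length);
        while (slots[i] != sentinel) {
            if (slots[i] == value) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }

    /** The missing iteration: walk the single backing array, skip sentinel slots. */
    void forEach(java.util.function.IntConsumer action) {
        for (int slot : slots) {
            if (slot != sentinel) action.accept(slot);
        }
    }
}

public class TinyIntSetDemo {
    public static void main(String[] args) {
        TinyIntSet set = new TinyIntSet(16, -1);
        set.put(3);
        set.put(19); // 19 % 16 == 3: collides with 3, probes to the next slot
        set.put(7);
        System.out.println(set.contains(19)); // true
        System.out.println(set.contains(4));  // false
        set.forEach(v -> System.out.println("value: " + v));
    }
}
```

The point is that a single backing array is enough to support merging: iteration is just a scan that skips sentinels, with no extra arrays needed.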
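And the sorted-array-with-sequential-lookups idea is only a few lines at its core. This is a hypothetical illustration, not the actual SortedIntDocSet code (the name SortedIntCursor is mine): a sorted int[] plus a cursor that only moves forward, so a full ascending probe sequence over a segment costs O(n) total:

```java
// Toy sketch: matched docids stored as a sorted int[], with a cursor that
// resumes where the previous lookup stopped. Valid only if callers probe
// docids in ascending order, as a scan over a segment does.
class SortedIntCursor {
    private final int[] docs; // matching docids, ascending, no duplicates
    private int pos = 0;      // resume point for the next lookup

    SortedIntCursor(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    /** Returns true if docId is in the set; probes must be ascending. */
    boolean matches(int docId) {
        while (pos < docs.length && docs[pos] < docId) {
            pos++; // skip past ids smaller than the probe
        }
        return pos < docs.length && docs[pos] == docId;
    }
}

public class SortedCursorDemo {
    public static void main(String[] args) {
        SortedIntCursor cursor = new SortedIntCursor(new int[] {3, 7, 42});
        System.out.println(cursor.matches(3));  // true
        System.out.println(cursor.matches(5));  // false
        System.out.println(cursor.matches(7));  // true
        System.out.println(cursor.matches(42)); // true
    }
}
```

Memory is exactly one int per matching doc, which is the appeal when few documents match; the real SortedIntDocSet layers smarter search strategies on top of the same idea.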
