gsmiller commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r715860923
##########
File path:
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim,
String... path) throws I
return new FacetResult(dim, path, totValue, labelValues, childCount);
}
+
+ /**
+ * Class that uses FixedBitSet to store counts for all ordinals with 1 count
and IntIntHashMap for
+ * all other counts
+ */
+ protected static class IntIntHashMapWithFixedBitSet implements
Iterable<IntIntCursor> {
Review comment:
I'd suggest looking through some of the other implementations of
`Facets` since there are a few other ones that might also benefit from this
work. Anything that utilizes the option of sparse counting into a map would be
a good candidate. To that end, I would suggest splitting this out into its own
class to be reused. Ideally it would be pkg-private, but I think the
sub-packaging under the facets package might prevent us from doing this :(
##########
File path:
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -27,14 +29,16 @@
import org.apache.lucene.facet.FacetsConfig.DimConfig;
import org.apache.lucene.facet.LabelAndValue;
import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.FixedBitSet;
/** Base class for all taxonomy-based facets that aggregate to a per-ords
int[]. */
public abstract class IntTaxonomyFacets extends TaxonomyFacets {
/** Per-ordinal value. */
private final int[] values;
- private final IntIntHashMap sparseValues;
+ private final IntIntHashMapWithFixedBitSet sparseValues;
Review comment:
It might be worth experimenting with a heuristic-based approach to
determine when it's worth using this new approach vs. just using the map
directly. We'd ideally want to only use this new approach when it's likely that
a large number of the unique ordinals/categories seen during counting are only
seen once. If most of the ordinals we see when counting are seen more than
once, this just adds unnecessary overhead. I wonder if there's a heuristic that
would make sense for approximating the likelihood of a long-tail of single-hit
ordinals in counting? Maybe some experimentation on the wikipedia dataset
through `luceneutil` would help?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]