[GitHub] [lucene] gsmiller commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

GitBox Fri, 24 Sep 2021 12:42:49 -0700


gsmiller commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r715860923




##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, 
String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count 
and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements 
Iterable<IntIntCursor> {

Review comment:
       I'd suggest looking through some of the other implementations of 
`Facets` since there are a few other ones that might also benefit from this 
work. Anything that utilizes the option of sparse counting into a map would be 
a good candidate. To that end, I would suggest splitting this out into its own 
class to be reused. Ideally it would be pkg-private, but I think the 
sub-packaging under the facets package might prevent us from doing this :(

##########
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -27,14 +29,16 @@
 import org.apache.lucene.facet.FacetsConfig.DimConfig;
 import org.apache.lucene.facet.LabelAndValue;
 import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.FixedBitSet;
 
 /** Base class for all taxonomy-based facets that aggregate to a per-ords 
int[]. */
 public abstract class IntTaxonomyFacets extends TaxonomyFacets {
 
   /** Per-ordinal value. */
   private final int[] values;
 
-  private final IntIntHashMap sparseValues;
+  private final IntIntHashMapWithFixedBitSet sparseValues;

Review comment:
       It might be worth experimenting with a heuristic-based approach to 
determine when it's worth using this new approach vs. just using the map 
directly. We'd ideally want to only use this new approach when it's likely that 
a large number of the unique ordinals/categories seen during counting are only 
seen once. If most of the ordinals we see when counting are seen more than 
once, this just adds unnecessary overhead. I wonder if there's a heuristic that 
would make sense for approximating the likelihood of a long-tail of single-hit 
ordinals in counting? Maybe some experimentation on the wikipedia dataset 
through `luceneutil` would help?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] gsmiller commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Reply via email to