Hi Marc,

You seem to hit all the questions we had too :)

The 10k vs 100k sample size was mainly influenced by the users: 100k is slower but more accurate, with less chance of missing that one doc that had the outlier value.

Our basic hybrid approach that we settled on in our application was to show a maximum number of facet values (i.e. the top N: 5, 10, or whatever the user wants). If there were fewer than N values, we check whether more values exist by leveraging the index. For each FacetField we know the number of FacetValues associated from the TaxonomyIndex. So, for example, you know there exist 12 facet values, but your random sample only returned 2. Because of the nature of sampling, you will find the values that occur the most. To 'fix' the missing values, we do another search (excluding the already found FacetValues), and because that is only a small number of hits, it tends to be very fast.

Of course, you only need to do this if you know there might be more values. If the sample returned all possible FacetValues, you are done and don't need that second search.

(Another approach we tried, which is faster but less nice to the users, is just to include the missed values with an estimated count of 0. In that case you might show FacetValues that do not occur in the result.)

Then for the 'counts'. Because they are estimated, they are probably off. We rounded the counts down, keeping only the first digit. So, 143.232 as amortized count would get displayed as ~100.000, 201.000 as ~200.000, etc. Rounding down makes the count almost never higher than the real count that you would get when not sampling the result. As the number of hits is that high, users probably don't care too much about the exact numbers but more about the relative counts.

Hope this helps!

-Rob

On Wed, Oct 9, 2024 at 2:16 AM Marc Davenport <madavenp...@cargurus.com.invalid> wrote:

> Hi Rob,
> How did you measure accuracy when finding that sweet spot between speed and
> accuracy for you?
> I'm trying to find a reasonable way to characterize the
> error introduced by sampling. For example, if one facet value would have a
> count of 1234 if done directly, but 1233 when done with sampling, that
> seems fine. What I'm worried about is a facet value that is rare and would
> have a count of 1 when done directly, but not show up at all due to
> sampling error. Perhaps I'm misunderstanding the way the sampling is done
> and that latter case cannot happen.
>
> Marc
>
> On Tue, Oct 8, 2024 at 1:27 PM Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > Hi Marc,
> >
> > I worked extensively on an application that leveraged facet counts in
> > lucene 8 series (and also aggregation by leveraging the facet fields,
> > albeit with a custom implementation) for document sets with over 100M
> > documents. We settled on random sampling if the number of hits was greater
> > than 100k, as a tradeoff between speed and accuracy in the results.
> >
> > We ended up not using drill sideways but keeping the state of the last
> > changed facet field unselected values when interacting with that specific
> > field. Not sure if that fits your use case, but it is a typical user
> > interaction when searching and filtering by facets.
> >
> > On Tue, Oct 8, 2024, 17:29 Marc Davenport <madavenp...@cargurus.com.invalid>
> > wrote:
> >
> > > Thanks Stefan,
> > >
> > > I will look into both refactoring to use drillsideways as well as the
> > > new aggregation engine. It might be a decent size lift on our end to
> > > reorganize our code to do that. For now, I've switched to using the random
> > > sampling facet collector when we suspect that it will be a larger query.
> > > That has definitely compressed the results of our queries into a more
> > > acceptable time. We are still tuning the threshold and I just spiked 10k
> > > as a first guess at a threshold for the sampling collector.
> > > I have noticed
> > > that some of our queries are slower using the sampling collector when they
> > > are just above that threshold. But more tuning will be done.
> > > Thanks!
> > > Marc
> > >
> > > On Wed, Oct 2, 2024 at 7:37 AM Stefan Vodita <stefan.vod...@gmail.com>
> > > wrote:
> > >
> > > > Hi Marc,
> > > >
> > > > I'm curious what version of Lucene you're using.
> > > >
> > > > Outside that, I can give two pointers.
> > > >
> > > > 1. I think you're right to want to look into using DrillSideways for your
> > > > use-case. There are some examples in the demo package [1], which
> > > > should be helpful.
> > > >
> > > > 2. There is a new aggregation engine [2] in Lucene 9.12, in the sandbox
> > > > module for now, if you're willing to consider it. It facets at match-time
> > > > and is generally faster than the faceting we had before 9.12.
> > > >
> > > > Stefan
> > > >
> > > > [1] https://github.com/apache/lucene/tree/main/lucene/demo/src/java/org/apache/lucene/demo/facet
> > > > [2] https://github.com/apache/lucene/pull/13568
> > > >
> > > > On Mon, 30 Sept 2024 at 19:26, Marc Davenport
> > > > <madavenp...@cargurus.com.invalid> wrote:
> > > >
> > > > > I've been looking at the way our code gets the facet counts from Lucene
> > > > > and see if there are some obvious inefficiencies. We have about 60 normal
> > > > > flat facets, some of which are multi-valued, and 5 or so hierarchical and
> > > > > multi-valued facets. I'm seeing cases where the call to create a
> > > > > FastTaxonomyFacetCounts is taking 1+ seconds when it would be matching on
> > > > > 800k documents. This leads me to believe I've got some implementation
> > > > > flaw. Are there any common errors people make when implementing facets?
> > > > > Known trouble spots that I should investigate?
> > > > >
> > > > > Right now we retrieve the counts for the facets independently from the
> > > > > retrieval of matching documents. Each facet has its own runner which will
> > > > > calculate its current counts as well as a more relaxed query state that
> > > > > will show its other values. Different facets will share a cached facet
> > > > > collector if they have the same query state. I know the "hold one out"
> > > > > pattern isn't ideal. I am looking at how we could use the
> > > > > drillsideways queries, but I'm not sure I totally understand them.
> > > > >
> > > > > The FastTaxonomyFacetCounts creation speed is in relation to the number and
> > > > > cardinality of the facets on the documents. We pruned off no longer needed
> > > > > facets. Would it make sense to start maintaining more than one Taxonomy
> > > > > Index?
> > > > >
> > > > > I've been looking for any good books or resources to read about Lucene. I
> > > > > have the original Lucene in Action, which has been helpful in some ways,
> > > > > but covers only v3. Many newer concepts are sort of left to javadoc, or
> > > > > reading through the PRs. Any suggestions on things to read to better
> > > > > understand Lucene and its proper use?
> > > > >
> > > > > Thank you,
> > > > > Marc
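
For reference, the two rules from Rob's reply at the top of this thread (rounding an estimated count down to its first digit, and deciding whether the follow-up search for missed facet values is needed) can be sketched in a few lines. This is only an illustration in Python, not Lucene code: the function names and parameters are hypothetical, the counts are assumed to be non-negative integers, and "143.232" in the reply is read as 143,232.

```python
def round_down_to_leading_digit(amortized_count: int) -> int:
    """Keep only the first digit of an estimated count, e.g. 143232 -> 100000.

    Hypothetical helper; assumes a non-negative integer count.
    """
    if amortized_count < 0:
        raise ValueError("count must be non-negative")
    # 10^(number of digits - 1), e.g. 100000 for a six-digit count.
    magnitude = 10 ** (len(str(amortized_count)) - 1)
    return (amortized_count // magnitude) * magnitude


def needs_second_search(known_value_count: int,
                        sampled_value_count: int,
                        top_n: int) -> bool:
    """Decide whether to run the follow-up search for missed facet values.

    known_value_count: number of values the index reports for the facet field.
    sampled_value_count: number of distinct values the sampled search returned.
    top_n: maximum number of facet values to display.
    """
    # A second search only helps if there is room left in the top-N list
    # and the index knows of values that the sample did not return.
    return sampled_value_count < top_n and sampled_value_count < known_value_count


if __name__ == "__main__":
    print(round_down_to_leading_digit(143232))
    print(needs_second_search(known_value_count=12, sampled_value_count=2, top_n=10))
```

With Rob's example of 12 known values and 2 sampled values, the check asks for the second search; if the sample had already returned every known value, or had filled the top-N list, it would not.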