Hi Marc,

I worked extensively on an application that leveraged facet counts in the
Lucene 8 series (and also aggregations built on the facet fields, albeit
with a custom implementation) for document sets with over 100M documents.
We settled on random sampling when the number of hits was greater than
100k, as a tradeoff between speed and accuracy in the results.
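
To make the threshold idea concrete, here is a rough sketch in plain Java.
It is only an illustration, not our production code, and it does not use
Lucene's API. The class name, the method signature, and the choice to keep
each hit with probability threshold/n are all assumptions made for the
example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration: exact counts for small result sets,
// sampled and scaled-up estimates for large ones.
class SampledFacetCounts {

    // hitFacetValues holds one facet value per matching document (assumed input shape).
    static Map<String, Long> count(List<String> hitFacetValues, int threshold, long seed) {
        int n = hitFacetValues.size();
        Map<String, Long> counts = new HashMap<>();

        // At or below the threshold: count every hit exactly.
        if (n <= threshold) {
            for (String value : hitFacetValues) {
                counts.merge(value, 1L, Long::sum);
            }
            return counts;
        }

        // Above the threshold: keep each hit with probability threshold / n,
        // so roughly `threshold` hits are counted.
        double rate = (double) threshold / n;
        Random random = new Random(seed);
        long sampled = 0;
        for (String value : hitFacetValues) {
            if (random.nextDouble() < rate) {
                counts.merge(value, 1L, Long::sum);
                sampled++;
            }
        }

        // Scale the sampled counts back up to estimate the full counts.
        double scale = sampled == 0 ? 0.0 : (double) n / sampled;
        counts.replaceAll((value, c) -> Math.round(c * scale));
        return counts;
    }
}
```

The estimates for rare values are the least reliable, which is the accuracy
side of the tradeoff mentioned above.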

We ended up not using DrillSideways. Instead, we kept the state of the
unselected values of the last changed facet field while the user was
interacting with that specific field. Not sure if that fits your use case,
but it is a typical user interaction when searching and filtering by facets.
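
As a rough sketch of that state-keeping idea, again in plain Java and
without any Lucene calls: the class and method names below are hypothetical,
and the sketch assumes the UI holds a map of field -> (value -> count) that
is refreshed after every selection change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical UI-side state: which values (with counts) are shown per facet field.
class FacetPanelState {

    private final Map<String, Map<String, Long>> shown = new HashMap<>();

    // Applies freshly computed counts after the user changed a selection in
    // lastChangedField. All other fields take the new, narrowed counts; the
    // changed field keeps its previous snapshot so its unselected values stay
    // visible and selectable.
    void update(String lastChangedField, Map<String, Map<String, Long>> freshCounts) {
        Map<String, Long> kept = shown.get(lastChangedField);
        shown.clear();
        shown.putAll(freshCounts);
        if (kept != null) {
            shown.put(lastChangedField, kept);
        }
    }

    Map<String, Long> valuesFor(String field) {
        return shown.getOrDefault(field, Map.of());
    }
}
```

The counts kept for the changed field are stale by design; only the other
fields reflect the narrowed query.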

On Tue, Oct 8, 2024, 17:29 Marc Davenport <madavenp...@cargurus.com.invalid>
wrote:

> Thanks Stefan,
>
> I will look into both refactoring to use DrillSideways and the
> new aggregation engine.  It might be a decent-sized lift on our end to
> reorganize our code to do that.  For now, I've switched to using the random
> sampling facet collector when we suspect that it will be a larger query.
> That has definitely compressed the results of our queries into a more
> acceptable time.  We are still tuning the threshold and I just spiked 10k
> as a first guess at a threshold for the sampling collector.  I have noticed
> that some of our queries are slower using the sampling collector when they
> are just above that threshold.   But more tuning will be done.
> Thanks!
> Marc
>
> On Wed, Oct 2, 2024 at 7:37 AM Stefan Vodita <stefan.vod...@gmail.com>
> wrote:
>
> > Hi Marc,
> >
> > I'm curious what version of Lucene you're using.
> >
> > Outside that, I can give two pointers.
> >
> > 1. I think you're right to want to look into using DrillSideways for your
> > use-case. There are some examples in the demo package [1], which
> > should be helpful.
> >
> > 2. There is a new aggregation engine [2] in Lucene 9.12, in the sandbox
> > module for now, if you're willing to consider it. It facets at match-time
> > and is generally faster than the faceting we had before 9.12.
> >
> > Stefan
> >
> > [1] https://github.com/apache/lucene/tree/main/lucene/demo/src/java/org/apache/lucene/demo/facet
> > [2] https://github.com/apache/lucene/pull/13568
> >
> >
> > On Mon, 30 Sept 2024 at 19:26, Marc Davenport
> > <madavenp...@cargurus.com.invalid> wrote:
> >
> > > I've been looking at the way our code gets the facet counts from Lucene
> > > to see if there are some obvious inefficiencies.  We have about 60 normal
> > > flat facets, some of which are multi-valued, and 5 or so hierarchical and
> > > multi-valued facets. I'm seeing cases where the call to create a
> > > FastTaxonomyFacetCounts is taking 1+ seconds when it would be matching
> > > on 800k documents.  This leads me to believe I've got some implementation
> > > flaw.  Are there any common errors people make when implementing facets?
> > > Known trouble spots that I should investigate?
> > >
> > > Right now we retrieve the counts for the facets independently from the
> > > retrieval of matching documents.  Each facet has its own runner, which
> > > will calculate its current counts as well as a more relaxed query state
> > > that will show its other values.  Different facets will share a cached
> > > facet collector if they have the same query state.  I know the "hold one
> > > out" pattern isn't ideal.  I am looking at how we could use the
> > > DrillSideways queries, but I'm not sure I totally understand them.
> > >
> > > The FastTaxonomyFacetCounts creation speed is in relation to the number
> > > and cardinality of the facets on the documents. We pruned off no longer
> > > needed facets.  Would it make sense to start maintaining more than one
> > > Taxonomy Index?
> > >
> > > I've been looking for any good books or resources to read about Lucene.
> > > I have the original Lucene in Action, which has been helpful in some
> > > ways, but covers only v3. Many newer concepts are sort of left to Javadoc,
> > > or reading through the PRs.  Any suggestions on things to read to better
> > > understand Lucene and its proper use?
> > >
> > > Thank you,
> > > Marc
> > >
> >
>
