Re: Processing query clause combinations at indexing time

Michael Froh Tue, 15 Dec 2020 12:09:35 -0800

Huh... I didn't know about Luwak / the monitor module. I spent some time
this morning going through it. It takes a very different approach to
matching at indexing time versus what we did, and looks more powerful.
Given that document-matching is one of the harder steps in the process, I'm
quite happy to leverage something that already exists.


The feature we built has two other parts -- an offline training piece and a
query-optimizing piece. They share a QueryVisitor that collects required
clauses. The training step identifies frequently co-occurring combinations
of required clauses (using an FP-Growth implementation), and the query
optimizer adds a matching TermQuery as a filter clause (and removes the
replaced clauses, if they're non-scoring). Both pieces are pretty
lightweight compared to document-matching, though.
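
As a rough illustration of how the three steps fit together (training,
index-time tagging, query-time rewriting), here is a minimal sketch in plain
Java with no Lucene dependency. Everything in it is an assumption made for
illustration: the class name ComboTermSketch, its method names, and the
modelling of clauses and terms as plain strings. The training method simply
counts identical required-clause sets; it is a stand-in for FP-Growth, which
would also find frequent subsets. The sketch also treats every required
clause as non-scoring, so replaced clauses are always removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model: clauses and terms are plain strings; no Lucene classes are used. */
final class ComboTermSketch {
    // Prefix borrowed from the "opto:ABCD" example in the thread.
    static final String PREFIX = "opto:";

    /**
     * Training stand-in (NOT FP-Growth): count identical sets of required
     * clauses and keep those seen at least minSupport times.
     */
    static List<Set<String>> train(List<Set<String>> historicalQueries, int minSupport) {
        Map<Set<String>, Integer> counts = new LinkedHashMap<>();
        for (Set<String> required : historicalQueries) {
            if (required.size() > 1) {
                // TreeSet gives an order-independent, sorted key.
                counts.merge(new TreeSet<>(required), 1, Integer::sum);
            }
        }
        List<Set<String>> combos = new ArrayList<>();
        for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                combos.add(e.getKey());
            }
        }
        return combos;
    }

    /** Builds the synthetic term for a combination, e.g. {A,B,C,D} -> "opto:ABCD". */
    static String comboTerm(Set<String> combo) {
        return PREFIX + String.join("", new TreeSet<>(combo));
    }

    /**
     * Index time: return the extra terms to add to a document. Here a
     * document "satisfies" a combination if it contains all of its terms.
     */
    static Set<String> extraTerms(Set<String> docTerms, List<Set<String>> combos) {
        Set<String> extra = new LinkedHashSet<>();
        for (Set<String> combo : combos) {
            if (docTerms.containsAll(combo)) {
                extra.add(comboTerm(combo));
            }
        }
        return extra;
    }

    /**
     * Query time: replace the first known combination found among the
     * required clauses with its single synthetic term.
     */
    static Set<String> rewrite(Set<String> required, List<Set<String>> combos) {
        for (Set<String> combo : combos) {
            if (required.containsAll(combo)) {
                Set<String> out = new LinkedHashSet<>(required);
                out.removeAll(combo);
                out.add(comboTerm(combo));
                return out;
            }
        }
        return required; // nothing to optimize
    }
}
```

In a real Lucene setup the required clauses would presumably be collected
with a QueryVisitor, and the index-time "satisfies" check would need real
document matching, which is the part Luwak could take over.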

On Tue, Dec 15, 2020 at 7:41 AM Michael Sokolov wrote:

> I feel like there could be some considerable overlap with features
> provided by Luwak, which was contributed to Lucene fairly recently,
> and I think does the query inversion work required for this; maybe
> more of it already exists here? I don't know if that module handles
> the query rewriting, or the term indexing you're talking about though.
>
> On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma wrote:
> >
> > +1
> >
> > I would suggest that this be an independent project hosted on GitHub
> > (there have been similar projects in the past that have seen success
> > that way).
> >
> > On Tue, 15 Dec 2020, 09:37 David Smiley wrote:
> >>
> >> Great optimization!
> >>
> >> I'm dubious about it being a good contribution to Lucene itself,
> >> however, because what you propose fits cleanly above Lucene. Even at an
> >> ES/Solr layer (which I know you don't use, but hypothetically speaking),
> >> I'm dubious there as well.
> >>
> >> ~ David Smiley
> >> Apache Lucene/Solr Search Developer
> >> http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh wrote:
> >>>
> >>> My team at work has a neat feature that we've built on top of Lucene
> >>> that has provided a substantial (20%+) increase in maximum qps and
> >>> some reduction in query latency.
> >>>
> >>> Basically, we run a training process that looks at historical queries
> >>> to find frequently co-occurring combinations of required clauses, say
> >>> "+A +B +C +D". Then at indexing time, if a document satisfies one of
> >>> these known combinations, we add a new term to the doc, like
> >>> "opto:ABCD". At query time, we can then replace the required clauses
> >>> with a single TermQuery for the "optimized" term.
> >>>
> >>> It adds a little bit of extra work at indexing time and requires the
> >>> offline training step, but we've found that it yields a significant
> >>> boost at query time.
> >>>
> >>> We're interested in open-sourcing this feature. Is it something worth
> >>> adding to Lucene? Since it doesn't require any core changes, maybe as
> >>> a module?
