Huh... I didn't know about Luwak / the monitoring module. I spent some time this morning going through it. It takes a very different approach to matching at indexing time versus what we did, and looks more powerful. Given that document-matching is one of the harder steps in the process, I'm quite happy to leverage something that already exists.
The feature we built has two other parts -- an offline training piece and a query-optimizing piece. They share a QueryVisitor that collects required clauses. The training step identifies frequently co-occurring combinations of required clauses (using an FP-Growth implementation) and the query optimizer adds a matching TermQuery as a filter clause (and removes the replaced clauses, if they're non-scoring). They're pretty lightweight compared to document-matching, though.

On Tue, Dec 15, 2020 at 7:41 AM Michael Sokolov <[email protected]> wrote:

> I feel like there could be some considerable overlap with features
> provided by Luwak, which was contributed to Lucene fairly recently,
> and I think does the query inversion work required for this; maybe
> more of it already exists here? I don't know if that module handles
> the query rewriting, or the term indexing you're talking about though.
>
> On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma <[email protected]> wrote:
> >
> > +1
> >
> > I would suggest that this be an independent project hosted on Github
> > (there have been similar projects in the past that have seen success
> > that way)
> >
> > On Tue, 15 Dec 2020, 09:37 David Smiley, <[email protected]> wrote:
> >>
> >> Great optimization!
> >>
> >> I'm dubious on it being a good contribution to Lucene itself however,
> >> because what you propose fits cleanly above Lucene. Even at a ES/Solr
> >> layer (which I know you don't use, but hypothetically speaking), I'm
> >> dubious there as well.
> >>
> >> ~ David Smiley
> >> Apache Lucene/Solr Search Developer
> >> http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh <[email protected]> wrote:
> >>>
> >>> My team at work has a neat feature that we've built on top of Lucene
> >>> that has provided a substantial (20%+) increase in maximum qps and
> >>> some reduction in query latency.
> >>>
> >>> Basically, we run a training process that looks at historical queries
> >>> to find frequently co-occurring combinations of required clauses, say
> >>> "+A +B +C +D". Then at indexing time, if a document satisfies one of
> >>> these known combinations, we add a new term to the doc, like
> >>> "opto:ABCD". At query time, we can then replace the required clauses
> >>> with a single TermQuery for the "optimized" term.
> >>>
> >>> It adds a little bit of extra work at indexing time and requires the
> >>> offline training step, but we've found that it yields a significant
> >>> boost at query time.
> >>>
> >>> We're interested in open-sourcing this feature. Is it something worth
> >>> adding to Lucene? Since it doesn't require any core changes, maybe as
> >>> a module?
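
For anyone following the thread, here is a rough sketch of the indexing-time and query-time halves of the idea in the original message ("+A +B +C +D" becoming a single "opto:ABCD" term). It is not the actual implementation and deliberately uses no Lucene types: clauses are modeled as plain strings, the class and method names (OptoSketch, optoTerm, extraTermsForDocument, rewriteRequired) are made up for illustration, every clause is assumed to be non-scoring, and the rewrite simply takes the first known combination that fits.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java illustration only; all names here are hypothetical. */
final class OptoSketch {

    /** Assumed naming scheme, modeled on the "opto:ABCD" example in the thread. */
    static String optoTerm(Set<String> combination) {
        // Sort the clause keys so a combination always maps to the same term.
        return "opto:" + String.join("", new TreeSet<>(combination));
    }

    /** Indexing time: one extra term for each known combination the document fully satisfies. */
    static List<String> extraTermsForDocument(Set<String> satisfiedClauses,
                                              List<Set<String>> knownCombinations) {
        List<String> extra = new ArrayList<>();
        for (Set<String> combination : knownCombinations) {
            if (satisfiedClauses.containsAll(combination)) {
                extra.add(optoTerm(combination));
            }
        }
        return extra;
    }

    /**
     * Query time: if the required clauses contain a known combination, swap those
     * clauses for the single optimized term. Simplification: the first match wins,
     * and all clauses are treated as non-scoring, so they can be dropped outright.
     */
    static Set<String> rewriteRequired(Set<String> requiredClauses,
                                       List<Set<String>> knownCombinations) {
        for (Set<String> combination : knownCombinations) {
            if (requiredClauses.containsAll(combination)) {
                Set<String> rewritten = new LinkedHashSet<>(requiredClauses);
                rewritten.removeAll(combination);
                rewritten.add(optoTerm(combination));
                return rewritten;
            }
        }
        return requiredClauses; // no known combination applies; leave the query alone
    }
}
```

In a real Lucene setup the string keys would presumably be produced by the shared QueryVisitor, and a scoring clause would have to stay in the query next to the added filter rather than being removed.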
