We don't handle positional queries in our use-case, simply because we don't
see many of them. If we identified documents at indexing time that match a
given phrase/slop/etc. query, though, we could tag those documents with a
term that indicates the match (or, more likely, tag documents that match
that positional query AND some other clauses). We can identify documents
that match a PhraseQuery, for example, by appending a TokenFilter on the
relevant field that "listens" for the given phrase.
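To make the PhraseQuery case concrete, here is a minimal, Lucene-free sketch
of the detection logic (class and method names are hypothetical, not our
actual code; a real implementation would be a TokenFilter comparing term
attributes as tokens stream by, rather than buffering a token list):

```java
import java.util.List;

// Hypothetical sketch: detect whether a document's token stream contains a
// phrase as a contiguous run of terms, the way a "listening" TokenFilter
// would while the document is analyzed at indexing time.
public class PhraseListener {

    // Returns true if `phrase` occurs as a contiguous subsequence of `tokens`.
    public static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return true;
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!tokens.get(start + i).equals(phrase.get(i))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                // Here the indexing chain would tag the doc, e.g. add "opto:...".
                return true;
            }
        }
        return false;
    }
}
```

A streaming TokenFilter would instead keep a fixed-size window of recent
terms (and track position increments, for slop) instead of buffering the
whole stream.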

Our use-case has only needed TermQuery, numeric range queries, and
ToParentBlockJoinQuery clauses so far, though. For TermQuery, we just
listen for individual terms (with a TokenFilter). For range queries, we
inspect the IndexableField itself (typically an IntPoint) before submitting
the Document to the IndexWriter. For ToParentBlockJoinQuery, we apply the
matching logic to each child document to detect a match before we reach the
parent. The downside is that every Query type we want to evaluate at
indexing time needs explicit support.
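As an illustration, index-time evaluation of one trained combination of a
couple of terms plus a numeric range might look roughly like this (a
hand-written sketch with hypothetical names, not our actual code; the terms
would come from a listening TokenFilter and the numeric value from the
IndexableField):

```java
import java.util.Set;

// Hypothetical sketch: decide, before the Document reaches the IndexWriter,
// whether it matches a trained combination such as
// "+color:red +brand:acme +price:[10 TO 50]".
public class CombinationMatcher {

    private final Set<String> requiredTerms; // terms a TokenFilter listened for
    private final int minValue;              // bounds checked against the IntPoint value
    private final int maxValue;

    public CombinationMatcher(Set<String> requiredTerms, int minValue, int maxValue) {
        this.requiredTerms = requiredTerms;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    // `docTerms` are the terms observed for this document; `value` is read
    // from the document's numeric field before indexing.
    public boolean matches(Set<String> docTerms, int value) {
        // On a match, the indexing chain would add the tag term (e.g. "opto:ABCD").
        return docTerms.containsAll(requiredTerms)
                && value >= minValue
                && value <= maxValue;
    }
}
```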

We're not scoring at matching time (relying on a static sort instead),
which allows us to remove the matched clauses altogether. That said, if the
match set of the conjunction of required clauses is small (at least smaller
than the match sets of the individual clauses), adding a "precomputed
intersection" filter should advance scorers more efficiently.

Does Lucene's filter caching match on subsets of required clauses? So, for
example, if some queries contain (somewhere in a BooleanQuery tree) clauses
that flatten to "+A +B +C", can I cache that and also have it kick in for a
BooleanQuery containing "+A +B +C +D", turning it into something like
"+cached('+A +B +C') +D" without having to explicitly do a cache lookup for
"+A +B +C"?
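For comparison, the subset-matching rewrite we currently do by hand looks
roughly like this (a simplified sketch over flattened clause sets, with
hypothetical names; real clauses are Query objects rather than strings, and
this version just takes the first cached combination that applies):

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: if a cached combination's clauses are a subset of a
// query's required clauses, replace that subset with the single cached term.
public class SubsetRewriter {

    // Maps a trained clause combination, e.g. {A, B, C}, to its tag term, e.g. "opto:ABC".
    private final Map<Set<String>, String> cachedCombinations;

    public SubsetRewriter(Map<Set<String>, String> cachedCombinations) {
        this.cachedCombinations = cachedCombinations;
    }

    // Rewrites {A, B, C, D} into {opto:ABC, D} when {A, B, C} is cached.
    public Set<String> rewrite(Set<String> requiredClauses) {
        for (Map.Entry<Set<String>, String> e : cachedCombinations.entrySet()) {
            if (requiredClauses.containsAll(e.getKey())) {
                Set<String> rewritten = new LinkedHashSet<>(requiredClauses);
                rewritten.removeAll(e.getKey());
                rewritten.add(e.getValue());
                return rewritten;
            }
        }
        return requiredClauses; // no cached combination applies
    }
}
```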

I guess another advantage of our approach is that it's effectively a
write-through cache, pushing the filter-matching burden to indexing time.
For read-heavy use-cases, that trade-off is worth it.




On Tue, Dec 15, 2020 at 3:42 PM Robert Muir <rcm...@gmail.com> wrote:

> What are you doing with positional queries though? And how does the
> scoring work (it is unclear from your previous reply to me whether you
> are scoring).
>
> Lucene has filter caching too, so if you are doing this for
> non-scoring cases maybe something is off?
>
> On Tue, Dec 15, 2020 at 3:19 PM Michael Froh <msf...@gmail.com> wrote:
> >
> > It's conceptually similar to CommonGrams in the single-field case,
> though it doesn't require terms to appear in any particular positions.
> >
> > It's also able to match across fields, which is where we get a lot of
> benefit. We have frequently-occurring filters that get added by various
> front-end layers before they hit us (which vary depending on where the
> query comes from). In that regard, it's kind of like Solr's filter cache,
> except that we identify the filters offline by analyzing query logs, find
> common combinations of filters (especially ones where the intersection is
> smaller than the smallest term's postings list), and cache the filters in
> the index the next time we reindex.
> >
> > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir <rcm...@gmail.com> wrote:
> >>
> >> See also commongrams which is a very similar concept:
> >>
> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
> >>
> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir <rcm...@gmail.com> wrote:
> >> >
> >> > I wonder if it can be done in a fairly clean way. This sounds similar
> >> > to using a ShingleFilter to do this optimization, but adding some
> >> > conditionals so that the index is smaller? Now that we have
> >> > ConditionalTokenFilter (for branching), can the feature be implemented
> >> > cleanly?
> >> >
> >> > Ideally it wouldn't require a lot of new code, something like checking
> >> > a "set" + conditionaltokenfilter + shinglefilter?
> >> >
> >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh <msf...@gmail.com>
> wrote:
> >> > >
> >> > > My team at work has a neat feature that we've built on top of
> Lucene that has provided a substantial (20%+) increase in maximum qps and
> some reduction in query latency.
> >> > >
> >> > > Basically, we run a training process that looks at historical
> queries to find frequently co-occurring combinations of required clauses,
> say "+A +B +C +D". Then at indexing time, if a document satisfies one of
> these known combinations, we add a new term to the doc, like "opto:ABCD".
> At query time, we can then replace the required clauses with a single
> TermQuery for the "optimized" term.
> >> > >
> >> > > It adds a little bit of extra work at indexing time and requires
> the offline training step, but we've found that it yields a significant
> boost at query time.
> >> > >
> >> > > We're interested in open-sourcing this feature. Is it something
> worth adding to Lucene? Since it doesn't require any core changes, maybe as
> a module?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
