Re: Looking for resources to understand query cost/complexity

Adrien Grand Fri, 21 Feb 2025 06:01:08 -0800

This depends on many factors, but in my experience these two are good
starting points:
 - Total number of matching docs of the query.
 - Number of segments times number of terms being looked up.

This is a simplified model: some queries incur their own costs, e.g. phrase
queries bottleneck on evaluating positions. Also, some collectors have
optimizations, e.g. computing the top-k hits by score or field usually only
requires evaluating a fraction of the doc ID space (assuming a low k) and
is much faster than evaluating all hits (e.g. when computing facets).

> High level users often construct user queries of the form (all applied
> to the article text) of:
>
>        (a OR b OR c ...) AND (d OR e OR f ...) ...
>
> At a simple level, how do costs accrue?  By simple count of (lower
> case) terms, or by the product of the sums of OR terms??

Lucene has an internal "cost" metric that it uses to make decisions about
which clauses should be evaluated first, which is essentially an
estimation of the number of matching docs (my first point in the previous
paragraph). On term queries, it's evaluated as the number of matching docs
of the term. Disjunctions evaluate this cost as the sum of the costs of
their clauses. Conjunctions evaluate this cost as the min cost of their
clauses. So in your example, the cost would be:

min(cost(a) + cost(b) + cost(c) + ..., cost(d) + cost(e) + cost(f) + ...)

The costs of term lookups accrue linearly though, so it would be the sum
across all clauses regardless of whether they are in a conjunction or a
disjunction.
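
To make the two rules above concrete, here is a rough sketch in Python. It is
not Lucene code: the Term/Or/And classes, the cost() and term_lookups()
helpers and the document frequencies are all made up for illustration. It
only models the "sum for disjunctions, min for conjunctions" estimate and the
"segments times terms" lookup count.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Term:
    text: str


@dataclass
class Or:
    clauses: List["Node"]


@dataclass
class And:
    clauses: List["Node"]


Node = Union[Term, Or, And]


def cost(node: Node, doc_freq: Dict[str, int]) -> int:
    """Estimated number of matching docs, per the rules described above."""
    if isinstance(node, Term):
        # Term query: number of docs that contain the term.
        return doc_freq.get(node.text, 0)
    if isinstance(node, Or):
        # Disjunction: sum of the clause costs (an upper bound).
        return sum(cost(c, doc_freq) for c in node.clauses)
    if isinstance(node, And):
        # Conjunction: bounded by the cheapest clause.
        return min(cost(c, doc_freq) for c in node.clauses)
    raise TypeError(f"unsupported node: {node!r}")


def term_lookups(node: Node, num_segments: int) -> int:
    """Term dictionary lookups: one per term per segment, AND or OR alike."""
    if isinstance(node, Term):
        return num_segments
    return sum(term_lookups(c, num_segments) for c in node.clauses)


# Hypothetical document frequencies, purely for illustration.
doc_freq = {"a": 100, "b": 50, "c": 10, "d": 5, "e": 20, "f": 1}
query = And([
    Or([Term("a"), Term("b"), Term("c")]),
    Or([Term("d"), Term("e"), Term("f")]),
])

print(cost(query, doc_freq))                # min(160, 26) = 26
print(term_lookups(query, num_segments=8))  # 6 terms * 8 segments = 48
```

With these made-up numbers the (d OR e OR f) clause is the cheaper one, so it
bounds the estimate, while the lookup count still covers all six terms.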

> Do the wildcard matches simply increase the number of terms,
> or are there other major costs?

The number of terms is an important factor indeed, and is often a cause of
slowness as it is very easy to create a wildcard expression that matches
LOTS of terms.

Another one is the cost of evaluating terms that match the wildcard
expression. As term dictionaries are sorted, it's reasonably cheap to find
matching terms of prefix queries (e.g. "foo*") but evaluating the matching
terms of an expression with a leading wildcard (e.g. "*foo") almost always
requires checking every single term of every single segment against the
wildcard expression.
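
A small sketch of why the sort order matters, again in Python and again not
Lucene code. A plain sorted list stands in for one segment's term dictionary
(the real one is a more elaborate structure), and prefix_matches() and
suffix_matches() are hypothetical helpers that also report how many terms
they had to examine.

```python
from bisect import bisect_left
from typing import List, Tuple


def prefix_matches(sorted_terms: List[str], prefix: str) -> Tuple[List[str], int]:
    """Terms matching prefix*, plus the number of terms examined."""
    # Binary search jumps straight to the first candidate term.
    i = bisect_left(sorted_terms, prefix)
    matches = []
    examined = 0
    while i < len(sorted_terms):
        examined += 1
        if not sorted_terms[i].startswith(prefix):
            break  # sorted order: no later term can match either
        matches.append(sorted_terms[i])
        i += 1
    return matches, examined


def suffix_matches(sorted_terms: List[str], suffix: str) -> Tuple[List[str], int]:
    """Terms matching *suffix: sort order does not help, so scan everything."""
    matches = [t for t in sorted_terms if t.endswith(suffix)]
    return matches, len(sorted_terms)


# Made-up term dictionary for one segment.
terms = sorted(["bar", "barfoo", "baz", "foo", "foobar",
                "food", "fool", "qux", "xfoo", "zoo"])

print(prefix_matches(terms, "foo"))  # (['foo', 'foobar', 'food', 'fool'], 5)
print(suffix_matches(terms, "foo"))  # (['barfoo', 'foo', 'xfoo'], 10)
```

The prefix case examines the matching terms plus one (the binary search steps
are not counted), while the leading-wildcard case examines every term, and
that scan would be repeated for each segment.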

This is only the surface of query cost, but hopefully this helps.

On Thu, Feb 20, 2025 at 7:30 PM Phil Budne <p...@ultimate.com> wrote:

> Please forgive me if this is the wrong place to ask, the actual
> application is using Elasticsearch, but my understanding is that all
> the actual searches are done at the ES shard level by Lucene.  If
> there is a better place to ask these questions, please let me know.
>
> I'm trying to understand the CPU, memory and I/O costs of queries,
> which started when I wanted to construct test queries to use to report
> system response.
>
> Is there documentation, a book, or a particular layer of code to look
> at to get an understanding of these costs?
>
> Disclaimer: my background is not in searching, indexing, NLP, I'm a
> "systems person" who has a broad interest in how parts of a software
> system interact.
>
> The corpus is a series of Elasticsearch indexes, broken up by ES
> "Index Lifecycle Management" to limit ES Shard (Lucene index) size.
>
> The documents are news articles with a source (domain name, with
> keyword mapping), an extracted "publication date" (date mapping), and
> text (keyword mapping).  Articles are not necessarily added in
> publication date order (although I have proposed partitioning the
> indices by publication year).
>
> Queries have three aspects:
>
> 1. They always have a date range for publication dates to consider.
> 2. They almost always have a list of source domains to consider
>         (currently expressed as a query string
>         domain:[this OR that OR ...])
> 3. They almost always have a user query string
>         (sometimes omitted to get the overall number of articles
>         to normalize result counts)
>
> The first two are applied as Elastic "filters".
>
> Do the number of days, number of domains, and number of query
> terms have equal impact?
>
> High level users often construct user queries of the form (all applied
> to the article text) of:
>
>         (a OR b OR c ...) AND (d OR e OR f ...) ...
>
> At a simple level, how do costs accrue?  By simple count of (lower
> case) terms, or by the product of the sums of OR terms??
>
> Sometimes queries contain wildcards:
>
>         (a* OR b OR c ...) AND (d* OR e OR f ...) ...
>
> Do the wildcard matches simply increase the number of terms,
> or are there other major costs?
>
> Thanks in advance,
> Phil
>
> P.S.
> https://lucene.apache.org/core/discussion.html points to an IRC
> channel at freenode.net, but it's been down any time I've tried the
> link, and the Slack channel seems to require an ASF-affiliated email.
>

-- 
Adrien
