Please forgive me if this is the wrong place to ask.  The actual
application uses Elasticsearch, but my understanding is that all
the actual searches are done at the ES shard level by Lucene.  If
there is a better place to ask these questions, please let me know.

I'm trying to understand the CPU, memory and I/O costs of queries.
This started when I wanted to construct test queries to use for
reporting system response.

Is there documentation, a book, or a particular layer of code to look
at to get an understanding of these costs?

Disclaimer: my background is not in searching, indexing, or NLP; I'm a
"systems person" with a broad interest in how the parts of a software
system interact.

The corpus is a series of Elasticsearch indexes, broken up by ES
"Index Lifecycle Management" to limit ES shard (Lucene index) size.

The documents are news articles with a source (domain name, with
keyword mapping), an extracted "publication date" (date mapping), and
text (keyword mapping).  Articles are not necessarily added in
publication date order (although I have proposed partitioning the
indices by publication year).

Queries have three aspects:

1. They always have a date range for publication dates to consider.
2. They almost always have a list of source domains to consider
        (currently expressed as a query string domain:[this OR that OR ...])
3. They almost always have a user query string
        (sometimes omitted to get the overall number of articles
        to normalize result counts)

The first two are applied as Elastic "filters".
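
For concreteness, a query of this shape can be sketched as an
Elasticsearch request body, written here as a Python dict.  The field
names (publication_date, domain, text) and the helper build_query are
placeholders for illustration, not our real mapping or code, and the
sketch assumes the domain names contain no query-string reserved
characters.

```python
def build_query(start, end, domains, user_query=None):
    """Build a bool query: the date range and domain list go in as
    filters, the optional user query string as the scored clause."""
    filters = [
        # Aspect 1: publication date range (hypothetical field name).
        {"range": {"publication_date": {"gte": start, "lte": end}}},
        # Aspect 2: list of source domains, expressed as a query string.
        {"query_string": {"query": "domain:(" + " OR ".join(domains) + ")"}},
    ]
    bool_query = {"filter": filters}
    if user_query:
        # Aspect 3: user query string, applied to the article text.
        bool_query["must"] = [
            {"query_string": {"query": user_query, "default_field": "text"}}
        ]
    return {"query": {"bool": bool_query}}


body = build_query(
    "2023-01-01",
    "2023-01-31",
    ["example.com", "example.org"],
    "(a OR b) AND (c OR d)",
)
```

Leaving out user_query gives the filter-only form used to get the
overall article count.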

Do the number of days, number of domains, and number of query
terms have equal impact?

High-level users often construct user queries (all applied to the
article text) of the form:

        (a OR b OR c ...) AND (d OR e OR f ...) ...

At a simple level, how do costs accrue?  By a simple count of the
(lowercase) terms, or by the product of the sums of the OR terms?
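
To make the two candidate models concrete, here is a small sketch
that computes both numbers for a query given as a list of OR groups.
The function names are made up for illustration, and nothing here
claims to be how Lucene actually accounts for cost; it only spells
out the arithmetic behind the question.

```python
from math import prod


def term_count(groups):
    """Model 1: cost grows with the total number of terms."""
    return sum(len(group) for group in groups)


def combination_count(groups):
    """Model 2: cost grows with the product of the OR group sizes."""
    return prod(len(group) for group in groups)


# (a OR b OR c) AND (d OR e OR f)
groups = [["a", "b", "c"], ["d", "e", "f"]]
print(term_count(groups))         # 6
print(combination_count(groups))  # 9
```

The two models diverge quickly: five groups of ten terms each give 50
under the first and 100000 under the second.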

Sometimes queries contain wildcards:

        (a* OR b OR c ...) AND (d* OR e OR f ...) ...

Do the wildcard matches simply increase the number of terms,
or are there other major costs?
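
What I mean by "increase the number of terms" is roughly the toy
model below: a trailing wildcard such as a* is looked up in a sorted
list of indexed terms and replaced by every term that starts with the
prefix.  This is only my mental picture, with a made-up helper and a
made-up term list; I don't know how closely it matches what Lucene
really does.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Return every term in a sorted list that starts with prefix."""
    start = bisect_left(sorted_terms, prefix)  # first candidate position
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matches.append(term)
    return matches


terms = sorted(["apple", "apply", "april", "banana", "band", "cat"])
print(expand_prefix(terms, "ap"))   # ['apple', 'apply', 'april']
print(expand_prefix(terms, "ban"))  # ['banana', 'band']
```

Under this model a single a* could stand for thousands of terms in a
large index, which is why I'm asking whether that expansion is the
whole cost or only part of it.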

Thanks in advance,
Phil

P.S.
https://lucene.apache.org/core/discussion.html points to an IRC
channel at freenode.net, but it's been down any time I've tried the
link, and the Slack channel seems to require an ASF-affiliated email.
