Hi,

> For our use case, we need to run queries which return the full
> matched result set. In some cases, this result set can be large (50k+
> results out of 4 million total documents).
> Perf test showed that just 4 threads running random queries returning 50k
> results make Lucene utilize 100% CPU on a 4-core machine (profiler
> screenshot
> <https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-c2e4-45b6-b98d-b7622b6ac801.png>).

This screenshot shows the problem: the search methods returning TopDocs (or
TopFieldDocs) should never be used to retrieve a large number of results or ALL
results. This is known as the "deep paging" problem. Lucene cannot easily
return "paged" results starting at a specific result page; it has to score all
results and insert them into a priority queue. This does not scale well because
the priority queue approach is made for quickly getting the top-ranking
results. So to get all results, don't call:
<https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,int)>

If you just want to get all results, you should write your own collector that
just retrieves document ids and processes them. For single-threaded search,
subclass SimpleCollector; for multithreaded search, the alternative is a
CollectorManager with a separate "reduce" step that merges the results of each
index segment. If you don't need the score, don't call the scoring methods of
the Scorable.

For this you have to create a subclass of SimpleCollector (and a
CollectorManager, if needed) and override the methods that the query internals
call as a kind of "notification" about which index segment you are in and
which hit, *relative* to this index segment, you are at. Important things:
- You get notified about each new segment via SimpleCollector#doSetNextReader.
Save the LeafReaderContext passed in a field of the collector for later use.
- If you need the scores, also override SimpleCollector#setScorer().
- For each search hit of the reader passed in the previous call, the
SimpleCollector#collect() method is called. Use the document id passed and
resolve it using the leaf reader to the actual document and its fields/doc
values. To get the score, ask the Scorable from the earlier setScorer() call.
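
The steps above could look roughly like the following sketch, written against
the Lucene 9.0 API. The class name AllHitsCollector and what it does with each
hit (recording the index-wide doc id in a list) are made up for illustration;
real code would more likely read doc values from the saved leaf reader.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Hypothetical example class: records the index-wide id of every hit.
class AllHitsCollector extends SimpleCollector {
  private final List<Integer> docIds = new ArrayList<>();
  private LeafReaderContext context; // current segment, replaced per leaf

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    // Called once per index segment, before any of its hits are collected.
    this.context = context;
  }

  @Override
  public void collect(int doc) throws IOException {
    // 'doc' is relative to the current segment; adding docBase makes it
    // index-wide. Doc values could be read here via context.reader().
    docIds.add(context.docBase + doc);
  }

  @Override
  public ScoreMode scoreMode() {
    // Scores are never requested, so Lucene can skip the scoring work.
    return ScoreMode.COMPLETE_NO_SCORES;
  }

  List<Integer> docIds() {
    return docIds;
  }
}
```

It would be passed to IndexSearcher#search(Query, Collector); afterwards
docIds() holds every match, not ranked by relevance. No priority queue is
involved at any point.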

Another approach is to use searchAfter with smaller windows, but for getting
all results this is still slower, as a priority queue has to be managed for
every window, too (just a smaller one).
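
For comparison, a searchAfter loop might be sketched as follows (Lucene 9.0
API). The class SearchAfterPaging, the method visitAll and the pageSize
parameter are hypothetical names; the loop only counts hits to keep the
example short.

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: walks all hits window by window and counts them.
class SearchAfterPaging {
  static int visitAll(IndexSearcher searcher, Query query, int pageSize)
      throws IOException {
    int seen = 0;
    ScoreDoc last = null; // null means "start from the top-ranked hit"
    while (true) {
      // Each call re-runs the query and fills a queue of up to pageSize hits.
      TopDocs page = searcher.searchAfter(last, query, pageSize);
      if (page.scoreDocs.length == 0) {
        break; // no hits left after 'last'
      }
      // Each ScoreDoc carries the index-wide doc id in its 'doc' field.
      seen += page.scoreDocs.length;
      // The last hit of this window is the starting point for the next one.
      last = page.scoreDocs[page.scoreDocs.length - 1];
    }
    return seen;
  }
}
```

Every window re-executes the query, which is why a custom collector is the
better fit when the whole result set is needed.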

> The query is very simple and contains only a single-term filter clause, all
> unrelated parts of the application are disabled, no stored fields are
> fetched, GC is doing minimal amount of work
> <https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-41c1-4af1-afcf-37d0c5f86054.png>

Lucene itself uses little heap space, so GC activity should always be low.

Uwe

