Hi,

> For our use case, we need to run queries which return the full
> matched result set. In some cases, this result set can be large (50k+
> results out of 4 million total documents).
> Perf test showed that just 4 threads running random queries returning 50k
> results make Lucene utilize 100% CPU on a 4-core machine (profiler
> screenshot
> <https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-c2e4-45b6-b98d-b7622b6ac801.png>).
This screenshot shows the problem: the search methods returning TopDocs (or TopFieldDocs) should never ever be used to retrieve a larger amount of results or ALL results. This is the so-called "deep paging" problem. Lucene cannot easily return "paged" results starting at a specific result page; it has to score all results and insert them into a priority queue. This does not scale well, because the priority queue approach is made for quickly getting the top-ranking results.

So to get all results, don't call:
<https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,int)>

If you just want to get all results, you should write your own collector (single-threaded as a subclass of SimpleCollector; an alternative is CollectorManager for multithreaded search, with a separate "reduce" step to merge the results of each index segment) that just retrieves document ids and processes them. If you don't need the score, don't call the scoring methods on the Scorable.

For this you have to create a subclass of SimpleCollector (and a CollectorManager, if needed) and implement its methods, which are called by the query internals as a kind of "notification" about which index segment you are in and which result *relative* to that index segment you got. Important things (see also the sketch at the end of this mail):

- You get notified about new segments through SimpleCollector#doSetNextReader. Save the passed context (LeafReaderContext) in a local field of the collector for later usage.
- If you need the scores, also implement SimpleCollector#setScorer().
- For each search hit of the segment passed in the previous call, the SimpleCollector#collect() method is called. Use the document id passed in and resolve it through the leaf reader to the actual document and its fields/doc values. To get the score, ask the Scorable from the previous call.

Another approach is to use searchAfter with smaller windows, but for getting all results this is still slower, as a priority queue has to be managed, too (just smaller ones).

> The query is very simple and contains only a single-term filter clause, all
> unrelated parts of the application are disabled, no stored fields are
> fetched, GC is doing minimal amount of work
> <https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-41c1-4af1-afcf-37d0c5f86054.png>

Lucene never uses much heap space, so GC should always be low.

Uwe
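
PS: Here is a minimal, untested sketch of such a collector that only gathers document ids and skips scoring entirely (class and field names are my own invention, adapt as needed):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

/** Collects the global (top-level) doc ids of all matching documents, without scores. */
public final class AllDocIdsCollector extends SimpleCollector {

  private final List<Integer> docIds = new ArrayList<>();
  private int docBase; // doc id offset of the current segment

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    // called once per index segment; save what you need from the context
    this.docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    // "doc" is relative to the current segment; add docBase to get the global id
    docIds.add(docBase + doc);
  }

  @Override
  public ScoreMode scoreMode() {
    // tell Lucene that no scores are needed, so scoring can be skipped
    return ScoreMode.COMPLETE_NO_SCORES;
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}

Use it like this:

AllDocIdsCollector collector = new AllDocIdsCollector();
searcher.search(query, collector);
// process collector.getDocIds()

If you need doc values or stored fields per hit, keep the LeafReaderContext (or iterators obtained from context.reader()) in fields during doSetNextReader() and resolve each hit directly in collect() instead of collecting ids into a list.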
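
PPS: For comparison, a rough sketch of the searchAfter approach with smaller windows, assuming an IndexSearcher "searcher" and a Query "query" are in scope (the page size of 1000 is an arbitrary choice):

ScoreDoc after = null;
while (true) {
  TopDocs page = searcher.searchAfter(after, query, 1000);
  if (page.scoreDocs.length == 0) {
    break; // no more hits
  }
  for (ScoreDoc sd : page.scoreDocs) {
    // process sd.doc (and sd.score, if needed)
  }
  // continue after the last hit of this page
  after = page.scoreDocs[page.scoreDocs.length - 1];
}

As said above, each call still maintains a (smaller) priority queue internally, so the collector approach is normally faster for retrieving all hits.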