Another approach to retrieving large result sets works if you have a unique sort key and don't mind receiving your results sorted by that key. You can then retrieve the results in batches using a cursor-style approach: request the top N sorted by the key, then request the top N such that the key is greater than the greatest value in the previous batch, and repeat until a batch comes back empty.
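A minimal sketch of this keyset/cursor batching idea in plain Java, with a sorted in-memory list standing in for the index; `nextBatch` is a hypothetical helper for illustration, not a Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class KeysetPagination {

    // Fetch up to batchSize keys strictly greater than afterKey,
    // from a source that is already sorted by the unique key.
    static List<Integer> nextBatch(List<Integer> sortedKeys, int afterKey, int batchSize) {
        return sortedKeys.stream()
                .filter(k -> k > afterKey)
                .limit(batchSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> keys = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        List<Integer> all = new ArrayList<>();
        int cursor = Integer.MIN_VALUE;  // first request is a plain "top N"
        List<Integer> batch;
        while (!(batch = nextBatch(keys, cursor, 3)).isEmpty()) {
            all.addAll(batch);
            cursor = batch.get(batch.size() - 1);  // greatest key in the last batch
        }
        System.out.println(all);  // prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```

Because the cursor is a key value rather than an offset, each request can use an ordinary range filter and a small top-N, so no deep priority queue is ever needed.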
On Tue, Mar 8, 2022 at 4:13 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> > For our use case, we need to run queries which return the full
> > matched result set. In some cases, this result set can be large (50k+
> > results out of 4 million total documents).
> > Perf tests showed that just 4 threads running random queries returning 50k
> > results make Lucene utilize 100% CPU on a 4-core machine (profiler
> > screenshot
> > <https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-c2e4-45b6-b98d-b7622b6ac801.png>).
>
> This screenshot shows the problem: the search methods returning TopDocs (or
> TopFieldDocs) should never be used to retrieve a large number of results, or ALL
> results. This is the "deep paging" problem. Lucene cannot easily return "paged"
> results starting at a specific result page; it has to score all results and
> insert them into a priority queue. This does not scale well, because the
> priority queue approach is designed for quickly getting the top-ranking
> results. So to get all results, don't call:
> <https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,int)>
>
> If you just want to get all results, you should write your own collector
> (single-threaded, as a subclass of SimpleCollector; an alternative is
> CollectorManager for multithreaded search, with a separate "reduce" step to
> merge the results of each index segment) that just retrieves document IDs and
> processes them. If you don't need the score, don't call the scoring methods
> on the Scorable.
>
> For this you have to create a subclass of SimpleCollector (and a
> CollectorManager, if needed) and implement its methods, which are called by the
> query internals as a kind of "notification" about which index segment you are
> on and which result *relative* to this index segment you got. Important things:
> - You get notified about new segments via SimpleCollector#doSetNextReader.
>   Save its content in a local field of the collector for later use.
> - If you also need the scores, implement SimpleCollector#setScorer().
> - For each search hit of the reader passed in the previous call, the
>   SimpleCollector#collect() method is called. Use the document ID passed in and
>   resolve it via the leaf reader to the actual document and its fields/doc
>   values. To get the score, ask the Scorable from the previous call.
>
> Another approach is to use searchAfter with smaller windows, but for getting
> all results this is still slower, as a priority queue has to be managed, too
> (just a smaller one).
>
> > The query is very simple and contains only a single-term filter clause, all
> > unrelated parts of the application are disabled, no stored fields are
> > fetched, and GC is doing a minimal amount of work
> > <https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-41c1-4af1-afcf-37d0c5f86054.png>
>
> Lucene never uses much heap space, so GC should always be low.
>
> Uwe
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> ---------------------------------------------------------------------
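The collector Uwe describes can be sketched roughly as follows, assuming Lucene 9's SimpleCollector API (the class name and the list-of-IDs payload are illustrative; a real implementation would process or stream hits instead of buffering them all):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Collects the index-wide doc IDs of all hits, without scoring.
public class AllDocIdsCollector extends SimpleCollector {

    private final List<Integer> docIds = new ArrayList<>();
    private int docBase;  // doc ID offset of the current segment

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        // Called once per index segment; keep what you need from the
        // context (here just the docBase) for use in collect().
        this.docBase = context.docBase;
    }

    @Override
    public void collect(int doc) throws IOException {
        // 'doc' is relative to the current segment; add docBase to
        // make it an index-wide ID. Resolve fields/doc values via the
        // leaf reader here if needed.
        docIds.add(docBase + doc);
    }

    @Override
    public ScoreMode scoreMode() {
        // Tells Lucene we never read scores, so it can skip scoring work.
        return ScoreMode.COMPLETE_NO_SCORES;
    }

    public List<Integer> getDocIds() {
        return docIds;
    }
}
```

Usage would be along the lines of `AllDocIdsCollector c = new AllDocIdsCollector(); searcher.search(query, c);` followed by iterating `c.getDocIds()`. No priority queue is involved, so the cost is linear in the number of hits.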