Another approach for retrieving large result sets can work if you have
a unique sort key and don't mind retrieving your results sorted by
that key. You can then retrieve the results in batches using a
cursor-style approach: request the top N sorted by the key, then
request the top N whose key is greater than the greatest value in the
last batch. Rinse and repeat.
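A rough sketch of what I mean, assuming a numeric field "id" that is
indexed both as a LongPoint (for the range filter) and as a
NumericDocValuesField (for sorting); the field name, batch size, and
`userQuery` are placeholders:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

// Cursor-style batching: fetch hits sorted by a unique "id" key, then
// repeat with a filter "id > greatest key seen" until nothing comes back.
Sort sort = new Sort(new SortField("id", SortField.Type.LONG));
long lastKey = Long.MIN_VALUE;
boolean firstBatch = true;
while (true) {
    BooleanQuery.Builder b = new BooleanQuery.Builder()
        .add(userQuery, BooleanClause.Occur.MUST);
    if (!firstBatch) {
        // only docs whose key is strictly greater than the last batch's maximum
        b.add(LongPoint.newRangeQuery("id", lastKey + 1, Long.MAX_VALUE),
              BooleanClause.Occur.FILTER);
    }
    TopFieldDocs batch = searcher.search(b.build(), 1000, sort);
    if (batch.scoreDocs.length == 0) {
        break; // exhausted
    }
    for (ScoreDoc sd : batch.scoreDocs) {
        // process sd.doc here
    }
    // the sort key of the last (greatest) hit becomes the next cursor
    lastKey = (Long) ((FieldDoc) batch.scoreDocs[batch.scoreDocs.length - 1]).fields[0];
    firstBatch = false;
}
```

IndexSearcher#searchAfter does essentially the same thing with less
bookkeeping, at the cost of the per-batch priority queue Uwe mentions
below.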

On Tue, Mar 8, 2022 at 4:13 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> > For our use case, we need to run queries which return the full
> > matched result set. In some cases, this result set can be large (50k+
> > results out of 4 million total documents).
> > Perf test showed that just 4 threads running random queries returning 50k
> > results make Lucene utilize 100% CPU on a 4-core machine (profiler
> > screenshot
> > <https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-
> > c2e4-45b6-b98d-b7622b6ac801.png>).
>
> This screenshot shows the problem: The search methods returning TopDocs (or 
> TopFieldDocs) should never ever be used to retrieve a large number of 
> results, let alone ALL results. This is known as the "deep paging" problem. 
> Lucene cannot easily return "paged" results starting at a specific result 
> page; it has to score all results and insert them into a priority queue. 
> This does not scale well, because the priority queue approach is made for 
> quickly getting the top-ranking results. So to get all results, don't call: 
> <https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,int)>
>
> If you just want to get all results, you should write your own collector 
> (single-threaded, as a subclass of SimpleCollector; an alternative is 
> CollectorManager for multithreaded search, with a separate "reduce" step to 
> merge the results of each index segment) that just retrieves document ids 
> and processes them. If you don't need the score, don't call the scoring 
> methods on the Scorable.
>
> For this you have to create a subclass of SimpleCollector (and a 
> CollectorManager, if needed) and implement its methods, which are called by 
> the query internals as a kind of "notification" about which index segment 
> you are on and which result, *relative* to this index segment, you have hit. 
> Important things:
> - you get notified about new segments via SimpleCollector#doSetNextReader; 
> save the context in a local field of the collector for later use
> - if you also need the scores, implement SimpleCollector#setScorer()
> - for each search hit of the reader passed in the previous call, the 
> SimpleCollector#collect() method is called. Use the document id passed in 
> and resolve it via the leaf reader to the actual document and its fields/doc 
> values. To get the score, ask the Scorable from the previous call.
>
> Another approach is to use searchAfter with smaller windows, but for getting 
> all results this is still slower, as a priority queue has to be managed, too 
> (just a smaller one).
>
> > The query is very simple and contains only a single-term filter clause, all
> > unrelated parts of the application are disabled, no stored fields are
> > fetched, GC is doing minimal amount of work
> > <https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-
> > 41c1-4af1-afcf-37d0c5f86054.png>
>
> Lucene never uses much heap space, so GC should always be low.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
