[ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704391#comment-16704391 ]

Christoph Kaser edited comment on LUCENE-8542 at 11/30/18 8:28 AM:
-------------------------------------------------------------------

I think it would be nice to have the option to grow the heap dynamically. 
However, given the way _TopScoreDocCollector_ and _TopDocsCollector_ are 
currently built, a Lucene user would have to copy the complete source code of 
those classes and adapt them to use a _java.util.PriorityQueue_ (probably 
with worse performance than _org.apache.lucene.util.PriorityQueue_).

This is certainly possible, but it would mean a lot of code duplication from 
the perspective of a Lucene user, because the priority queue implementation 
can't be changed easily.

I think that this patch makes sense anyway: segment sizes vary widely in a 
typical index, and there are usually far more small segments than large ones. 
Given that the default implementation of IndexSearcher.slices() returns one 
slice per segment, that means a lot of wasted memory for every query whose 
_numHits_ exceeds the typical size of a small segment. I don't think it has 
any negative impact on queries with a small _numHits_, because it only adds 
one Math.min per segment.

It also helps with my problem: for an index with 28 segments and 13,360,068 
documents and a search with numHits=5,000,000, it makes the difference between 
creating priority queues with a combined size of 140,000,000 versus 13,360,068 
entries. As the following table shows, there are benefits for searches with 
more reasonable numHits values as well (all measured against my index):

 
||numHits||Combined size w/o patch||Combined size with patch||
|10,000,000|280,000,000|13,360,068|
|5,000,000|140,000,000|13,360,068|
|1,000,000|28,000,000|6,870,854|
|100,000|2,800,000|1,632,997|
|50,000|1,400,000|1,015,274|
|10,000|280,000|252,528|
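The arithmetic behind these numbers is a single Math.min per segment. A minimal sketch (with made-up segment sizes for illustration, not the actual index above):

```java
public class SliceSizing {

    // Without the patch: every per-segment TopScoreDocCollector allocates
    // a priority queue of numHits entries, regardless of segment size.
    static long combinedSizeWithoutPatch(long[] segmentSizes, long numHits) {
        return (long) segmentSizes.length * numHits;
    }

    // With the patch: each queue is capped at its slice's document count,
    // i.e. one Math.min per segment.
    static long combinedSizeWithPatch(long[] segmentSizes, long numHits) {
        long total = 0;
        for (long size : segmentSizes) {
            total += Math.min(numHits, size);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical segment sizes: one large segment, several small ones.
        long[] segments = {8_000_000, 3_000_000, 500_000, 200_000, 50_000};
        long numHits = 5_000_000;
        System.out.println(combinedSizeWithoutPatch(segments, numHits)); // 25000000
        System.out.println(combinedSizeWithPatch(segments, numHits));    // 8750000
    }
}
```

Note that the capped total can never exceed the total document count of the index, which is why the patched column plateaus at 13,360,068.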

 



> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Christoph Kaser
>            Priority: Minor
>         Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector to the 
> LeafSlice size.
> If this is something that would make sense for you, I can try to provide a 
> patch.
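
The backwards-compatible variant proposed in the second bullet could be sketched as follows. This is a hypothetical shape, not the actual patch; the LeafSlice stub below only models the document count, so the sketch compiles without Lucene on the classpath:

```java
import java.util.Collection;

public class CollectorManagerSketch {

    // Stub standing in for IndexSearcher.LeafSlice (the real one holds an
    // array of LeafReaderContext); only the total doc count matters here.
    static class LeafSlice {
        final int totalDocCount;
        LeafSlice(int totalDocCount) { this.totalDocCount = totalDocCount; }
    }

    interface Collector {}

    // Hypothetical interface change: a new overload whose default
    // implementation delegates to the old method, so existing
    // CollectorManager implementations keep compiling unchanged.
    interface CollectorManager<C extends Collector, T> {
        C newCollector();

        default C newCollector(LeafSlice slice) {
            return newCollector(); // old implementations ignore the slice
        }

        T reduce(Collection<C> collectors);
    }

    // Toy collector that only records its priority-queue size.
    static class SizedCollector implements Collector {
        final int numHits;
        SizedCollector(int numHits) { this.numHits = numHits; }
    }

    // A manager that uses the new overload to cap numHits per slice.
    static CollectorManager<SizedCollector, Integer> cappingManager(int requestedHits) {
        return new CollectorManager<SizedCollector, Integer>() {
            @Override public SizedCollector newCollector() {
                return new SizedCollector(requestedHits);
            }
            @Override public SizedCollector newCollector(LeafSlice slice) {
                // never allocate a queue larger than the slice can fill
                return new SizedCollector(Math.min(requestedHits, slice.totalDocCount));
            }
            @Override public Integer reduce(Collection<SizedCollector> collectors) {
                return collectors.size();
            }
        };
    }

    public static void main(String[] args) {
        CollectorManager<SizedCollector, Integer> m = cappingManager(5_000_000);
        System.out.println(m.newCollector(new LeafSlice(8_000)).numHits);      // 8000
        System.out.println(m.newCollector(new LeafSlice(20_000_000)).numHits); // 5000000
    }
}
```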



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
