Christoph Kaser created LUCENE-8542:
---------------------------------------
Summary: Provide the LeafSlice to CollectorManager.newCollector to
save memory on small index slices
Key: LUCENE-8542
URL: https://issues.apache.org/jira/browse/LUCENE-8542
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Christoph Kaser
I have an index consisting of 44 million documents spread across 60 segments.
When I run a query against this index with a huge number of results requested
(e.g. 5 million), the query uses more than 5 GB of heap if the IndexSearcher is
configured to use an ExecutorService.
(I know this kind of query is fairly unusual and it would be better to use
paging and searchAfter, but our architecture does not allow this at the moment.)
The reason for the huge memory requirement is that the search [will create a
TopScoreDocCollector for each
segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
each one with numHits = 5 million. This is fine for the large segments, but
many of those segments are fairly small and only contain several thousand
documents. This wastes a huge amount of memory for queries with large values of
numHits on indices with many segments.
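
As a rough plausibility check of the numbers above, here is a
back-of-the-envelope sketch. It assumes that every collector pre-allocates one
queue entry per requested hit and that an entry costs about 28 bytes (a guessed
figure for a small hit object plus its array reference, not a measurement).
HeapEstimateSketch and estimateBytes are hypothetical names, not Lucene code.

```java
class HeapEstimateSketch {

    /** Bytes needed if every collector pre-allocates one entry per requested hit. */
    static long estimateBytes(int collectors, int numHits, int bytesPerEntry) {
        // Widen to long first: the product does not fit into an int.
        return (long) collectors * numHits * bytesPerEntry;
    }

    public static void main(String[] args) {
        int collectors = 60;      // one collector per segment, as described above
        int numHits = 5_000_000;  // requested result count from the report
        int bytesPerEntry = 28;   // ASSUMPTION: ~24-byte hit object + 4-byte reference
        long bytes = estimateBytes(collectors, numHits, bytesPerEntry);
        System.out.printf("estimated heap: ~%.1f GB%n", bytes / 1e9);
    }
}
```

Under these assumptions the estimate comes out at several gigabytes, which is
in the same range as the heap usage I observed.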
Therefore, I propose to change the CollectorManager interface in the
following way:
* Change the method newCollector to accept a LeafSlice parameter that can be
used to determine the total number of documents in that LeafSlice
* Maybe, in order to remain backwards compatible, it would be possible to
introduce this as a new method with a default implementation that calls the old
method - otherwise, it probably has to wait for Lucene 8?
* This can then be used to cap numHits for each TopScoreDocCollector at the
number of documents in its LeafSlice.
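
To illustrate the idea, here is a minimal, self-contained sketch. It does not
use the real Lucene classes: Slice, TopHitsCollector and SliceAwareManager are
hypothetical stand-ins for LeafSlice, TopScoreDocCollector and CollectorManager,
and the actual signatures would have to be worked out in a patch. It only shows
the backwards-compatible default method and the capping of numHits.

```java
class SliceAwareSketch {

    /** Stand-in for a LeafSlice; only its total document count is modelled. */
    static final class Slice {
        final int docCount;
        Slice(int docCount) { this.docCount = docCount; }
    }

    /** Stand-in for a top-hits collector; records how many queue slots it would allocate. */
    static final class TopHitsCollector {
        final int capacity;
        TopHitsCollector(int capacity) { this.capacity = capacity; }
    }

    /** Hypothetical CollectorManager variant with the proposed slice-aware overload. */
    interface SliceAwareManager<C> {
        C newCollector();

        // The default delegates to the old method, so existing implementations keep working.
        default C newCollector(Slice slice) {
            return newCollector();
        }
    }

    /** Manager that caps the queue size at the number of documents in the slice. */
    static SliceAwareManager<TopHitsCollector> cappedManager(final int numHits) {
        return new SliceAwareManager<TopHitsCollector>() {
            @Override
            public TopHitsCollector newCollector() {
                return new TopHitsCollector(numHits);
            }

            @Override
            public TopHitsCollector newCollector(Slice slice) {
                // ASSUMPTION: keep at least one slot so an empty slice still gets a valid size.
                return new TopHitsCollector(Math.max(1, Math.min(numHits, slice.docCount)));
            }
        };
    }

    /** Sums the queue slots allocated when one collector is created per slice. */
    static long totalSlots(SliceAwareManager<TopHitsCollector> manager, int[] sliceDocCounts) {
        long total = 0;
        for (int docCount : sliceDocCounts) {
            total += manager.newCollector(new Slice(docCount)).capacity;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docCounts = {20_000_000, 3_000, 8_000}; // made-up slice sizes for illustration
        System.out.println("slots with cap: " + totalSlots(cappedManager(5_000_000), docCounts));
    }
}
```

With a cap like this, a slice holding only a few thousand documents would
allocate a few thousand slots instead of numHits, while a manager that only
implements the old method would behave exactly as before.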
If this is something that would make sense to you, I can try to provide a
patch.