[
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704391#comment-16704391
]
Christoph Kaser edited comment on LUCENE-8542 at 11/30/18 8:28 AM:
-------------------------------------------------------------------
I think it would be nice to have the option to grow the heap dynamically.
However, the way _TopScoreDocCollector_ and _TopDocsCollector_ are currently
built, for a Lucene user that would mean copying the complete source code of
those classes and adapting them to use a _java.util.PriorityQueue_ (probably
with worse performance than _org.apache.lucene.util.PriorityQueue_).
This is certainly possible, but would mean a lot of code duplication (from the
perspective of a Lucene user, because the priority queue in use can't be
changed easily).
I think that this patch makes sense anyway: The size of segments has a very
wide range in a typical index, and usually there are a lot more small segments
than large ones. Given that the default implementation of
IndexSearcher.slices() returns one slice per segment, that means a lot of
wasted memory for all queries that have a _numHits_ greater than the typical
size of a small segment. I don't think it has any negative impact on queries
with a small value of numHits, because it only adds one Math.min per segment.
It also helps with my problem: for an index with 28 segments and 13,360,068
documents and a search with numHits=5,000,000, it makes the difference between
creating priority queues with a combined size of 140,000,000 vs. 13,360,068. As
you can see in the following table, there are benefits for searches with a more
reasonable numHits value as well (all against my index):
||numHits||Combined size w/o patch||Combined size with patch||
|10,000,000|280,000,000|13,360,068|
|5,000,000|140,000,000|13,360,068|
|1,000,000|28,000,000|6,870,854|
|100,000|2,800,000|1,632,997|
|50,000|1,400,000|1,015,274|
|10,000|280,000|252,528|
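
The arithmetic behind the table can be sketched as follows. This is only an illustration, not code from the patch: the class and method names are made up, and the slice sizes in main are hypothetical, because the real segment sizes of the index are not listed here. Without the cap, every slice gets a queue of numHits entries; with the cap, each queue is limited to the document count of its slice.

```java
import java.util.stream.LongStream;

public class QueueSizeEstimate {
    // Combined queue size when every slice allocates a queue of numHits entries.
    static long uncapped(long[] sliceDocCounts, long numHits) {
        return sliceDocCounts.length * numHits;
    }

    // Combined queue size when each queue is capped at the slice's document count.
    static long capped(long[] sliceDocCounts, long numHits) {
        return LongStream.of(sliceDocCounts)
                .map(docCount -> Math.min(docCount, numHits))
                .sum();
    }

    public static void main(String[] args) {
        // Hypothetical slice sizes: a few large segments and several small ones.
        long[] sliceDocCounts = {6_000_000, 3_000_000, 500_000, 20_000, 5_000};
        long numHits = 1_000_000;
        System.out.println("without cap: " + uncapped(sliceDocCounts, numHits));
        System.out.println("with cap:    " + capped(sliceDocCounts, numHits));
    }
}
```

Once numHits is at least as large as the largest slice, the capped total equals the number of documents in the index, which is why the last column of the table levels off at 13,360,068.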
> Provide the LeafSlice to CollectorManager.newCollector to save memory on
> small index slices
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8542
> URL: https://issues.apache.org/jira/browse/LUCENE-8542
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Christoph Kaser
> Priority: Minor
> Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments.
> When I run a query against this index with a huge number of results requested
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use
> paging and searchAfter, but our architecture does not allow this at the
> moment.)
> The reason for the huge memory requirement is that the search [will create a
> TopScoreDocCollector for each
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
> each one with numHits = 5 million. This is fine for the large segments, but
> many of those segments are fairly small and only contain several thousand
> documents. This wastes a huge amount of memory for queries with large values
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the
> following way:
> * change the method newCollector to accept a parameter LeafSlice that can be
> used to determine the total count of documents in the LeafSlice
> * Maybe, in order to remain backwards compatible, it would be possible to
> introduce this as a new method with a default implementation that calls the
> old method - otherwise, it probably has to wait for Lucene 8?
> * This can then be used to cap numHits for each TopScoreDocCollector to the
> size of its LeafSlice.
> If this is something that would make sense for you, I can try to provide a
> patch.
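
A minimal sketch of the backwards-compatible variant described in the list above, under stated assumptions: Slice, Manager and TopDocsStub are hypothetical stand-ins written for this example, not the real Lucene LeafSlice, CollectorManager or TopScoreDocCollector, and the reduce step is left out. It only shows how a default method could delegate to the existing newCollector() while letting an implementation cap numHits per slice.

```java
public class SliceAwareSketch {
    /** Hypothetical stand-in for a leaf slice; only its document count matters here. */
    static final class Slice {
        final int docCount;
        Slice(int docCount) { this.docCount = docCount; }
    }

    /** Simplified stand-in for a collector manager (reduce step omitted). */
    interface Manager<C> {
        // Existing method: knows nothing about the slice it collects for.
        C newCollector();

        // Proposed addition: the default delegates to the old method,
        // so existing implementations keep working unchanged.
        default C newCollector(Slice slice) {
            return newCollector();
        }
    }

    /** Hypothetical collector that only records the queue size it would allocate. */
    static final class TopDocsStub {
        final int queueSize;
        TopDocsStub(int queueSize) { this.queueSize = queueSize; }
    }

    // An implementation that was written before the change: always uses numHits.
    static Manager<TopDocsStub> legacyManager(final int numHits) {
        return () -> new TopDocsStub(numHits);
    }

    // An implementation that overrides the new method and caps the queue size.
    static Manager<TopDocsStub> cappedManager(final int numHits) {
        return new Manager<TopDocsStub>() {
            @Override public TopDocsStub newCollector() {
                return new TopDocsStub(numHits);
            }
            @Override public TopDocsStub newCollector(Slice slice) {
                return new TopDocsStub(Math.min(numHits, slice.docCount));
            }
        };
    }

    public static void main(String[] args) {
        Slice small = new Slice(3_000); // a small segment with a few thousand documents
        System.out.println(legacyManager(5_000_000).newCollector(small).queueSize);
        System.out.println(cappedManager(5_000_000).newCollector(small).queueSize);
    }
}
```

In this sketch the legacy manager still allocates the full numHits for the small slice, while the slice-aware one allocates only as many entries as the slice has documents.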