[
https://issues.apache.org/jira/browse/SOLR-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233876#comment-15233876
]
David Smiley commented on SOLR-8922:
------------------------------------
Wow; I'm (pleasantly) surprised to see such a general performance increase; I
thought this was just about saving memory. Why is it faster? Less GC time?
I'm confused by the benchmark and/or I don't understand the setup.
bq. 20% chance of a document missing the value for a field.
Put another way, do you mean any given term has an 80% chance of being in the
doc?
I'm confused about why the number of terms in the field has anything to do
with the performance of this patch. Perhaps what you've done in your benchmark
is have the fields with more terms result in any given term matching fewer
documents? I think it would be far clearer to report the performance increase
over a varying number of docs counted into the doc set; however many terms are
in the field doesn't really matter in and of itself (I think). Couldn't you
have done all of this in one field and just chosen your 50 term queries from
terms that have the same(ish) document frequency? That frequency could be
expressed as a percentage of the total docs, making the numbers more generally
interpretable.
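A rough sketch of what that term selection could look like in plain Lucene
(the class and method names here are mine and purely illustrative, not
anything in Solr's test code): walk the terms of the field per segment and
keep only terms whose docFreq falls inside a target selectivity band.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermPicker {
  /**
   * Collects up to 'limit' terms of 'field' whose docFreq, as a fraction of
   * the segment's maxDoc, falls between minFrac and maxFrac (e.g. 0.009..0.011
   * for terms matching roughly 1% of docs).
   */
  public static List<String> pickTerms(IndexReader reader, String field,
      double minFrac, double maxFrac, int limit) throws IOException {
    List<String> picked = new ArrayList<>();
    for (LeafReaderContext ctx : reader.leaves()) {
      int segMaxDoc = ctx.reader().maxDoc();
      Terms terms = ctx.reader().terms(field);
      if (terms == null) continue;
      TermsEnum te = terms.iterator();
      for (BytesRef term = te.next();
           term != null && picked.size() < limit;
           term = te.next()) {
        // docFreq is per-segment here; for benchmark term selection that's a
        // good enough proxy for index-wide selectivity
        double frac = (double) te.docFreq() / segMaxDoc;
        if (frac >= minFrac && frac <= maxFrac) {
          picked.add(term.utf8ToString());
        }
      }
      if (picked.size() >= limit) break;
    }
    return picked;
  }
}
{code}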
> DocSetCollector can allocate massive garbage on large indexes
> -------------------------------------------------------------
>
> Key: SOLR-8922
> URL: https://issues.apache.org/jira/browse/SOLR-8922
> Project: Solr
> Issue Type: Improvement
> Reporter: Jeff Wartes
> Assignee: Yonik Seeley
> Attachments: SOLR-8922.patch, SOLR-8922.patch
>
>
> After reaching a point of diminishing returns tuning the GC collector, I
> decided to take a look at where the garbage was coming from. To my surprise,
> it turned out that for my index and query set, almost 60% of the garbage was
> coming from this single line:
> https://github.com/apache/lucene-solr/blob/94c04237cce44cac1e40e1b8b6ee6a6addc001a5/solr/core/src/java/org/apache/solr/search/DocSetCollector.java#L49
> This is due to the simple fact that I have 86M documents in my shards.
> Allocating a scratch array big enough to track a result set 1/64th the size
> of my index (1.3M entries) is also almost certainly excessive, considering my
> 99.9th percentile hit count is less than 56k.
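One way to avoid that up-front allocation is to grow the small-set buffer in
doubling chunks so memory tracks the actual hit count rather than maxDoc/64.
A minimal sketch of that chunked-growth idea (names and chunk sizes are
illustrative, not the actual SOLR-8922 patch):
{code:java}
import java.util.ArrayList;
import java.util.List;

class ExpandingIntBuffer {
  private static final int FIRST_CHUNK = 1024;
  private final List<int[]> chunks = new ArrayList<>();
  private int[] current;      // chunk currently being filled
  private int upto;           // next free slot in 'current'
  private int size;           // total number of ints stored

  void add(int docId) {
    if (current == null || upto == current.length) {
      // double the chunk size each time, so total allocation stays
      // proportional to the number of collected docs
      int nextSize = (current == null) ? FIRST_CHUNK : current.length * 2;
      current = new int[nextSize];
      chunks.add(current);
      upto = 0;
    }
    current[upto++] = docId;
    size++;
  }

  int size() {
    return size;
  }

  /** Copies the collected doc ids into one contiguous array. */
  int[] toArray() {
    int[] out = new int[size];
    int pos = 0;
    for (int[] chunk : chunks) {
      int len = Math.min(chunk.length, size - pos);
      System.arraycopy(chunk, 0, out, pos, len);
      pos += len;
    }
    return out;
  }
}
{code}
On an 86M-doc shard with a 99.9th percentile hit count under 56k, this would
allocate a handful of chunks totalling on the order of 64k ints per query
instead of a ~1.3M-int scratch array every time.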