[
https://issues.apache.org/jira/browse/LUCENE-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-8040:
---------------------------------
Attachment: lucenecollectionStatisticsbench.zip
I considered a few alternatives to this today using JMH:
* Cache MultiFields on the IndexSearcher
* Compute the CollectionStatistics raw, immediately (don't look up or cache)
* Add a ConcurrentHashMap<String,CollectionStatistics> on the IndexSearcher and
compute on demand.
Attached is the JMH benchmark program. Between runs I would change line 78 to
call out to the impl I wanted to try. The JMH main class is
"org.openjdk.jmh.Main" and I used args "-wi 5 -i 5 -f 1" (5 warmup
iterations, 5 measurement iterations, 1 fork).
My annotated results are:
{noformat}
Result "dsmiley.MyBenchmark.bench": IndexSearcher
1146.739 ±(99.9%) 280.645 us/op [Average]
(min, avg, max) = (1034.410, 1146.739, 1238.123), stdev = 72.883
CI (99.9%): [866.094, 1427.385] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": cached MultiFields
29.556 ±(99.9%) 8.929 us/op [Average]
(min, avg, max) = (27.409, 29.556, 33.424), stdev = 2.319
CI (99.9%): [20.626, 38.485] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": raw compute
951.494 ±(99.9%) 182.555 us/op [Average]
(min, avg, max) = (904.328, 951.494, 1024.473), stdev = 47.409
CI (99.9%): [768.940, 1134.049] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": ConcurrentHashMap
4.448 ±(99.9%) 1.268 us/op [Average]
(min, avg, max) = (4.090, 4.448, 4.860), stdev = 0.329
CI (99.9%): [3.180, 5.717] (assumes normal distribution)
For 5 fields:
raw: 10.716 us/op
ConcurrentHashMap: 0.155 us/op
{noformat}
I think the results clearly favor the ConcurrentHashMap: about 4.4 us/op,
versus about 1147 us/op for the current IndexSearcher.
I'm aware my benchmark implementation of this needs some more work. If an
IOException is thrown, it should pass through without a RuntimeException
wrapper. And if the field doesn't exist, we want to return null.
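
A minimal sketch of how those two fixes might look on top of a
ConcurrentHashMap, offered as an illustration only; it is neither the attached
benchmark code nor the eventual patch. FieldStatsCache and StatsLoader are
hypothetical names, and the type parameter V stands in for
CollectionStatistics so the example compiles without Lucene on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical loader: computes the stats for one field, or returns null if the field is absent. */
interface StatsLoader<V> {
  V load(String field) throws IOException;
}

/** Hypothetical per-searcher cache; V is a stand-in for CollectionStatistics. */
final class FieldStatsCache<V> {
  private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<>();
  private final StatsLoader<V> loader;

  FieldStatsCache(StatsLoader<V> loader) {
    this.loader = loader;
  }

  /** Returns the cached or freshly computed stats, or null if the field does not exist. */
  V get(String field) throws IOException {
    try {
      return cache.computeIfAbsent(field, f -> {
        try {
          // A null result means "no such field": computeIfAbsent records no
          // mapping in that case and itself returns null.
          return loader.load(f);
        } catch (IOException e) {
          // Lambdas passed to computeIfAbsent cannot throw checked exceptions,
          // so tunnel the IOException out of the map call.
          throw new UncheckedIOException(e);
        }
      });
    } catch (UncheckedIOException e) {
      // Unwrap so callers see the original IOException, not a runtime wrapper.
      throw e.getCause();
    }
  }
}
```

One consequence of this design is that a missing field is never cached, so
each lookup for it calls the loader again. Whether that matters depends on how
often queries reference nonexistent fields.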
> Optimize IndexSearcher.collectionStatistics
> -------------------------------------------
>
> Key: LUCENE-8040
> URL: https://issues.apache.org/jira/browse/LUCENE-8040
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 7.2
>
> Attachments: lucenecollectionStatisticsbench.zip
>
>
> {{IndexSearcher.collectionStatistics(field)}} can do a fair amount of work
> because with each invocation it will call {{MultiFields.getTerms(...)}}. The
> effects of this are aggravated for queries with many fields since each field
> will want statistics, and also aggravated when there are many segments.