[jira] [Updated] (LUCENE-8040) Optimize IndexSearcher.collectionStatistics

David Smiley (JIRA) Mon, 06 Nov 2017 10:59:16 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Smiley updated LUCENE-8040:
---------------------------------
    Attachment: lucenecollectionStatisticsbench.zip

I considered a few alternatives to this today using JMH:
* Cache MultiFields on the IndexSearcher
* Compute the CollectionStatistics raw, immediately (don't look up or cache)
* Add a ConcurrentHashMap<String,CollectionStatistics> on the IndexSearcher and 
compute on demand.

Attached is the JMH benchmark program.  Between runs I changed line 78 to 
call the implementation I wanted to try.  The JMH main class is 
"org.openjdk.jmh.Main" and I used the args "-wi 5 -i 5 -f 1" (5 warmup 
iterations, 5 measurement iterations, 1 fork).

My annotated results are:
{noformat}
Result "dsmiley.MyBenchmark.bench":    IndexSearcher
  1146.739 ±(99.9%) 280.645 us/op [Average]
  (min, avg, max) = (1034.410, 1146.739, 1238.123), stdev = 72.883
  CI (99.9%): [866.094, 1427.385] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":    cached MultiFields
  29.556 ±(99.9%) 8.929 us/op [Average]
  (min, avg, max) = (27.409, 29.556, 33.424), stdev = 2.319
  CI (99.9%): [20.626, 38.485] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":    raw compute 
  951.494 ±(99.9%) 182.555 us/op [Average]
  (min, avg, max) = (904.328, 951.494, 1024.473), stdev = 47.409
  CI (99.9%): [768.940, 1134.049] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":   ConcurrentHashMap
  4.448 ±(99.9%) 1.268 us/op [Average]
  (min, avg, max) = (4.090, 4.448, 4.860), stdev = 0.329
  CI (99.9%): [3.180, 5.717] (assumes normal distribution)


For 5 fields:
raw:               10.716
ConcurrentHashMap:  0.155 us/op
{noformat}

I think the results make it pretty clear that we should go with the 
ConcurrentHashMap.  
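
To put the averages above side by side, here is a small sketch that turns them 
into ratios.  It assumes the run labelled "IndexSearcher" is the existing 
behaviour and that the 5-field "raw" figure is in us/op like the others; the 
class and method names are made up for illustration.  With only 5 measurement 
iterations the confidence intervals are wide, so treat the ratios as rough.

```java
/** Ratios derived from the averages reported above (numbers copied from the results). */
final class BenchSpeedups {

    /** How many times faster the candidate is than the baseline; both in us/op. */
    static double speedup(double baselineUsPerOp, double candidateUsPerOp) {
        return baselineUsPerOp / candidateUsPerOp;
    }

    public static void main(String[] args) {
        // Assumption: the "IndexSearcher" run is the existing behaviour (the baseline).
        double indexSearcher = 1146.739;

        System.out.printf("cached MultiFields: %.1fx%n", speedup(indexSearcher, 29.556));
        System.out.printf("raw compute:        %.1fx%n", speedup(indexSearcher, 951.494));
        System.out.printf("ConcurrentHashMap:  %.1fx%n", speedup(indexSearcher, 4.448));

        // 5-field case: raw vs ConcurrentHashMap (unit of "raw" assumed to be us/op).
        System.out.printf("5 fields, CHM vs raw: %.1fx%n", speedup(10.716, 0.155));
    }
}
```

That works out to roughly 39x for cached MultiFields, 1.2x for raw compute, and 
258x for the ConcurrentHashMap, with about 69x for the ConcurrentHashMap over 
raw in the 5-field case.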

I'm aware my benchmark implementation of this needs some more work: if an 
IOException is thrown, it should pass through without a RuntimeException 
wrapper, and if the field doesn't exist, we want to return null.
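
One way those two fixes could look, as a minimal sketch only: this is not the 
attached benchmark code and not Lucene's API.  FieldStatsCache, StatsLoader and 
the Long value type are hypothetical stand-ins for the per-searcher map and the 
CollectionStatistics computation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-searcher cache: computes a field's statistics on first request. */
final class FieldStatsCache<V> {

    /** Hypothetical loader; returns null when the field does not exist. */
    interface StatsLoader<T> {
        T load(String field) throws IOException;
    }

    // ConcurrentHashMap rejects null values, so a missing field is stored as Optional.empty().
    private final ConcurrentHashMap<String, Optional<V>> cache = new ConcurrentHashMap<>();
    private final StatsLoader<V> loader;

    FieldStatsCache(StatsLoader<V> loader) {
        this.loader = loader;
    }

    /** Returns the cached or freshly computed value, or null if the field does not exist. */
    V get(String field) throws IOException {
        try {
            return cache.computeIfAbsent(field, f -> {
                try {
                    return Optional.ofNullable(loader.load(f));
                } catch (IOException e) {
                    // The lambda cannot throw checked exceptions, so tunnel it out.
                    throw new UncheckedIOException(e);
                }
            }).orElse(null);
        } catch (UncheckedIOException e) {
            // Unwrap so the caller sees the original IOException, not a runtime wrapper.
            throw e.getCause();
        }
    }

    public static void main(String[] args) throws IOException {
        AtomicInteger loads = new AtomicInteger();
        FieldStatsCache<Long> stats = new FieldStatsCache<Long>(field -> {
            loads.incrementAndGet();
            if (field.equals("broken")) {
                throw new IOException("simulated read failure");
            }
            return field.equals("title") ? 42L : null; // only "title" exists here
        });

        System.out.println(stats.get("title"));   // computed by the loader
        System.out.println(stats.get("title"));   // served from the map
        System.out.println(stats.get("missing")); // unknown field
        System.out.println("loader calls: " + loads.get());
        try {
            stats.get("broken");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Because computeIfAbsent records no mapping when the function throws, a failed 
load is not cached and a later call will retry.  Caching the "field doesn't 
exist" answer as Optional.empty() is a design choice of this sketch, not 
something decided in the issue.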

> Optimize IndexSearcher.collectionStatistics
> -------------------------------------------
>
>                 Key: LUCENE-8040
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8040
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 7.2
>
>         Attachments: lucenecollectionStatisticsbench.zip
>
>
> {{IndexSearcher.collectionStatistics(field)}} can do a fair amount of work 
> because with each invocation it will call {{MultiFields.getTerms(...)}}.  The 
> effects of this are aggravated for queries with many fields since each field 
> will want statistics, and also aggravated when there are many segments.



