[ 
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley reassigned LUCENE-8980:
------------------------------------

    Assignee: David Smiley

> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
>                 Key: LUCENE-8980
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8980
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: Guoqiang Jiang
>            Assignee: David Smiley
>            Priority: Major
>              Labels: performance
>             Fix For: master (9.0)
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, each document has an _id field that uniquely identifies it, 
> which is indexed so that documents can be looked up from Lucene. When users 
> write to Elasticsearch with self-generated _id values, Elasticsearch has to 
> check _id uniqueness through the Lucene API for each document, even if the 
> conflict rate is very low, which results in poor write performance.
>  
> *Solution*
> 1. Choose a better _id generator before writing to ES
> Different _id formats have a great impact on write performance. We have 
> verified this in a production cluster. Users can refer to the following blog 
> post and choose a better _id generator.
> [http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]
> 2. Optimise with min/maxTerm metrics in Lucene
> As Lucene stores min/maxTerm metrics for each segment and field, we can use 
> those metrics to optimise the performance of the Lucene lookup API. When 
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we 
> can first check whether the term falls within the range of minTerm and 
> maxTerm, so that we skip segments that cannot contain it as early as 
> possible.
>  
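The range check described in item 2 can be sketched in plain Java as follows. This is a minimal illustration, not the patch attached to the issue: SegmentTermRange and mightContain are hypothetical names, and the bounds are held as UTF-8 byte arrays so the sketch runs without Lucene on the classpath. In Lucene itself the per-field bounds are exposed by Terms.getMin() and Terms.getMax(), and terms sort in unsigned byte order, which is what Arrays.compareUnsigned (Java 9+) reproduces here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for one segment's min/max term of a field; not a Lucene class. */
final class SegmentTermRange {
    private final byte[] minTerm;
    private final byte[] maxTerm;

    SegmentTermRange(String minTerm, String maxTerm) {
        this.minTerm = minTerm.getBytes(StandardCharsets.UTF_8);
        this.maxTerm = maxTerm.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Returns false only when the term lies outside [minTerm, maxTerm] and
     * therefore cannot exist in this segment. A true result is not a hit:
     * the caller still has to perform the real term lookup.
     */
    boolean mightContain(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        // Unsigned lexicographic comparison, matching Lucene's term order.
        return Arrays.compareUnsigned(target, minTerm) >= 0
            && Arrays.compareUnsigned(target, maxTerm) <= 0;
    }
}
```

A caller would consult mightContain() for each segment before the exact seek and skip the segment when it returns false. The check only pays off when segments cover different _id ranges; if every segment spans the full range, nothing is skipped.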
> *Tests*
> I have run a write benchmark using _id values in UUID V1 format, and the 
> results are as follows:
> ||Branch||Write speed after 4h||CPU cost after 4h||Overall improvement after 4h||Write speed after 8h||CPU cost after 8h||Overall improvement after 8h||
> |Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
> |Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|
> As shown above, after 8 hours of continuous writing, write speed improves by 
> 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%. 
> The Elasticsearch GET API and ids query would get similar performance 
> improvements.
> Note that the benchmark needs to run continuously for several hours, 
> because the performance improvement is not obvious when the data is 
> completely cached or the number of segments is too small.
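The percentages in the benchmark table can be recomputed from the raw figures. The sketch below assumes each bracketed value is the relative change against the Original Lucene row, rounded to one decimal, and that "Overall improvement" is the rounded speed gain plus the rounded CPU reduction. The second assumption is inferred from the numbers (15.4 + 6.7 = 22.1 and 18.0 + 7.7 = 25.7), not stated in the issue. BenchmarkDeltas and its methods are hypothetical names.

```java
/** Hypothetical helper that recomputes the percentages in the benchmark table. */
final class BenchmarkDeltas {

    /** Relative change from baseline to optimised, in percent, rounded to one decimal. */
    static double percentChange(double baseline, double optimised) {
        double pct = (optimised - baseline) / baseline * 100.0;
        return Math.round(pct * 10.0) / 10.0;
    }

    /**
     * Assumed definition of "overall improvement": the speed gain plus the
     * CPU reduction (cpuPct is negative when CPU cost dropped).
     */
    static double overall(double speedPct, double cpuPct) {
        return Math.round((speedPct - cpuPct) * 10.0) / 10.0;
    }
}
```

For the 8h columns, percentChange(26.7, 31.5) and percentChange(66.6, 61.5) give the speed and CPU deltas, and overall(...) combines them under the stated assumption.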



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
