[
https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guoqiang Jiang updated LUCENE-8980:
-----------------------------------
Description:
*Description*
In Elasticsearch, which is based on Lucene, each document has an _id field that
uniquely identifies it and is indexed so that documents can be looked up
from Lucene. When users write to Elasticsearch with self-generated _id values,
Elasticsearch has to check _id uniqueness through the Lucene API for each
document, even if the conflict rate is very low, which results in poor write
performance.
*Solution*
As Lucene stores min/maxTerm metrics for each segment and field, we can use
those metrics to optimise the performance of the Lucene lookup API. When calling
SegmentTermsEnum.seekExact() to look up a term in one segment, we can first check
whether the term falls within the range of minTerm and maxTerm, so that we can
skip segments that cannot contain the term as early as possible.
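The range check described above can be sketched in plain Java. This is a minimal illustration, not the actual patch: SegmentTermRange and mayContain are hypothetical names, and the byte arrays stand in for the per-segment minTerm/maxTerm values (which Lucene exposes through Terms.getMin() and Terms.getMax()). It assumes terms are compared as unsigned bytes, which is how Lucene orders them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for one segment's min/max term statistics for a field. */
final class SegmentTermRange {
    private final byte[] minTerm; // smallest term stored in the segment for this field
    private final byte[] maxTerm; // largest term stored in the segment for this field

    SegmentTermRange(String minTerm, String maxTerm) {
        this.minTerm = minTerm.getBytes(StandardCharsets.UTF_8);
        this.maxTerm = maxTerm.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Returns false when the term lies outside [minTerm, maxTerm], in which case
     * the segment cannot contain it and the seek can be skipped entirely.
     * A true result only means the term may be present; a real seek is still needed.
     */
    boolean mayContain(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        // Unsigned lexicographic comparison mirrors Lucene's term order.
        return Arrays.compareUnsigned(target, minTerm) >= 0
            && Arrays.compareUnsigned(target, maxTerm) <= 0;
    }
}
```

For a segment whose _id terms span "doc-100" to "doc-199", mayContain("doc-200") returns false, so a uniqueness check for that _id would not need to touch the segment's terms dictionary at all. The check is cheap because it needs only two byte comparisons against values that are already held per segment and field.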
was:
*Description*
In Elasticsearch, each document has an _id field that uniquely identifies it
and is indexed so that documents can be looked up from Lucene. When users
write to Elasticsearch with self-generated _id values, Elasticsearch has to
check _id uniqueness through the Lucene API for each document, even if the
conflict rate is very low, which results in poor write performance.
*Solution*
1. Choose a better _id generator before writing to ES
Different _id formats have a great impact on write performance. We have
verified this in a production cluster. Users can refer to the following blog
post and choose a better _id generator.
[http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]
2. Optimise with min/maxTerm metrics in Lucene
As Lucene stores min/maxTerm metrics for each segment and field, we can use
those metrics to optimise the performance of the Lucene lookup API. When calling
SegmentTermsEnum.seekExact() to look up a term in one segment, we can first check
whether the term falls within the range of minTerm and maxTerm, so that we can
skip segments that cannot contain the term as early as possible.
*Tests*
I ran a write benchmark using _id values in UUID V1 format. The results are
as follows:
||Branch||Write speed after 4h||CPU cost||Overall improvement||Write speed after 8h||CPU cost||Overall improvement||
|Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
|Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|
As shown above, after 8 hours of continuous writing, write speed improves by
18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%.
The Elasticsearch GET API and ids query would get similar performance
improvements.
It should be noted that the benchmark needs to run continuously for several
hours, because the performance improvement is not obvious when the data is
completely cached or the number of segments is too small.
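The percentages in the benchmark results can be re-derived from the raw figures. The sketch below is an illustration only: BenchmarkDeltas and its methods are hypothetical names, and the rule that "Overall improvement" is the speed gain plus the CPU reduction, each rounded to one decimal place, is an assumption inferred from the reported numbers rather than something stated in the issue.

```java
/** Hypothetical helper for re-deriving the reported benchmark percentages. */
final class BenchmarkDeltas {
    /** Relative change from baseline to optimised, in percent, rounded to one decimal. */
    static double roundedPercentChange(double baseline, double optimised) {
        double percent = (optimised - baseline) / baseline * 100.0;
        return Math.round(percent * 10.0) / 10.0;
    }

    /**
     * Assumed definition of "Overall improvement": the write-speed gain plus the
     * CPU-cost reduction, both taken as rounded percentages.
     */
    static double overall(double speedBase, double speedOpt, double cpuBase, double cpuOpt) {
        return roundedPercentChange(speedBase, speedOpt) - roundedPercentChange(cpuBase, cpuOpt);
    }
}
```

Under that assumption, BenchmarkDeltas.overall(26.7, 31.5, 66.6, 61.5) evaluates to 25.7 (up to floating-point error) for the 8-hour run, and BenchmarkDeltas.overall(29.9, 34.5, 68.4, 63.8) to 22.1 for the 4-hour run, matching the reported values.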
> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
> Key: LUCENE-8980
> URL: https://issues.apache.org/jira/browse/LUCENE-8980
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 8.2
> Reporter: Guoqiang Jiang
> Assignee: David Wayne Smiley
> Priority: Major
> Labels: performance
> Fix For: master (9.0)
>
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is based on Lucene, each document has an _id field
> that uniquely identifies it and is indexed so that documents can be looked
> up from Lucene. When users write to Elasticsearch with self-generated _id
> values, Elasticsearch has to check _id uniqueness through the Lucene API for
> each document, even if the conflict rate is very low, which results in poor
> write performance.
>
> *Solution*
> As Lucene stores min/maxTerm metrics for each segment and field, we can use
> those metrics to optimise the performance of the Lucene lookup API. When
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can
> first check whether the term falls within the range of minTerm and maxTerm,
> so that we can skip segments that cannot contain the term as early as possible.
>