[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley reassigned LUCENE-8980:
------------------------------------

    Assignee: David Smiley

> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
>                 Key: LUCENE-8980
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8980
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: Guoqiang Jiang
>            Assignee: David Smiley
>            Priority: Major
>              Labels: performance
>             Fix For: master (9.0)
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, each document has an _id field that uniquely identifies it,
> which is indexed so that documents can be looked up from Lucene. When users
> write to Elasticsearch with self-generated _id values, Elasticsearch has to
> check _id uniqueness through the Lucene API for each document, even if the
> conflict rate is very low, which results in poor write performance.
>
> *Solution*
> 1. Choose a better _id generator before writing to ES
> Different _id formats have a great impact on write performance. We have
> verified this in a production cluster. Users can refer to the following blog
> and choose a better _id generator.
> [http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html]
> 2. Optimise with min/maxTerm metrics in Lucene
> As Lucene stores min/maxTerm metrics for each segment and field, we can use
> those metrics to optimise the performance of the Lucene lookup API. When
> calling SegmentTermsEnum.seekExact() to look up a term in one segment, we can
> first check whether the term falls within the range of minTerm and maxTerm,
> so that we can skip segments that cannot contain the term as early as
> possible.
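
The range check described in solution 2 can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual patch: the class `MinMaxTermCheck` and its methods are hypothetical names, and plain byte arrays stand in for Lucene terms. In Lucene itself the per-field bounds are available through `Terms.getMin()` and `Terms.getMax()`; the sketch assumes terms are ordered by unsigned byte comparison, as Lucene does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Lucene) illustrating a min/max term pre-check. */
public class MinMaxTermCheck {

    /**
     * Returns false when the term lies outside [minTerm, maxTerm], which means
     * the segment cannot contain it and the full lookup can be skipped.
     * Returns true when the term is inside the range; the term may or may not
     * exist, so a real seekExact() would still be needed.
     */
    static boolean mightContain(byte[] minTerm, byte[] maxTerm, byte[] term) {
        // Compare as unsigned bytes, mirroring Lucene's term ordering.
        return Arrays.compareUnsigned(term, minTerm) >= 0
            && Arrays.compareUnsigned(term, maxTerm) <= 0;
    }

    /** Convenience conversion used only by this example. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed bounds for one segment's _id field (illustrative values).
        byte[] min = utf8("doc-100");
        byte[] max = utf8("doc-199");

        for (String id : new String[] {"doc-099", "doc-150", "doc-250"}) {
            boolean candidate = mightContain(min, max, utf8(id));
            System.out.println(id + " -> " + (candidate ? "seek in segment" : "skip segment"));
        }
    }
}
```

The check costs two byte-array comparisons per segment, so it only pays off when many segments can be ruled out, for example when _id values are roughly time-ordered and each segment covers a narrow range.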
>
> *Tests*
> I ran a write benchmark using _id values in UUID V1 format. The results are
> as follows:
> ||Branch||Write speed after 4h||CPU cost (4h)||Overall improvement (4h)||Write speed after 8h||CPU cost (8h)||Overall improvement (8h)||
> |Original Lucene|29.9w/s|68.4%|N/A|26.7w/s|66.6%|N/A|
> |Optimised Lucene|34.5w/s (+15.4%)|63.8% (-6.7%)|+22.1%|31.5w/s (+18.0%)|61.5% (-7.7%)|+25.7%|
> As shown above, after 8 hours of continuous writing, write speed improves by
> 18.0%, CPU cost decreases by 7.7%, and overall performance improves by 25.7%.
> The Elasticsearch GET API and ids query should see similar performance
> improvements.
> Note that the benchmark needs to run continuously for several hours, because
> the improvement is not obvious when the data is completely cached or the
> number of segments is too small.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org