jefferyyuan created LUCENE-8662:
-----------------------------------
Summary: Override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum
Key: LUCENE-8662
URL: https://issues.apache.org/jira/browse/LUCENE-8662
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: 7.6, 6.6.5, 5.5.5, 8.0
Reporter: jefferyyuan
Fix For: 8.0, 7.7
Recently, in production, we found that Solr uses a lot of memory (more than
10 GB) during recovery or commit for a small index (3.5 GB).
The stack trace is:
{code:java}
Thread 0x4d4b115c0
  at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125)
  at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V (SegmentTermsEnumFrame.java:157)
  at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:786)
  at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:538)
  at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnum.java:757)
  at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (FilterLeafReader.java:185)
  at org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z (TermsEnum.java:74)
  at org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J (SolrIndexSearcher.java:823)
  at org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:204)
  at org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (UpdateLog.java:786)
  at org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:194)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z (DistributedUpdateProcessor.java:1051)
{code}
We reproduced the problem locally with the following code, which uses only
Lucene APIs:
{code:java}
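import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class SeekExactRepro { // class name is arbitrary
  public static void main(String[] args) throws IOException {
    // ExitableDirectoryReader wraps each leaf reader, so term lookups go
    // through FilterLeafReader.FilterTermsEnum.
    try (FSDirectory index = FSDirectory.open(Paths.get("the-index"));
        IndexReader reader = new ExitableDirectoryReader(
            DirectoryReader.open(index), new QueryTimeoutImpl(1000 * 60 * 5))) {
      String id = "the-id";
      BytesRef text = new BytesRef(id);
      for (LeafReaderContext lf : reader.leaves()) {
        Terms terms = lf.reader().terms("id");
        if (terms == null) continue; // this segment has no "id" field
        TermsEnum te = terms.iterator();
        System.out.println(te.seekExact(text));
      }
    }
  }
}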
{code}
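We found the root cause: seekExact(BytesRef) is not overridden in
FilterLeafReader.FilterTermsEnum, so it falls back to the base class
TermsEnum.seekExact(BytesRef) implementation. As the stack trace shows, that
implementation goes through seekCeil(BytesRef) and ends up in
SegmentTermsEnumFrame.loadBlock(), which is very inefficient in this case.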
{code:java}
public boolean seekExact(BytesRef text) throws IOException {
  return seekCeil(text) == SeekStatus.FOUND;
}
{code}
The fix is simple: override seekExact(BytesRef) in
FilterLeafReader.FilterTermsEnum so that it delegates to the wrapped enum:
{code:java}
@Override
public boolean seekExact(BytesRef text) throws IOException {
  return in.seekExact(text);
}
{code}
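To make the delegation problem concrete without needing an index, here is a minimal, Lucene-free sketch. All class names in it (ToyTermsEnum, CountingEnum, FilterWithoutOverride, FilterWithOverride) are hypothetical stand-ins, and seekCeil is simplified to return a boolean instead of SeekStatus. It only counts which method of the wrapped enum gets called; it does not measure memory or model the block tree.

```java
import java.util.Arrays;
import java.util.List;

public class SeekExactDelegationDemo {

    /** Toy stand-in for TermsEnum: seekExact falls back to seekCeil by default. */
    abstract static class ToyTermsEnum {
        /** Simplified: returns true only when the exact term exists. */
        abstract boolean seekCeil(String text);

        boolean seekExact(String text) {
            return seekCeil(text);
        }
    }

    /** Toy stand-in for a codec enum that has its own exact-match path. */
    static class CountingEnum extends ToyTermsEnum {
        private final List<String> terms;
        int seekCeilCalls = 0;
        int seekExactCalls = 0;

        CountingEnum(String... terms) {
            this.terms = Arrays.asList(terms);
        }

        @Override
        boolean seekCeil(String text) {
            seekCeilCalls++;
            return terms.contains(text);
        }

        @Override
        boolean seekExact(String text) {
            seekExactCalls++;
            return terms.contains(text);
        }
    }

    /** Filter that forwards only seekCeil, so seekExact uses the base fallback. */
    static class FilterWithoutOverride extends ToyTermsEnum {
        final ToyTermsEnum in;

        FilterWithoutOverride(ToyTermsEnum in) {
            this.in = in;
        }

        @Override
        boolean seekCeil(String text) {
            return in.seekCeil(text);
        }
    }

    /** Filter that also forwards seekExact, mirroring the proposed fix. */
    static class FilterWithOverride extends FilterWithoutOverride {
        FilterWithOverride(ToyTermsEnum in) {
            super(in);
        }

        @Override
        boolean seekExact(String text) {
            return in.seekExact(text);
        }
    }

    public static void main(String[] args) {
        CountingEnum a = new CountingEnum("the-id");
        new FilterWithoutOverride(a).seekExact("the-id");
        System.out.println("without override: seekCeil=" + a.seekCeilCalls
            + " seekExact=" + a.seekExactCalls);

        CountingEnum b = new CountingEnum("the-id");
        new FilterWithOverride(b).seekExact("the-id");
        System.out.println("with override: seekCeil=" + b.seekCeilCalls
            + " seekExact=" + b.seekExactCalls);
    }
}
```

Without the override, the wrapped enum only ever sees seekCeil, so any cheaper exact-match path it has is bypassed. With the override, the call reaches the wrapped enum's own seekExact.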