[ https://issues.apache.org/jira/browse/SOLR-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simon Rosenthal resolved SOLR-10840.
------------------------------------
Resolution: Cannot Reproduce

After moving our production Solr server to a new AWS instance, the problem
disappeared. Heaven knows why.

> Random Index Corruption during bulk indexing
> --------------------------------------------
>
>                 Key: SOLR-10840
>                 URL: https://issues.apache.org/jira/browse/SOLR-10840
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: update
>    Affects Versions: 6.3, 6.5.1
>         Environment: AWS EC2 instance running CentOS 7
>            Reporter: Simon Rosenthal
>
> I'm seeing a randomly occurring Index Corruption exception during a Solr data
> ingest. This can occur anywhere during the 7-8 hours our ingests take. I'm
> initially submitting this as a Solr bug as this is the environment I'm
> using, but it does look as though the error is occurring in Lucene code.
> Some background:
>     AWS EC2 server running CentOS 7
>     java.runtime.version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
>     Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with
> Solr 5.4 (which I can't go back to). Oddly enough, I ran Solr 6.3.0
> uneventfully for several weeks before this problem first occurred.
>     Standalone (non cloud) environment.
>     Our indexing subsystem is a complex Python script which creates multiple
> indexing subprocesses in order to make use of multiple cores. Each subprocess
> reads records from a MySQL database, does some significant preprocessing and
> sends a batch of documents (defaults to 500) to the Solr update handler
> (using the Python 'scorched' module). Each content source (there are 5-6)
> requires a separate instantiation of the script, and these are wrapped in a
> Bash script to run serially.
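For readers who want a concrete picture of the batching step described in the report, here is a minimal sketch. It is not the reporter's script: it uses only the Python standard library rather than the 'scorched' module, and the function names (`batches`, `post_batch`) and the update URL in the usage note are hypothetical. It assumes the core accepts a JSON array of documents on its `/update` handler.

```python
import json
import urllib.request
from itertools import islice


def batches(records, size=500):
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_batch(docs, update_url):
    """POST one batch of documents (a list of dicts) to a Solr update handler.

    `update_url` is an assumption, e.g. a core's /update endpoint.
    Returns the HTTP status code; urllib raises on HTTP error responses.
    """
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A subprocess in such a pipeline might loop with `for chunk in batches(rows): post_batch(chunk, "http://localhost:8983/solr/stresstest1/update")`, where the host and port are placeholders. Keeping the batching logic separate from the HTTP call makes it easy to resume from a checkpoint after a failed batch.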
>
> When the exception occurs, we always see something like the following in
> the solr.log:
>
> ERROR - 2017-06-06 14:37:34.639; [ x:stresstest1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id med-27840-00384802 to the index; possible analysis error.
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178
>     ...
> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
>     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
>     ... 42 more
> Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
>     at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
>     at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>     at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
>     at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
>     at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
>     at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
>     at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
>     at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
>     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>     Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
>         at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
>         at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
>         ... 12 more
>
> This is usually followed in very short order by similar exceptions as
> other UpdateHandler threads hit the same IOException.
>
> I've also seen what I assume is a related error during autocommits:
>
> INFO - 2017-05-30 18:01:35.264; [ x:build0530] org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> ERROR - 2017-05-30 18:01:36.884; [ x:build0530] org.apache.solr.common.SolrException; auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
>     at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-2060254071 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
>     at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
>     at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
>     at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
>     at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
>     ... 10 more
>
> Other Observations:
>
>     It's not associated with a specific Lucene index file type (the
> 'read past EOF' has been reported on .fnm, .nvm, .tip, .si, .dvm files).
>     I've configured merges to use the LogByteSizeMergePolicyFactory and
> TieredMergePolicyFactory, and I see the failure with either.
>     The file system (~600 GB) is never more than 50% full, so disk space is
> not an issue.
>     I've seen this occur with indexes on both ext4 and xfs file systems
> (which have been fsck'ed/repaired, and we're not seeing any hardware
> problems reported in the system logs). These file systems are all on SSDs.
>     Solr is started with a 5 GB heap and I haven't seen heap usage above 3 GB;
> also, there is no concurrent query activity during the indexing process.
>     I can recover from this error by unloading the core, running fixindex
> (which reports no errors), reloading the core, and continuing indexing from
> a checkpoint in the indexing script.
>
> I've created a test rig (in Python) which can be run independently of our
> environment and workflow, and have managed to get this to throw the exception
> (the first stack trace above is from a run with that).
>
> My semi-informed guess is that this is due to a race condition between
> segment merges and index updates...
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
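The observation in the report that the failure is not tied to one index file type can be checked mechanically against a log. The following is a rough sketch, not something the reporter describes using: it scans log text for the `MMapIndexInput(path="...")` pattern that appears in the stack traces above and tallies the distinct affected files by extension. The function name `affected_extensions` is hypothetical, and the pattern is assumed from the two traces quoted in the report.

```python
import re
from collections import Counter

# Matches the resource description seen in the quoted Lucene exceptions.
PATH_RE = re.compile(r'MMapIndexInput\(path="([^"]+)"\)')


def affected_extensions(log_text):
    """Count distinct index files named in the log, grouped by file extension."""
    counts = Counter()
    # A set collapses repeats of the same file (e.g. cause plus suppressed exception).
    for path in set(PATH_RE.findall(log_text)):
        name = path.rsplit("/", 1)[-1]
        ext = "." + name.rsplit(".", 1)[-1] if "." in name else ""
        counts[ext] += 1
    return counts
```

Running this over the solr.log from several failed ingests would give a per-extension tally that can be compared with the list of file types in the report (.fnm, .nvm, .tip, .si, .dvm).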