[
https://issues.apache.org/jira/browse/SOLR-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simon Rosenthal resolved SOLR-10840.
------------------------------------
Resolution: Cannot Reproduce
After moving our production Solr server to a new AWS instance, the problem
disappeared. Heaven knows why.
> Random Index Corruption during bulk indexing
> --------------------------------------------
>
> Key: SOLR-10840
> URL: https://issues.apache.org/jira/browse/SOLR-10840
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: update
> Affects Versions: 6.3, 6.5.1
> Environment: AWS EC2 instance running CentOS 7
> Reporter: Simon Rosenthal
>
> I'm seeing a randomly occurring index corruption exception during a Solr data
> ingest. This can occur anywhere during the 7-8 hours our ingests take. I'm
> initially submitting this as a Solr bug as this is the environment I'm
> using, but it does look as though the error is occurring in Lucene code.
> Some background:
> AWS EC2 server running CentOS 7
> java.runtime.version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
> Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with
> Solr 5.4 (which I can't go back to). Oddly enough, I ran Solr 6.3.0
> uneventfully for several weeks before this problem first occurred.
> Standalone (non-cloud) environment.
> Our indexing subsystem is a complex Python script which creates multiple
> indexing subprocesses in order to make use of multiple cores. Each subprocess
> reads records from a MySQL database, does some significant preprocessing and
> sends a batch of documents (defaults to 500) to the Solr update handler
> (using the Python 'scorched' module). Each content source (there are 5-6)
> requires a separate instantiation of the script, and these wrapped in a Bash
> script to run serially.
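>
> In outline, each indexing subprocess does something like the sketch below. This
> is heavily simplified and is not the production code: the core URL, pool size,
> and the fetch_batches() helper are placeholders, and the real script does far
> more preprocessing.
>
>     # One run of the script handles a single content source; its records are
>     # split across worker subprocesses and pushed to Solr via 'scorched'.
>     import multiprocessing
>     import scorched
>
>     SOLR_URL = "http://localhost:8983/solr/stresstest1"   # placeholder core URL
>     BATCH_SIZE = 500
>
>     def fetch_batches(shard_id, batch_size):
>         """Placeholder: yield lists of preprocessed documents (dicts) from MySQL."""
>         raise NotImplementedError
>
>     def index_shard(shard_id):
>         si = scorched.SolrInterface(SOLR_URL)
>         for batch in fetch_batches(shard_id, BATCH_SIZE):
>             si.add(batch)            # one POST to the update handler per batch
>         si.commit()
>
>     if __name__ == "__main__":
>         # worker subprocesses let one script instantiation use multiple cores
>         with multiprocessing.Pool(processes=4) as pool:
>             pool.map(index_shard, range(4))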
>
> When the exception occurs, we always see something like the following in
> solr.log:
>
> ERROR - 2017-06-06 14:37:34.639; [ x:stresstest1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id med-27840-00384802 to the index; possible analysis error.
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>         ...
> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
>         at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>         at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
>         ... 42 more
> Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
>         at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
>         at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>         at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
>         at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
>         at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
>         at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
>         at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
>         at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>         at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>         at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>         at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>         at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>         at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
>         at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>         Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
>                 at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
>                 at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
>                 ... 12 more
>
> This is usually followed in very short order by similar exceptions as
> other UpdateHandler threads hit the same IOException.
>
> I've also seen what I assume is a related error during autocommits:
>
> INFO - 2017-05-30 18:01:35.264; [ x:build0530] org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> ERROR - 2017-05-30 18:01:36.884; [ x:build0530] org.apache.solr.common.SolrException; auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
>         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-2060254071 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
>         at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
>         at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
>         at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
>         at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
>         at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
>         at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
>         at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
>         at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
>         at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>         at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>         at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>         at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>         at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>         at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
>         at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
>         at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
>         at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
>         at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
>         at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
>         ... 10 more
>
> Other Observations:
>
> It's not associated with a specific Lucene index file type (the
> 'read past EOF' has been reported on .fnm, .nvm, .tip, .si, and .dvm files).
> I've configured merges with both the LogByteSizeMergePolicyFactory and the
> TieredMergePolicyFactory, and I see the failure with either.
> The file system (~600 GB) is never more than 50% full, so disk space is
> not an issue.
> I've seen this occur with indexes on both ext4 and xfs file systems
> (which have been fsck'ed/repaired, and we're not seeing any hardware
> problems reported in the system logs). These file systems are all on SSDs.
> Solr is started with a 5 GB heap and I haven't seen heap usage above 3 GB;
> also, there is no concurrent query activity during the indexing process.
> I can recover from this error by unloading the core, running fixindex
> (which reports no errors), reloading the core, and continuing indexing from
> a checkpoint in the indexing script.
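>
> For completeness, that recover-and-resume cycle looks roughly like the sketch
> below. It is not our actual script: the core name, instanceDir, and jar path are
> placeholders, and the "fixindex" step is represented here by Lucene's CheckIndex
> tool (which is read-only unless -exorcise is added to drop unreadable segments).
>
>     # Rough sketch of the recovery cycle: unload the core, check the index,
>     # re-register the core, then resume indexing from the last checkpoint.
>     import subprocess
>     import requests
>
>     ADMIN = "http://localhost:8983/solr/admin/cores"
>     CORE = "stresstest1"                                    # placeholder
>     INSTANCE_DIR = "/indexes/solrcores/stresstest1"         # placeholder
>     INDEX_DIR = "/indexes/solrindexes/stresstest1/index"
>     LUCENE_JAR = "/opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.3.0.jar"
>
>     # UNLOAD leaves the index files on disk by default
>     requests.get(ADMIN, params={"action": "UNLOAD", "core": CORE}).raise_for_status()
>
>     # run CheckIndex against the index directory
>     subprocess.check_call(["java", "-cp", LUCENE_JAR,
>                            "org.apache.lucene.index.CheckIndex", INDEX_DIR])
>
>     # re-register the core, then restart the indexing script from its checkpoint
>     requests.get(ADMIN, params={"action": "CREATE", "name": CORE,
>                                 "instanceDir": INSTANCE_DIR}).raise_for_status()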
>
> I've created a test rig (in Python) which can be run independently of our
> environment and workflow, and have managed to get this to throw the exception
> (the first stack trace above is from a run with that).
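>
> For illustration, the general shape of such a rig is sketched below. This is not
> the actual rig (URL, field names, and document counts are made up): several
> processes post batches of synthetic JSON documents to the core's /update handler
> with commitWithin set low, so flushes and merges are constantly in flight.
>
>     # Toy stress rig: several processes hammer /update with synthetic docs.
>     import json
>     import multiprocessing
>     import uuid
>     import requests
>
>     UPDATE_URL = "http://localhost:8983/solr/stresstest1/update"   # placeholder
>
>     def worker(worker_id, batches=1000, batch_size=500):
>         session = requests.Session()
>         for b in range(batches):
>             docs = [{"id": str(uuid.uuid4()),
>                      "text_t": "filler %d-%d-%d" % (worker_id, b, i)}
>                     for i in range(batch_size)]
>             r = session.post(UPDATE_URL, params={"commitWithin": "10000"},
>                              data=json.dumps(docs),
>                              headers={"Content-Type": "application/json"})
>             r.raise_for_status()
>
>     if __name__ == "__main__":
>         procs = [multiprocessing.Process(target=worker, args=(n,)) for n in range(6)]
>         for p in procs:
>             p.start()
>         for p in procs:
>             p.join()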
>
> My semi-informed guess is that this is due to a race condition between
> segment merges and index updates...
>
>
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]