[ 
https://issues.apache.org/jira/browse/SOLR-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Rosenthal resolved SOLR-10840.
------------------------------------
    Resolution: Cannot Reproduce

After moving our production Solr server to a new AWS instance, the problem 
disappeared. Heaven knows why.

> Random Index Corruption during bulk indexing
> --------------------------------------------
>
>                 Key: SOLR-10840
>                 URL: https://issues.apache.org/jira/browse/SOLR-10840
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update
>    Affects Versions: 6.3, 6.5.1
>         Environment: AWS EC2 instance running Centos 7
>            Reporter: Simon Rosenthal
>
> I'm seeing a randomly occurring index corruption exception during a Solr data 
> ingest. It can occur anywhere during the 7-8 hours our ingests take. I'm 
> initially submitting this as a Solr bug since that is the environment I'm 
> using, but the error does appear to originate in Lucene code.
> Some background:
>     AWS EC2 server running CentOS 7
>     java.runtime.version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
>     Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with 
> Solr 5.4 (which I can't go back to). Oddly enough, I ran Solr 6.3.0 
> uneventfully for several weeks before this problem first occurred.
>     Standalone (non-cloud) environment.
>     Our indexing subsystem is a complex Python script which creates multiple 
> indexing subprocesses in order to make use of multiple cores. Each subprocess 
> reads records from a MySQL database, does some significant preprocessing, and 
> sends a batch of documents (500 by default) to the Solr update handler 
> (using the Python 'scorched' module). Each content source (there are 5-6) 
> requires a separate instantiation of the script, and these are wrapped in a 
> Bash script to run serially.
>     
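>     For reference, here is a minimal sketch of what each indexing subprocess 
> does with scorched. It is illustrative only: the core URL, row layout, and 
> field names are placeholders, not our actual schema.
>     
>     import scorched
>     
>     # Placeholder core URL; the real host and core name differ.
>     si = scorched.SolrInterface("http://localhost:8983/solr/stresstest1")
>     
>     def index_batch(rows):
>         # 'rows' are dicts produced by the MySQL read + preprocessing step;
>         # the field names below are made up for this example.
>         docs = [{"id": r["doc_id"], "title": r["title"], "body": r["text"]}
>                 for r in rows]
>         si.add(docs)  # POST one batch (default 500 docs) to the update handler
>     
>     # Commits are left to Solr's autocommit/softCommit settings.
>     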
>     When the exception occurs, we always see something like the following in 
> the solr.log
>     
>     ERROR - 2017-06-06 14:37:34.639; [   x:stresstest1] 
> org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: 
> Exception writing document id med-27840-00384802 to the index; possible 
> analysis error.
>     at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>     ...
>     Caused by: org.apache.lucene.store.AlreadyClosedException: this 
> IndexWriter is closed
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
>     at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
>     at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
>     at 
> org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
>     at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>     at 
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
>     ... 42 more
> Caused by: java.io.EOFException: read past EOF: 
> MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
>     at 
> org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
>     at 
> org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>     at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
>     at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
>     at 
> org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
>     at 
> org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
>     at 
> org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at 
> org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
>     at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>     Suppressed: org.apache.lucene.index.CorruptIndexException: checksum 
> status indeterminate: remaining=0, please run checkindex for more details 
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
>         at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
>         at 
> org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
>         ... 12 more
>          
>     This is usually followed in very short order by similar exceptions as 
> other UpdateHandler threads hit the same IOException.
>     
>     I've also seen what I assume is a related error during autocommits:
>     
> INFO  - 2017-05-30 18:01:35.264; [   x:build0530] 
> org.apache.solr.update.DirectUpdateHandler2; start 
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> ERROR - 2017-05-30 18:01:36.884; [   x:build0530] 
> org.apache.solr.common.SolrException; auto commit 
> error...:org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
>     at 
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
>     at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer 
> mismatch (file truncated?): actual footer=-2060254071 vs expected 
> footer=-1071082520 
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
>     at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
>     at 
> org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
>     at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
>     at 
> org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
>     at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
>     at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
>     at 
> org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
>     at 
> org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
>     at 
> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
>     at 
> org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
>     at 
> org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
>     at 
> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
>     at 
> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
>     at 
> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
>     ... 10 more
>     
>     Other Observations:
>     
>         It's not associated with a specific Lucene index file type (the 
> 'read past EOF' has been reported on .fnm, .nvm, .tip, .si, and .dvm files).
>         I've configured merges to use the LogByteSizeMergePolicyFactory and 
> the TieredMergePolicyFactory, and I see the failure with either.
>         The file system (~600 GB) is never more than 50% full, so disk space 
> is not an issue.
>         I've seen this occur with indexes on both ext4 and xfs file systems 
> (which have been fsck'ed/repaired, and we're not seeing any hardware 
> problems reported in the system logs). These file systems are all on SSDs.
>         Solr is started with a 5 GB heap and I haven't seen heap usage above 
> 3 GB; also, there is no concurrent query activity during the indexing process.
>         I can recover from this error by unloading the core, running fixindex 
> (which reports no errors), reloading the core, and continuing indexing from 
> a checkpoint in the indexing script (see the CheckIndex sketch below).
>         
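>     The 'fixindex' step above is presumably a wrapper around Lucene's 
> CheckIndex tool (the tool the checksum error message suggests running). For 
> completeness, a hedged sketch of invoking it from Python; the classpath and 
> index path are assumptions (a stock Solr install plus the index path from 
> the stack traces), not exact values from our setup.
>     
>     import subprocess
>     
>     # Assumed locations: the Lucene jars bundled with Solr, and the index
>     # directory seen in the 'read past EOF' traces above.
>     LUCENE_CP = "/opt/solr/server/solr-webapp/webapp/WEB-INF/lib/*"
>     INDEX_DIR = "/indexes/solrindexes/stresstest1/index"
>     
>     # Read-only integrity check. Adding "-exorcise" would drop unreadable
>     # segments (losing their documents), so it is omitted here.
>     subprocess.run(
>         ["java", "-cp", LUCENE_CP,
>          "org.apache.lucene.index.CheckIndex", INDEX_DIR],
>         check=True,
>     )
>     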
>     I've created a test rig (in Python) which can be run independently of our 
> environment and workflow, and have managed to get this to throw the exception 
> (the first stack trace above is from a run with that).
>     
>     My semi-informed guess is that this is due to a race condition between 
> segment merges and index updates...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
