[ https://issues.apache.org/jira/browse/SOLR-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550543#comment-14550543 ]
Mark Miller commented on SOLR-7511:
-----------------------------------
bq. – keep the corrupt index around in case the user wants to attempt data recovery.
+1 on the idea, but then this can be hundreds of gigabytes of index. Who cleans these copies up? What happens when they fill the drive after a few random corruptions? At the very least, this should be configurable.
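A minimal sketch of the kind of bounded retention this would need (purely hypothetical; the class, the directory naming, and the {{maxRetainedCopies}} setting are not existing Solr code or configuration): move the corrupt index aside for possible recovery, but keep at most N copies per core so repeated corruptions cannot fill the drive.
{code:java}
// Hypothetical sketch only -- not an existing Solr class or config option.
// Idea: instead of deleting a corrupt index, move it aside for possible data
// recovery, but keep at most maxRetainedCopies copies per core so repeated
// corruptions cannot fill the drive. The cap is assumed to come from config.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CorruptIndexRetention {

  private final int maxRetainedCopies; // assumed to be read from configuration

  public CorruptIndexRetention(int maxRetainedCopies) {
    this.maxRetainedCopies = maxRetainedCopies;
  }

  /** Moves the corrupt index dir to index.corrupt.&lt;timestamp&gt; and prunes old copies. */
  public Path retain(Path indexDir) throws IOException {
    Path aside = indexDir.resolveSibling("index.corrupt." + System.currentTimeMillis());
    Files.move(indexDir, aside);
    pruneOldCopies(indexDir.getParent());
    return aside;
  }

  /** Deletes the oldest retained copies once the cap is exceeded. */
  private void pruneOldCopies(Path coreDataDir) throws IOException {
    List<Path> copies = new ArrayList<Path>();
    try (DirectoryStream<Path> ds = Files.newDirectoryStream(coreDataDir, "index.corrupt.*")) {
      for (Path p : ds) {
        copies.add(p);
      }
    }
    // Millisecond timestamps of equal digit count sort lexicographically, oldest first.
    Collections.sort(copies);
    while (copies.size() > maxRetainedCopies) {
      deleteRecursively(copies.remove(0)); // drop the oldest copy first
    }
  }

  private void deleteRecursively(Path dir) throws IOException {
    Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }
      @Override
      public FileVisitResult postVisitDirectory(Path d, IOException exc) throws IOException {
        Files.delete(d);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
{code}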
> Unable to open searcher when chaosmonkey is actively restarting solr and data
> nodes
> -----------------------------------------------------------------------------------
>
> Key: SOLR-7511
> URL: https://issues.apache.org/jira/browse/SOLR-7511
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10.3
> Reporter: Hrishikesh Gadre
>
> I have a working chaos-monkey setup which periodically kills (and restarts) Solr
> and data nodes in a round-robin fashion. I wrote a simple Solr client to
> periodically index and query a bunch of documents (see the sketch after this
> quoted report). After running the test for some time, Solr returns an incorrect
> number of documents. In the background, I see the following errors:
> org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
>     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
>     ... 8 more
> Caused by: java.io.EOFException: read past EOF
>     at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
>     at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
>     at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>     at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
>     at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
>     at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
>     at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
>     at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
>     at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
>     at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
>     at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
> The issue here is that the index state for one of the replicas is corrupt
> (verified using the Lucene CheckIndex tool), so Solr is not able to load the
> core on that particular instance.
> Interestingly, when the other (sane) replica comes online, it tries to do a
> peer-sync with this failing replica; when that fails, it also moves into the
> recovering state. As a result this particular shard is completely unavailable
> for read/write requests. Here are sample log entries from the sane replica:
> Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
>     at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
>     at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
> 2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
> 2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
> 2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
> I am able to reproduce this problem consistently.
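For reference, a minimal SolrJ 4.x sketch of the kind of client the reporter describes (the ZooKeeper address, collection name, field names, batch size, and sleep interval are assumptions, not details from the report): it repeatedly indexes a batch of documents, commits, and checks whether the reported document count matches what has been indexed so far.
{code:java}
// Hypothetical sketch of the kind of indexing/query client the reporter describes.
// ZooKeeper address, collection name, field names and counts are assumptions.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ChaosIndexAndQuery {
  public static void main(String[] args) throws Exception {
    CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
    solr.setDefaultCollection("customers");

    long expected = 0;
    for (int round = 0; round < 1000; round++) {
      // Index a small batch of documents.
      for (int i = 0; i < 100; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "round-" + round + "-doc-" + i);
        doc.addField("name_s", "customer-" + i);
        solr.add(doc);
        expected++;
      }
      solr.commit();

      // Query everything back and compare against the number of docs indexed so far.
      QueryResponse rsp = solr.query(new SolrQuery("*:*"));
      long found = rsp.getResults().getNumFound();
      if (found != expected) {
        System.err.println("Mismatch after round " + round
            + ": expected " + expected + " docs, found " + found);
      }
      Thread.sleep(5000); // give the chaos monkey time to kill/restart nodes
    }
    solr.shutdown();
  }
}
{code}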