[ https://issues.apache.org/jira/browse/SOLR-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059138#comment-16059138 ]
Mihaly Toth commented on SOLR-7511:
-----------------------------------

One strategy could be to decide based on the replication factor. If it is, say, 3, it should be safe enough to delete a corrupted index. If it is 1, a new index should only be created when there is enough space. Such a strategy (keep old / purge / conditional) could also be made configurable. A rough sketch of this decision is included at the end of this message, after the quoted issue description.

The other side of the coin is how a bad node is handled by the leader candidate. Wouldn't it make sense to close out nodes with which replication fails?

> Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7511
>                 URL: https://issues.apache.org/jira/browse/SOLR-7511
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3
>            Reporter: Hrishikesh Gadre
>
> I have a working chaos-monkey setup which is killing (and restarting) solr and data nodes in a round-robin fashion periodically. I wrote a simple Solr client to periodically index and query a bunch of documents. After executing the test for some time, Solr returns an incorrect number of documents. In the background, I see the following errors:
>
> org.apache.solr.common.SolrException: Error opening new searcher
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
>         ... 8 more
> Caused by: java.io.EOFException: read past EOF
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
>         at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>         at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
>         at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
>         at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
>         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
>         at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
>
> The issue here is that the index state of one of the replicas is corrupt (verified using the Lucene CheckIndex tool). Hence Solr is not able to load the core on that particular instance.
> Interestingly, when the other sane replica comes online, it tries to do a peer-sync with this failing replica, gets an error, and also moves to the recovering state. As a result this particular shard is completely unavailable for read/write requests.
> Here are sample log entries on this sane replica:
>
> Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
>         at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
>
> 2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
> 2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
> 2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
>
> I am able to reproduce this problem consistently.
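
To make the first suggestion above more concrete, here is a rough sketch of what a configurable keep old / purge / conditional decision keyed on the replication factor might look like. None of these types or names exist in Solr; they are purely illustrative placeholders, and where the replication factor and free disk space would actually come from is left open.

{code:java}
/**
 * Hypothetical sketch only -- none of these types exist in Solr. It merely
 * illustrates a configurable keep-old / purge / conditional policy for a
 * replica whose local index fails to open, keyed on the replication factor.
 */
enum CorruptIndexPolicy { KEEP_OLD, PURGE, CONDITIONAL }

enum RecoveryAction {
  DELETE_AND_RESYNC,   // throw the corrupt index away and replicate from a peer
  NEW_INDEX_KEEP_OLD,  // build a fresh index, keep the corrupt one for inspection
  LEAVE_AS_IS          // do nothing automatically; leave it to the operator
}

final class CorruptIndexRecoveryStrategy {

  private final CorruptIndexPolicy policy;

  CorruptIndexRecoveryStrategy(CorruptIndexPolicy policy) {
    this.policy = policy;
  }

  RecoveryAction decide(int replicationFactor, boolean enoughDiskForNewIndex) {
    switch (policy) {
      case PURGE:
        return RecoveryAction.DELETE_AND_RESYNC;
      case KEEP_OLD:
        return RecoveryAction.LEAVE_AS_IS;
      case CONDITIONAL:
      default:
        if (replicationFactor >= 3) {
          // With e.g. replicationFactor=3 there should still be two good
          // copies, so dropping the corrupt index and re-replicating is
          // reasonably safe.
          return RecoveryAction.DELETE_AND_RESYNC;
        }
        if (replicationFactor == 1 && enoughDiskForNewIndex) {
          // Only copy of the data: keep the corrupt index around and create
          // a new one next to it rather than destroying the only copy.
          return RecoveryAction.NEW_INDEX_KEEP_OLD;
        }
        return RecoveryAction.LEAVE_AS_IS;
    }
  }
}
{code}

With something like this, a cluster with enough healthy copies could purge and re-replicate automatically, while a replicationFactor=1 setup would keep the corrupt index and only build a fresh one alongside it when disk space allows.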