[ https://issues.apache.org/jira/browse/SOLR-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533302#comment-14533302 ]
Hrishikesh Gadre commented on SOLR-7511:
----------------------------------------
I think we can improve the error handling in Solr in the following ways:
(a) If Solr is unable to load a core due to index corruption, we could still
allow the core loading to succeed, but put the core into a recovery mode (i.e.
unavailable for read/write requests); a rough sketch of this is below.
(b) If a Solr instance with such a corrupt index gets a peer-sync request, it
could inform the other replica that its index state is corrupt, so that the
other replica can take over the leader role and later send the updates back to
the first one.
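To make (a) and (b) concrete, here is a minimal, self-contained sketch of the
idea. None of these class or method names exist in Solr today (CorruptCoreState,
probeIndex and isIndexCorrupt are hypothetical); the only real APIs assumed are
Lucene 4.10.x Directory/DirectoryReader:

    import java.io.EOFException;
    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Hypothetical sketch only -- not existing Solr code.
    public class CorruptCoreState {

        private final AtomicBoolean indexCorrupt = new AtomicBoolean(false);

        // (a) Probe the index during core load. On corruption, record the fact
        // so the caller can finish loading the core in a recovery-only state
        // instead of propagating "Error opening new searcher".
        public boolean probeIndex(File indexDir) {
            try (Directory dir = FSDirectory.open(indexDir);
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                indexCorrupt.set(false);
            } catch (CorruptIndexException | EOFException e) {
                indexCorrupt.set(true);  // corrupt: keep the core up, but only to be recovered
            } catch (IOException e) {
                indexCorrupt.set(true);  // conservative: treat other read failures the same way
            }
            return !indexCorrupt.get();
        }

        // (b) A peer-sync handler could consult this flag and tell the
        // requesting replica that this index is corrupt, so the other replica
        // becomes the leader and later replicates a good index back to this one.
        public boolean isIndexCorrupt() {
            return indexCorrupt.get();
        }
    }

The point is that the corruption is recorded instead of propagated, so core
loading can finish and the recovery/replication machinery decides what to do
next; the same flag is what a peer-sync handler would surface for (b).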
Any thoughts?
> Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
> -----------------------------------------------------------------------------------
>
> Key: SOLR-7511
> URL: https://issues.apache.org/jira/browse/SOLR-7511
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10.3
> Reporter: Hrishikesh Gadre
>
> I have a working chaos-monkey setup which periodically kills (and restarts)
> Solr and data nodes in round-robin fashion. I wrote a simple Solr client to
> periodically index and query a bunch of documents. After running the test for
> some time, Solr returns an incorrect number of documents. In the background, I
> see the following errors:
> org.apache.solr.common.SolrException: Error opening new searcher
> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
> ... 8 more
> Caused by: java.io.EOFException: read past EOF
> at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
> at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
> at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
> at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
> at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
> at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
> at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
> at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
> at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
> The issue here is that the index of one of the replicas is corrupt (verified
> using the Lucene CheckIndex tool). Hence Solr is not able to load the core on
> that particular instance.
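> (For reference, the corruption check with Lucene 4.10.x can be done roughly
> like the following; the index path is a placeholder. The same check is also
> available from the command line via
> "java -cp lucene-core-4.10.3.jar org.apache.lucene.index.CheckIndex <indexDir>".)
>
>     import java.io.File;
>     import org.apache.lucene.index.CheckIndex;
>     import org.apache.lucene.store.Directory;
>     import org.apache.lucene.store.FSDirectory;
>
>     public class VerifyIndex {
>         public static void main(String[] args) throws Exception {
>             // args[0] = path to the replica's index directory, e.g. .../data/index
>             Directory dir = FSDirectory.open(new File(args[0]));
>             CheckIndex checker = new CheckIndex(dir);
>             checker.setInfoStream(System.out);               // per-segment diagnostics
>             CheckIndex.Status status = checker.checkIndex();
>             System.out.println(status.clean ? "index is clean" : "index is CORRUPT");
>             dir.close();
>         }
>     }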
> Interestingly, when the other (sane) replica comes online, it tries to do a
> peer-sync with this failing replica and gets an error, so it also moves to the
> recovering state. As a result this particular shard is completely unavailable
> for read/write requests. Here are some sample log entries from the sane
> replica:
> Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
> at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
> 2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
> 2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
> 2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
> I am able to reproduce this problem consistently.