Hrishikesh Gadre created SOLR-7511:
--------------------------------------
Summary: Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
Key: SOLR-7511
URL: https://issues.apache.org/jira/browse/SOLR-7511
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.3
Reporter: Hrishikesh Gadre
I have a working chaos-monkey setup that periodically kills (and restarts) Solr and data nodes in a round-robin fashion. I wrote a simple Solr client to periodically index and query a bunch of documents (a sketch of that client loop follows the stack trace below). After running the test for some time, Solr returns an incorrect number of documents, and in the background I see the following errors:
org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
    ... 8 more
Caused by: java.io.EOFException: read past EOF
    at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
    at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
    at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
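
As mentioned above, the test client is essentially a loop of the following shape. This is only a minimal sketch of what the client does, not the actual test code: the ZooKeeper ensemble address is a placeholder, and the collection name "customers" is taken from the core names in the logs below.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ChaosIndexQueryClient {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; "customers" matches the cores seen in the logs.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
        solr.setDefaultCollection("customers");

        long indexed = 0;
        while (true) {
            try {
                // Index a small batch of documents and commit.
                for (int i = 0; i < 100; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + indexed);
                    solr.add(doc);
                    indexed++;
                }
                solr.commit();

                // Query everything back and compare against what was indexed.
                QueryResponse rsp = solr.query(new SolrQuery("*:*"));
                long found = rsp.getResults().getNumFound();
                if (found != indexed) {
                    System.err.println("Mismatch: indexed=" + indexed + ", numFound=" + found);
                }
            } catch (Exception e) {
                // Failures are expected while the chaos monkey is restarting nodes.
                System.err.println("Request failed: " + e);
            }
            Thread.sleep(5000);
        }
    }
}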
The issue here is that the index state for one of the replicas is corrupt (verified using the Lucene CheckIndex tool), so Solr is not able to load the core on that particular instance.
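
For reference, the corruption check was along the following lines. This is a rough sketch, assuming the replica's index files are accessible as a local path (the path below is a placeholder; the actual index sits behind the HDFS block cache):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VerifyReplicaIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the failing replica's index files (e.g. copied out of HDFS).
        Directory dir = FSDirectory.open(new File("/path/to/customers_shard1_replica1/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex();
        System.out.println(status.clean ? "Index is clean" : "Index is CORRUPT");
        dir.close();
    }
}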
Interestingly, when the other, sane replica comes online, it tries to peer-sync with this failing replica, gets an error, and also moves into the recovering state. As a result, this particular shard is completely unavailable for read/write requests (a query sketch for checking each replica directly follows the log excerpt). Here are some sample log entries from the sane replica:
Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
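
To confirm that the whole shard (and not just the corrupt replica) is unavailable, each replica's core can be queried directly with distrib=false. A rough SolrJ sketch follows; the core names and the second host are taken from the logs above, while the first host is a placeholder:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CheckShardReplicas {
    public static void main(String[] args) {
        // Core URLs for the two shard1 replicas (first host is a placeholder).
        String[] coreUrls = {
            "http://solr-host-1:8983/solr/customers_shard1_replica1",
            "http://ssl-systests-3.ent.cloudera.com:8983/solr/customers_shard1_replica2"
        };
        for (String url : coreUrls) {
            HttpSolrServer core = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false"); // query only this replica, no distributed fan-out
            try {
                long numFound = core.query(q).getResults().getNumFound();
                System.out.println(url + " -> numFound=" + numFound);
            } catch (Exception e) {
                System.out.println(url + " -> unavailable: " + e.getMessage());
            } finally {
                core.shutdown();
            }
        }
    }
}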
I am able to reproduce this problem consistently.