Hrishikesh Gadre created SOLR-7511:
--------------------------------------
Summary: Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
Key: SOLR-7511
URL: https://issues.apache.org/jira/browse/SOLR-7511
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.3
Reporter: Hrishikesh Gadre
I have a working chaos-monkey setup that periodically kills (and restarts) Solr and data nodes in a round-robin fashion. I wrote a simple Solr client to periodically index and query a bunch of documents (a sketch of that client loop follows the stack trace below). After running the test for some time, Solr returns an incorrect number of documents, and in the background I see the following errors:
org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
    ... 8 more
Caused by: java.io.EOFException: read past EOF
    at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
    at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
    at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
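
As mentioned above, the test client is essentially a loop of the following shape. This is only a minimal sketch of what the client does, not the actual test code: the ZooKeeper ensemble address is a placeholder, and the collection name "customers" is taken from the core names in the logs below.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ChaosIndexQueryClient {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; "customers" matches the cores seen in the logs.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
        solr.setDefaultCollection("customers");

        long indexed = 0;
        while (true) {
            try {
                // Index a small batch of documents and commit.
                for (int i = 0; i < 100; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + indexed);
                    solr.add(doc);
                    indexed++;
                }
                solr.commit();

                // Query everything back and compare against what was indexed.
                QueryResponse rsp = solr.query(new SolrQuery("*:*"));
                long found = rsp.getResults().getNumFound();
                if (found != indexed) {
                    System.err.println("Mismatch: indexed=" + indexed + ", numFound=" + found);
                }
            } catch (Exception e) {
                // Failures are expected while the chaos monkey is restarting nodes.
                System.err.println("Request failed: " + e);
            }
            Thread.sleep(5000);
        }
    }
}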
The issue here is that the index state for one of the replicas is corrupt (verified using the Lucene CheckIndex tool), so Solr is not able to load the core on that particular instance.
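
For reference, the corruption check was along the following lines. This is a rough sketch, assuming the replica's index files are accessible as a local path (the path below is a placeholder; the actual index sits behind the HDFS block cache):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VerifyReplicaIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the failing replica's index files (e.g. copied out of HDFS).
        Directory dir = FSDirectory.open(new File("/path/to/customers_shard1_replica1/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out); // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex();
        System.out.println(status.clean ? "Index is clean" : "Index is CORRUPT");
        dir.close();
    }
}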
Interestingly, when the other, sane replica comes online, it tries to peer-sync with this failing replica, gets an error, and also moves into the recovering state. As a result, this particular shard is completely unavailable for read/write requests (a query sketch for checking each replica directly follows the log excerpt). Here are some sample log entries from the sane replica:
Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
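
To confirm that the whole shard (and not just the corrupt replica) is unavailable, each replica's core can be queried directly with distrib=false. A rough SolrJ sketch follows; the core names and the second host are taken from the logs above, while the first host is a placeholder:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CheckShardReplicas {
    public static void main(String[] args) {
        // Core URLs for the two shard1 replicas (first host is a placeholder).
        String[] coreUrls = {
            "http://solr-host-1:8983/solr/customers_shard1_replica1",
            "http://ssl-systests-3.ent.cloudera.com:8983/solr/customers_shard1_replica2"
        };
        for (String url : coreUrls) {
            HttpSolrServer core = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false"); // query only this replica, no distributed fan-out
            try {
                long numFound = core.query(q).getResults().getNumFound();
                System.out.println(url + " -> numFound=" + numFound);
            } catch (Exception e) {
                System.out.println(url + " -> unavailable: " + e.getMessage());
            } finally {
                core.shutdown();
            }
        }
    }
}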
I am able to reproduce this problem consistently.