[ https://issues.apache.org/jira/browse/SOLR-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059138#comment-16059138 ]

Mihaly Toth commented on SOLR-7511:
-----------------------------------

One strategy could be to decide based on the replication factor. With a 
replication factor of, say, 3, it should be safe enough to delete a corrupted 
index; with a replication factor of 1, a new index should only be created if 
there is enough disk space.

Such a strategy (keep old / purge / conditional) could also be made 
configurable. A minimal sketch of the decision logic follows.
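
To illustrate, here is a rough sketch of that decision, assuming a configurable 
policy. None of the names below (Policy, onCorruptIndex, purgeAndResyncFromLeader, 
rebuildIndexAlongsideCorruptCopy) exist in Solr; they only stand in for whatever 
the core-loading code would do when it detects a corrupt index:

/**
 * Hypothetical sketch only -- none of these names exist in Solr; they stand in
 * for whatever the core-loading code would do on detecting a corrupt index.
 */
public class CorruptIndexRecoverySketch {

    enum Policy { KEEP_OLD, PURGE, CONDITIONAL }   // the configurable strategy

    private final Policy policy;

    CorruptIndexRecoverySketch(Policy policy) {
        this.policy = policy;
    }

    void onCorruptIndex(int replicationFactor, long indexSizeBytes, long freeDiskBytes) {
        switch (policy) {
            case KEEP_OLD:
                // Leave the corrupt index in place for manual inspection; the core stays down.
                break;
            case PURGE:
                purgeAndResyncFromLeader();
                break;
            case CONDITIONAL:
                if (replicationFactor >= 3) {
                    // Enough healthy replicas exist elsewhere, so deleting the corrupt copy is safe.
                    purgeAndResyncFromLeader();
                } else if (freeDiskBytes > indexSizeBytes) {
                    // Single copy: keep the corrupt data aside and build a fresh index next to it.
                    rebuildIndexAlongsideCorruptCopy();
                }
                break;
        }
    }

    private void purgeAndResyncFromLeader()         { /* hypothetical */ }
    private void rebuildIndexAlongsideCorruptCopy() { /* hypothetical */ }
}

The CONDITIONAL branch captures the replication-factor idea: with enough healthy 
replicas the corrupt copy can simply be thrown away, otherwise it is only rebuilt 
when the disk can hold a second copy.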

The other side of the coin is how a bad node is handled by the leader 
candidate. Would it not make sense to take replicas with which replication 
fails out of rotation?


> Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7511
>                 URL: https://issues.apache.org/jira/browse/SOLR-7511
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3
>            Reporter: Hrishikesh Gadre
>
> I have a working chaos-monkey setup which periodically kills (and restarts) 
> Solr and data nodes in a round-robin fashion. I wrote a simple Solr client to 
> periodically index and query a bunch of documents. After running the test for 
> some time, Solr returns an incorrect number of documents. In the background, 
> I see the following errors:
> org.apache.solr.common.SolrException: Error opening new searcher
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
>         ... 8 more
> Caused by: java.io.EOFException: read past EOF
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
>         at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>         at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
>         at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
>         at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
>         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
>         at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
> The issue here is that the index state for one of the replicas is corrupt 
> (verified using the Lucene CheckIndex tool, as sketched below), so Solr is not 
> able to load the core on that particular instance.
> Interestingly, when the other sane replica comes online, it tries to peer-sync 
> with this failing replica, gets an error, and also moves into the recovering 
> state. As a result this particular shard is completely unavailable for 
> read/write requests. Here are sample log entries from the sane replica:
> Error opening new searcher,trace=org.apache.solr.common.SolrException: 
> SolrCore 'customers_shard1_replica1' is not available due to init failure: 
> Error opening new searcher
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
>         at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
> 2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: 
> core=customers_shard1_replica2 
> url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
> 2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's 
> attempt to sync with shard failed, moving to the next candidate
> 2015-05-07 12:41:50,007 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better 
> leader candidate than us - going back into recovery
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling 
> election 
> /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
> 2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: 
> Running recovery - first canceling any ongoing recovery
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last 
> recovery attempt started 2685ms ago.
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling 
> recovery attempts - waiting for 7314ms
> I am able to reproduce this problem consistently.
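
For reference, the corruption check mentioned in the description (the Lucene 
CheckIndex tool) can be run against the failing core's index directory, either 
from the command line with something like 
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex <indexDir>, or 
programmatically. A minimal sketch against the Lucene 4.10 API; the index path 
below is a placeholder:

import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch of verifying a replica's index with Lucene's CheckIndex
// (Lucene 4.10 API). The path is a placeholder for the failing core's index dir.
public class CheckReplicaIndex {
    public static void main(String[] args) throws Exception {
        File indexDir = new File("/path/to/solr/customers_shard1_replica1/data/index");
        try (Directory dir = FSDirectory.open(indexDir)) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);              // per-segment diagnostics
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "index is clean" : "index is corrupt");
        }
    }
}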


