Hello,

We are currently running some benchmarks on Solr 5.4.0 and we have hit several
issues related to SolrCloud that lead to recoveries and inconsistencies.
Based on our tests, this version seems to be less stable under load than the
4.10.4 version we had installed previously.
We were able to mitigate the effects by increasing numRecordsToKeep in the
update log and by limiting replication bandwidth (a rough sketch of the update
log change is below).
However, not all problems were resolved and, more worryingly, it is now harder
to get the cluster back into a healthy state.
For example, we ended up with a shard whose leader was down while all of its
replicas were marked active.
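
For reference, the update log part of that change looks roughly like the
following in solrconfig.xml (the value is only an example, not the exact
figure we use):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- Keep more entries in the transaction log so a replica that comes
         back can catch up via PeerSync instead of falling back to a full
         index replication (the default is 100; 10000 is just an example). -->
    <int name="numRecordsToKeep">10000</int>
  </updateLog>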

We found a particular pattern that leads to a bad cluster state, described 
here: 
https://issues.apache.org/jira/browse/SOLR-8129?focusedCommentId=15119905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15119905

There are also a lot of open issues (or issues resolved in 5.5) related to
SolrCloud / ZooKeeper / replication.

Here is a (non-exhaustive) list I could gather from JIRA:

SOLR-8129 - HdfsChaosMonkeyNothingIsSafeTest failures
  https://issues.apache.org/jira/browse/SOLR-8129
SOLR-8461 - CloudSolrStream and ParallelStream can choose replicas that are
  not active
  https://issues.apache.org/jira/browse/SOLR-8461
SOLR-8619 - A new replica should not become leader when all current replicas
  are down as it leads to data loss
  https://issues.apache.org/jira/browse/SOLR-8619
SOLR-3274 - ZooKeeper related SolrCloud problems
  https://issues.apache.org/jira/browse/SOLR-3274
SOLR-6406 - ConcurrentUpdateSolrServer hang in blockUntilFinished
  https://issues.apache.org/jira/browse/SOLR-6406
SOLR-8173 - CLONE - Leader recovery process can select the wrong leader if
  all replicas for a shard are down and trying to recover, as well as lose
  updates that should have been recovered
  https://issues.apache.org/jira/browse/SOLR-8173
SOLR-8371 - Try and prevent too many recovery requests from stacking up and
  clean up some faulty logic
  https://issues.apache.org/jira/browse/SOLR-8371
SOLR-7121 - Solr nodes should go down based on configurable thresholds and
  not rely on resource exhaustion
  https://issues.apache.org/jira/browse/SOLR-7121
SOLR-8586 - Implement hash over all documents to check for shard
  synchronization
  https://issues.apache.org/jira/browse/SOLR-8586

I wonder whether all these issues could be addressed by a general refactoring
of this code rather than by individual patches for every issue.
I know these issues are not easy to reproduce and debug, and I'm not aware of
all the implications of this kind of work.
We are willing to contribute on these issues, although our knowledge of Solr
internals may still be too limited for such an important part of the SolrCloud
architecture.
We can provide logs and benchmarks that lead to inconsistencies and/or bad 
cluster states.
We seem to get better behaviour with a 5-node ZooKeeper ensemble than with a
3-node one.
However, there is no sign of any problem on the ZooKeeper side when these
errors occur in Solr.

Regards,
Stephan
