Hello all,

I'm running SolrCloud (1 shard, 9 replicas) on Amazon EKS.

The other day I accidentally stopped CoreDNS on EKS, and the entire Solr cluster went down because the nodes could no longer resolve each other's names. I restarted CoreDNS shortly afterwards, but the Solr nodes kept cycling between "down" and "recovering" and did not return to a normal state on their own.

Solr was still receiving search requests throughout this time, so I stopped the search traffic completely. I then ran DELETEREPLICA to reduce the number of Solr nodes to one, added the replicas back a few at a time, and resumed search traffic once the cluster was fully back to its original state. No further problems have occurred since.

During the failure, the JVM thread count on each node was stuck at 10000. Since the load was very high, that is probably why each node kept cycling between down and recovering.

If I reduced (or increased) this JVM thread limit, would the Solr cluster have returned to a normal state automatically? If so, which setting in solrconfig.xml should I change to reduce (or increase) it? I think "maxConnectionsPerHost" and "maximumPoolSize" are related to this issue, but I'm not sure about the difference between the two.

Any help would be appreciated.

Thanks,
Issei
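
P.S. For reference, my understanding from the Solr Reference Guide is that both of the settings I mentioned belong to the HttpShardHandlerFactory, which can be configured under a search handler in solrconfig.xml. A rough sketch of what I am asking about is below. The values are placeholders for illustration only; they are not taken from my cluster and are not recommendations.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <!-- placeholder value, for illustration only -->
    <int name="maxConnectionsPerHost">20</int>
    <!-- placeholder value, for illustration only -->
    <int name="maximumPoolSize">100</int>
  </shardHandlerFactory>
</requestHandler>
```

I have not confirmed whether either of these is what limits the JVM threads to 10000, which is part of what I would like to understand.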