[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926204#comment-15926204 ]
Joshua Humphries commented on SOLR-7191:
----------------------------------------
Our cluster has many thousands of collections, most of which have only a single
shard and a single replica. Restarting a single node takes over two minutes in
good circumstances (an expected restart, such as during a Solr upgrade or the
deployment of new or updated plugins). In bad circumstances, such as when
machines appear wedged and leader election issues have already caused the
overseer queue to grow large, restarting a server can take over 10 minutes!
While watching the overseer queue size during our latest observation of this
slowness, I saw that the down node messages take *way* too long to process. I
tracked that to an issue where each down node message results in a ZK write
for *every* collection, not just the collections that have shard replicas on
that node. In our case, it was processing about 40 times too many collections,
making a rolling restart of the whole cluster effectively O(n^2) instead of
O(n) in terms of the writes to ZK.
See SOLR-10277.
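To make the O(n^2) versus O(n) claim concrete, here is a rough sketch that only counts writes. It is a toy model, not Solr's overseer code: the function names, the dict-based cluster state, and the 40 nodes with 100 single-replica collections each are all assumptions, chosen so that the ratio matches the "about 40 times" figure above.

```python
from typing import Callable, Dict, List

# Hypothetical model: collection name -> nodes that host one of its replicas.
ClusterState = Dict[str, List[str]]


def build_state(num_nodes: int, collections_per_node: int) -> ClusterState:
    """Build a cluster where every collection has one shard and one replica."""
    state: ClusterState = {}
    for n in range(num_nodes):
        for c in range(collections_per_node):
            state[f"coll_{n}_{c}"] = [f"node{n}"]
    return state


def writes_for_all_collections(state: ClusterState, down_node: str) -> int:
    """Slow behavior described above: one ZK write per collection."""
    return len(state)


def writes_for_affected_collections(state: ClusterState, down_node: str) -> int:
    """One ZK write only for collections with a replica on the down node."""
    return sum(1 for nodes in state.values() if down_node in nodes)


def rolling_restart_writes(
    state: ClusterState,
    nodes: List[str],
    writes_for_down_node: Callable[[ClusterState, str], int],
) -> int:
    """Total writes when each node is taken down once, one after another."""
    return sum(writes_for_down_node(state, node) for node in nodes)


nodes = [f"node{n}" for n in range(40)]
state = build_state(num_nodes=40, collections_per_node=100)

slow = rolling_restart_writes(state, nodes, writes_for_all_collections)
fast = rolling_restart_writes(state, nodes, writes_for_affected_collections)
print(slow, fast, slow // fast)  # 160000 4000 40
```

In this model the total number of collections grows with the number of nodes, so writing every collection for every down node gives a total that grows quadratically, while writing only the affected collections gives a total equal to the number of collections.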
> Improve stability and startup performance of SolrCloud with thousands of
> collections
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.0
> Reporter: Shawn Heisey
> Assignee: Noble Paul
> Labels: performance, scalability
> Fix For: 6.3
>
> Attachments: lots-of-zkstatereader-updates-branch_5x.log,
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch,
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3,
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many
> problems myself even before I was able to get 4000 collections created on a
> 5.0 example cloud setup. Restarting Solr takes a very long time, and it is
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance
> and scalability. It doesn't help that I'm running both Solr nodes on one
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.