[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353247#comment-15353247
 ] 

Scott Blum commented on SOLR-7191:
----------------------------------

This may be unrelated to the current patch work, but it seems relevant to the 
uber ticket:

I rebooted our Solr cluster the other night to pick up an update, and I ran 
into what seemed to be pathological behavior around state updates.  My first 
attempt to bring up everything at once resulted in utter deadlock, so I shut 
everything down, manually nuked all the overseer queues/maps in ZK, and started 
bringing them up one at a time.  What I saw was kind of astounding.

I was monitoring OVERSEERSTATUS and tracking the number of outstanding overseer 
ops + the total number of update_state ops, and I noticed that every VM I 
brought up needed ~4000 update_state ops to stabilize, despite the fact that 
each VM only manages ~128 cores.  We have 32 VMs with ~128 cores each, or ~4096 
cores in our entire cluster... it took over 100,000 update_state operations to 
bring the whole cluster up.  That seems... insane.  3 or 4 update_state ops per 
core would seem reasonable to me, but I saw over 30 ops per core loaded as I 
went.  This number was extremely consistent for every node I brought up.
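
As a rough sanity check of those numbers, here is a minimal sketch of the 
arithmetic.  The variable names are illustrative only, and the inputs are the 
approximate figures quoted above, not exact measurements:

```python
# Approximate figures taken from the observations above (not exact counts).
vms = 32
cores_per_vm = 128
update_state_ops_per_vm = 4000  # roughly what each VM needed to stabilize

total_cores = vms * cores_per_vm                        # cores in the cluster
total_ops = vms * update_state_ops_per_vm               # ops to bring up all VMs
ops_per_core = update_state_ops_per_vm / cores_per_vm   # ops per core loaded

print(total_cores)   # 4096
print(total_ops)     # 128000
print(ops_per_core)  # 31.25
```

That lines up with "over 100,000" operations in total and "over 30" ops per 
core, roughly ten times the 3 or 4 per core that would seem reasonable.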

> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Shalin Shekhar Mangar
>              Labels: performance, scalability
>         Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

