[ 
https://issues.apache.org/jira/browse/SOLR-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366690#comment-15366690
 ] 

Erick Erickson commented on SOLR-7280:
--------------------------------------

bq: I don't think it takes a weird topology - just more replicas than thread to 
load them in a shard.

OK, I think I see what you're saying. You're talking about a "deep" topology, 
i.e. one with many replicas of a particular shard on a particular instance, 
whereas I was looking at a "wide" topology: many collections per instance, but 
each shard has only a few replicas. I've seen both in the field, as I'm sure 
you have.

How much of both situations would be handled by creating an ordered list of all 
replicas that were leaders and loading those first then loading an ordered list 
of all replicas that weren't labeled as leader? There's still the case of a 
zillion leaders on a single instance, so some heuristic like you suggest seems 
to be in order.
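
To make the "leaders first" idea concrete, here's a rough sketch of the 
ordering only. Replica, coreName, leader and loadOrder are made-up names for 
illustration, not Solr's actual core-loading classes, and this does not 
address the zillion-leaders case or any leaderVoteWait interaction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LeaderFirstOrdering {

    /** Hypothetical stand-in for a core descriptor; NOT a Solr class. */
    static final class Replica {
        final String coreName;
        final boolean leader;

        Replica(String coreName, boolean leader) {
            this.coreName = coreName;
            this.leader = leader;
        }
    }

    /**
     * Returns a new list with all replicas marked as leader first,
     * then all non-leaders; each group is sorted by core name.
     */
    static List<Replica> loadOrder(List<Replica> replicas) {
        List<Replica> ordered = new ArrayList<>(replicas);
        ordered.sort(Comparator
                // Boolean.FALSE sorts before Boolean.TRUE, so negate the flag
                // to put leaders ahead of non-leaders.
                .comparing((Replica r) -> !r.leader)
                .thenComparing(r -> r.coreName));
        return ordered;
    }

    public static void main(String[] args) {
        List<Replica> cores = Arrays.asList(
                new Replica("coll2_shard1_replica1", false),
                new Replica("coll1_shard1_replica2", true),
                new Replica("coll1_shard1_replica1", false),
                new Replica("coll2_shard1_replica2", true));
        for (Replica r : loadOrder(cores)) {
            System.out.println(r.coreName + " leader=" + r.leader);
        }
    }
}
```

The load threads would then pull cores off this list in order, so the leaders 
get submitted before any of the non-leaders.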

I'll emphasize though that the current code (without this patch) can prevent a 
cluster from coming up at _all_. With this patch the cluster at least comes up, 
albeit slowly if the leaderVoteWait comes into play. Bumping the number of 
threads to > the max replicas for a shard can handle the case you mentioned, 
while keeping it "reasonable" can deal with the one I'm seeing.

That said, I think the default should be quite high in the cloud case so we 
don't change the current behavior, and situations like the one I'm seeing can 
be dealt with by configuring this. I think it defaults to 8 currently; perhaps 
100 (or unlimited) instead in cloud mode?

How much of all of the above makes this patch "good enough for now", perhaps 
with follow-ons for more sophisticated approaches?

> Load cores in sorted order and tweak coreLoadThread counts to improve cluster 
> stability on restarts
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7280
>                 URL: https://issues.apache.org/jira/browse/SOLR-7280
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>             Fix For: 5.2, 6.0
>
>         Attachments: SOLR-7280.patch
>
>
> In SOLR-7191, Damien mentioned that by loading solr cores in a sorted order 
> and tweaking some of the coreLoadThread counts, he was able to improve the 
> stability of a cluster with thousands of collections. We should explore some 
> of these changes and fold them into Solr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

