[ 
https://issues.apache.org/jira/browse/SOLR-8973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241188#comment-15241188
 ] 

Shalin Shekhar Mangar commented on SOLR-8973:
---------------------------------------------

Thanks for the bug report, [~janmejay]! There is definitely a race here, but 
instead of waiting for the collection to become visible (which has its own 
problem, described below), we can simply call 
`ZkStateReader.addCollectionWatch` unconditionally, without first checking 
whether the collection already exists. That forces ZkStateReader to fetch the 
collection state from ZK and cache it.

The problem with waiting for the collection, as this patch does, is that 
collections are also allowed to be created directly through the core admin API. 
In that case the Overseer creates the collection in ZK when the core publishes 
its state, so waiting for the collection beforehand results in spurious waits.
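
To make the difference between the lazy fallback and an unconditional watch concrete, here is a minimal toy model. It is not Solr code: `ToyStateReader`, `publish`, `getState`, and `zkFetchCount` are hypothetical names invented for illustration, the in-memory map merely stands in for ZK, and only the method name `addCollectionWatch` is borrowed from the real class. The sketch assumes a watch may be registered before the collection exists and that each watch notification costs one re-fetch.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for a cluster-state reader; NOT Solr's ZkStateReader. */
class ToyStateReader {
    // Hypothetical in-memory substitute for ZooKeeper: collection -> state.
    private final Map<String, String> zk = new ConcurrentHashMap<>();
    // Locally cached state for watched collections.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Set<String> watched = ConcurrentHashMap.newKeySet();
    private final AtomicInteger zkFetches = new AtomicInteger();

    // Every call here models one network round trip to ZK.
    private String fetchFromZk(String collection) {
        zkFetches.incrementAndGet();
        return zk.get(collection);
    }

    /** Registers the watch without checking whether the collection exists. */
    void addCollectionWatch(String collection) {
        watched.add(collection);
        String state = fetchFromZk(collection);
        if (state != null) {
            cache.put(collection, state);
        }
    }

    /** Models a state change in ZK; watchers re-fetch once per change. */
    void publish(String collection, String state) {
        zk.put(collection, state);
        if (watched.contains(collection)) {
            cache.put(collection, fetchFromZk(collection));
        }
    }

    /** Watched collections are served from cache; others hit ZK every time. */
    String getState(String collection) {
        if (watched.contains(collection)) {
            return cache.get(collection);
        }
        return fetchFromZk(collection);
    }

    int zkFetchCount() {
        return zkFetches.get();
    }
}
```

In this model, a reader that never registers a watch performs one fetch per `getState` call, so fetches grow with request volume. A reader that calls `addCollectionWatch` before the collection is visible performs one fetch at registration and one more per published change, regardless of how many requests follow.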

> TX-frenzy on Zookeeper when collection is put to use
> ----------------------------------------------------
>
>                 Key: SOLR-8973
>                 URL: https://issues.apache.org/jira/browse/SOLR-8973
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, master, 5.6
>            Reporter: Janmejay Singh
>            Assignee: Shalin Shekhar Mangar
>              Labels: collections, patch-available, solrcloud, zookeeper
>         Attachments: SOLR-8973.patch
>
>
> This is to do with a distributed data-race. Core-creation happens at a time 
> when collection is not yet visible to the node. In this case a fallback 
> code-path is used which de-references collection-state lazily (on demand) as 
> opposed to setting a watch and keeping it cached locally.
> Due to this, as requests towards the core mount, it generates ZK fetches for 
> the collection proportionately. On a large solr-cloud cluster, this generates 
> several Gbps of TX traffic on ZK nodes. This affects indexing 
> throughput (which floors) in addition to running ZK nodes out of network 
> bandwidth. 
> On smaller solr-cloud clusters it's hard to run into, because the probability 
> of this race materializing is lower.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
