[
https://issues.apache.org/jira/browse/HBASE-11935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Hofhansl updated HBASE-11935:
----------------------------------
Attachment: 11935-0.98-v2.txt
This one has held up to initial testing.
Will post a trunk patch soon and report back with more actual cluster testing
tomorrow.
The core of the change is that we no longer schedule every single queue on its
own thread (there might be 1000s or even 100s of 1000s of them). Instead we
schedule one thread per failed RS and handle all of that RS's queues inside the
same thread (by calling ReplicationSource's run() inline).
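As a rough illustration of that approach, here is a minimal sketch (class and
field names are hypothetical, not taken from the patch): one worker thread per
failed region server, draining all of that server's recovered queues
sequentially by invoking each source's run() inline.

{code:java}
import java.util.List;

// Hypothetical sketch only; names and structure are illustrative, not the patch.
public class FailedServerWorker extends Thread {

  private final String failedServerName;         // region server whose queues were adopted
  private final List<Runnable> recoveredSources; // one ReplicationSource per recovered queue

  public FailedServerWorker(String failedServerName, List<Runnable> recoveredSources) {
    super("replication-failover-" + failedServerName);
    this.failedServerName = failedServerName;
    this.recoveredSources = recoveredSources;
  }

  @Override
  public void run() {
    // All queues of the failed server share this single thread, so the number
    // of threads (and hence ZK connections to the slave cluster) is bounded by
    // the number of failed servers rather than the number of queues.
    for (Runnable source : recoveredSources) {
      source.run(); // run() is invoked inline instead of starting a new thread
    }
  }
}
{code}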
While looking at the code, we all felt it is time to rethink the architecture
and rewrite the replication source side from scratch. It has reached the point
where incremental changes are no longer appropriate... But that is for another
jira.
> ZooKeeper connection storm after queue failover with slave cluster down
> -----------------------------------------------------------------------
>
> Key: HBASE-11935
> URL: https://issues.apache.org/jira/browse/HBASE-11935
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.99.0, 2.0.0, 0.94.23, 0.98.6
> Reporter: Lars Hofhansl
> Assignee: Jesse Yates
> Priority: Critical
> Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>
> Attachments: 11935-0.98-v2.txt, hbase-11935-0.98-v0.patch,
> hbase-11935-0.98-v1.patch, hbase-11935-trunk-v0.patch,
> hbase-11935-trunk-v1.patch, hbase-11935-trunk-v2.patch
>
>
> We just ran into a production incident with TCP SYN storms on port 2181
> (zookeeper).
> In our case the slave cluster was not running. When we bounced the primary
> cluster we saw an "unbounded" number of failover threads all hammering the
> slave cluster's ZK hosts (which did not run ZK at the time), causing overall
> degradation of network performance between the datacenters.
> Looking at the code, we noticed that the thread pool handling of the failover
> workers was probably unintended (a sketch of that pattern follows below).
> Patch coming soon.
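For context on the failure mode described above, a hypothetical sketch (invented
names, not the actual HBase code) of the unintended pattern: every recovered
queue is submitted to an effectively unbounded pool, and each task opens its own
connection to the slave cluster's ZK quorum, so with the slave down the number
of connection (SYN) attempts scales with the number of recovered queues.

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the pre-patch failure mode; names are invented.
public class UnboundedFailoverSketch {

  public static void handleFailedServer(List<String> recoveredQueues) {
    // A cached pool has no upper bound on threads: one thread per recovered queue.
    ExecutorService pool = Executors.newCachedThreadPool();
    for (final String queueId : recoveredQueues) {
      pool.submit(() -> replicateQueue(queueId));
    }
  }

  private static void replicateQueue(String queueId) {
    // Stand-in for the per-queue replication work: each thread establishes its
    // own ZK connection to the slave cluster and retries while the peer is
    // unreachable, so connection attempts grow with the number of queues.
  }
}
{code}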
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)