[
https://issues.apache.org/jira/browse/HBASE-11935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Hofhansl updated HBASE-11935:
----------------------------------
Attachment: 11935-0.98-v2.txt
This one has held up to initial testing.
Will post a trunk patch soon and report back with more actual cluster testing
tomorrow.
The core of the change is that we no longer schedule every single queue on its
own thread (there might be 1000s or even 100s of 1000s of them). Instead we
schedule one thread per failed RS and handle all of that RS's queues inside the
same thread (by calling ReplicationSource's run() inline).
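As a rough illustration of that approach, here is a minimal sketch (class and
field names are hypothetical, not taken from the patch): one worker thread per
failed region server, draining all of that server's recovered queues
sequentially by invoking each source's run() inline.

{code:java}
import java.util.List;

// Hypothetical sketch only; names and structure are illustrative, not the patch.
public class FailedServerWorker extends Thread {

  private final String failedServerName;         // region server whose queues were adopted
  private final List<Runnable> recoveredSources; // one ReplicationSource per recovered queue

  public FailedServerWorker(String failedServerName, List<Runnable> recoveredSources) {
    super("replication-failover-" + failedServerName);
    this.failedServerName = failedServerName;
    this.recoveredSources = recoveredSources;
  }

  @Override
  public void run() {
    // All queues of the failed server share this single thread, so the number
    // of threads (and hence ZK connections to the slave cluster) is bounded by
    // the number of failed servers rather than the number of queues.
    for (Runnable source : recoveredSources) {
      source.run(); // run() is invoked inline instead of starting a new thread
    }
  }
}
{code}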
While looking at the code, we all felt it is time to rethink the architecture
and rewrite the replication source side from scratch. It has reached the point
where incremental changes are no longer appropriate... But that is for another
jira.
> ZooKeeper connection storm after queue failover with slave cluster down
> -----------------------------------------------------------------------
>
> Key: HBASE-11935
> URL: https://issues.apache.org/jira/browse/HBASE-11935
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.99.0, 2.0.0, 0.94.23, 0.98.6
> Reporter: Lars Hofhansl
> Assignee: Jesse Yates
> Priority: Critical
> Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>
> Attachments: 11935-0.98-v2.txt, hbase-11935-0.98-v0.patch,
> hbase-11935-0.98-v1.patch, hbase-11935-trunk-v0.patch,
> hbase-11935-trunk-v1.patch, hbase-11935-trunk-v2.patch
>
>
> We just ran into a production incident with TCP SYN storms on port 2181
> (zookeeper).
> In our case the slave cluster was not running. When we bounced the primary
> cluster we saw an "unbounded" number of failover threads all hammering the
> slave cluster's ZK hosts (which did not run ZK at the time), causing overall
> degradation of network performance between the datacenters.
> Looking at the code, we noticed that the thread pool handling of the failover
> workers was probably unintended (a sketch of that pattern follows below).
> Patch coming soon.
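For context on the failure mode described above, a hypothetical sketch (invented
names, not the actual HBase code) of the unintended pattern: every recovered
queue is submitted to an effectively unbounded pool, and each task opens its own
connection to the slave cluster's ZK quorum, so with the slave down the number
of connection (SYN) attempts scales with the number of recovered queues.

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the pre-patch failure mode; names are invented.
public class UnboundedFailoverSketch {

  public static void handleFailedServer(List<String> recoveredQueues) {
    // A cached pool has no upper bound on threads: one thread per recovered queue.
    ExecutorService pool = Executors.newCachedThreadPool();
    for (final String queueId : recoveredQueues) {
      pool.submit(() -> replicateQueue(queueId));
    }
  }

  private static void replicateQueue(String queueId) {
    // Stand-in for the per-queue replication work: each thread establishes its
    // own ZK connection to the slave cluster and retries while the peer is
    // unreachable, so connection attempts grow with the number of queues.
  }
}
{code}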
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)