[jira] [Commented] (ZOOKEEPER-1049) Session expire/close flooding renders heartbeats to delay significantly

Chang Song (JIRA) Fri, 15 Apr 2011 23:30:49 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020564#comment-13020564
 ]


Chang Song commented on ZOOKEEPER-1049:
---------------------------------------

bq. 1) are you running in a virtualized environment?

NO

{quote}
2) are you co-locating other services on the same host(s) that make up
the ZK serving cluster?
{quote}

NO. Dedicated ZK ensemble

{quote}
3) have you followed the admin guide's "things to avoid"?
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
In particular ensuring that you are not swapping or going into gc
pause (both on the server and the client)
a) try turning on GC logging and ensure that you are not going into GC
pause, see the troubleshooting guide, this is the most common cause of
high latency for the clients
{quote}

No full GC observed (per jstat output)

{quote}
b) ensure that you are not swapping
{quote}

No swapping. No significant IO (0-3% iowait utilization)


{quote}
c) ensure that other processes are not causing log writing
(transactional logging) to be slow.
{quote}

No other processes are running on these hosts.



> Session expire/close flooding renders heartbeats to delay significantly
> -----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1049
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1049
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.2
>         Environment: CentOS 5.3, three node ZK ensemble
>            Reporter: Chang Song
>            Priority: Critical
>         Attachments: zk_ping_latency.pdf
>
>
> Suppose 100 clients (group A) are connected to a three-node ZK 
> ensemble with a session timeout of 15 seconds, and 1000 clients (group 
> B) are connected to the same ensemble, all watching several nodes 
> (also with a 15-second session timeout).
> Consider a case in which all clients in group B hang or deadlock 
> (JVM OOME) at the same time. 15 seconds later, all sessions in group B 
> expire, creating a session-closing stampede. Depending on the number of 
> clients in group B, every request/response the ZK ensemble processes is 
> delayed by up to 8 seconds (measured with 1000 clients).
> This delay causes the sessions of some clients in group A to expire 
> because their heartbeat responses arrive late. As a result, healthy servers 
> drop out of their clusters. This is a serious problem in our installation, 
> since some of our services that run batch servers or CI servers create the 
> scenario above almost every day.
> I am attaching a graph showing the ping response time delay.
> I think the ordering of session creation/closing relative to ping exchange 
> isn't important (quorum state machine). At the least, ping requests and 
> responses should be handled independently (in a different queue and a 
> different thread) to preserve the real-time behavior of pings.
> As a workaround, we are raising the session timeout to 50 seconds.
> But this significantly increases the maximum failover time of the cluster, 
> so the initial QoS we promised cannot be met.
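
To make the "different queue and different thread" idea in the description concrete, here is a minimal sketch in plain Java. It is not ZooKeeper code and says nothing about how the server's request pipeline is actually built; the class PingLaneSketch, the method pingLatencyMillis, and the simulated per-session close cost are all hypothetical names and numbers chosen for illustration. It floods one single-threaded lane with simulated session closes and then measures how long a ping waits, either in that same lane or in its own lane.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: not ZooKeeper's request processing.
public class PingLaneSketch {

    // Stand-in for the work of closing one expired session (assumed cost).
    private static void closeSession(long costMillis) {
        try {
            Thread.sleep(costMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Queues `expiredSessions` close tasks, then submits one ping and returns
    // how long the ping waited before being answered, in milliseconds.
    public static long pingLatencyMillis(boolean separatePingLane,
                                         int expiredSessions,
                                         long closeCostMillis) {
        ExecutorService requestLane = Executors.newSingleThreadExecutor();
        // Either share the request lane or give pings their own queue and thread.
        ExecutorService pingLane = separatePingLane
                ? Executors.newSingleThreadExecutor()
                : requestLane;
        try {
            for (int i = 0; i < expiredSessions; i++) {
                requestLane.submit(() -> closeSession(closeCostMillis));
            }
            long start = System.nanoTime();
            Callable<Long> ping = () -> System.nanoTime();
            Future<Long> pong = pingLane.submit(ping);
            long answeredAt = pong.get(30, TimeUnit.SECONDS);
            return TimeUnit.NANOSECONDS.toMillis(answeredAt - start);
        } catch (Exception e) {
            throw new IllegalStateException("ping was not answered", e);
        } finally {
            requestLane.shutdownNow();
            if (pingLane != requestLane) {
                pingLane.shutdownNow();
            }
        }
    }

    public static void main(String[] args) {
        // 1000 simulated closes at an assumed 2 ms each.
        System.out.println("shared lane ping latency (ms):   "
                + pingLatencyMillis(false, 1000, 2));
        System.out.println("separate lane ping latency (ms): "
                + pingLatencyMillis(true, 1000, 2));
    }
}
```

In the shared-lane case the ping is answered only after every queued close has run, so its latency grows with the size of the stampede. With its own lane it does not wait on them. Whether pings can safely bypass request ordering in the real server is the open question raised above, and this sketch does not answer it.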

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
