[ https://issues.apache.org/jira/browse/STORM-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049003#comment-15049003 ]

ASF GitHub Bot commented on STORM-1376:
---------------------------------------

Github user redsanket commented on the pull request:

    https://github.com/apache/storm/pull/933#issuecomment-163330056
  
    @revans2 I launched around 5 topologies across 5 supervisor nodes and 
could see a lot of connections being made to ZooKeeper and event processing 
slowing down, but I could not observe ZooKeeper slowing down to the point 
that it could not respond. I will launch more topologies to make sure it 
really slows down. I realized that in this scenario it is only performing 
reads, and since ZooKeeper connections are thread safe, I presume a single 
connection for reading that information would be sufficient. An interesting 
point came up around synchronization: looking at the localizer code, 
downloads to the supervisor are performed under a lock but updates are not, 
so to support nimbus HA, checkForUpdate and checkForDownload might have to 
be synchronized. Will confirm by running it with more topologies. 
Thanks for the review.
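
The synchronization concern above can be sketched as follows. This is a simplified stand-in, not Storm's actual Localizer code: the class name, fields, and method bodies here are hypothetical, and only the method names checkForUpdate and checkForDownload come from the comment. The idea is that both the download path and the update-check path guard the same lock, so an update check cannot race an in-flight download:

```java
// Simplified sketch of the proposed fix: guard both the download path and
// the update path with one shared lock. Class and field names are
// illustrative, not Storm's real implementation.
public class LocalizerSketch {
    private final Object blobLock = new Object();
    private int downloads = 0;
    private int updates = 0;

    public void checkForDownload(String blobKey) {
        synchronized (blobLock) {
            // here: download the blob identified by blobKey to local storage
            downloads++;
        }
    }

    public void checkForUpdate(String blobKey) {
        synchronized (blobLock) {
            // here: compare the local blob version with the remote one
            // and refresh if stale; cannot overlap with a download above
            updates++;
        }
    }

    public int getDownloads() { return downloads; }
    public int getUpdates() { return updates; }
}
```

Using a single shared lock (rather than synchronizing each method on its own monitor) is what makes the two paths mutually exclusive, which is the property the comment says is currently missing for updates.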


> ZK Becoming deadlocked with zookeeper_state_factory
> ---------------------------------------------------
>
>                 Key: STORM-1376
>                 URL: https://issues.apache.org/jira/browse/STORM-1376
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.11.0
>            Reporter: Daniel Schonfeld
>            Assignee: Sanket Reddy
>            Priority: Blocker
>
> Since the introduction of blobstore and pacemaker we've noticed that when 
> using nimbus with the new zookeeper_state_factory backing cluster state 
> module, some of our ZK nodes become unresponsive and show an increasing 
> number of outstanding requests (per the STAT 4-letter command).
> Terminating storm supervisors and nimbus usually gets ZooKeeper to realize 
> after a few minutes that those connections are dead and to become responsive 
> again.  In some extreme cases we have to kill the ZK node and bring it back 
> up.
> Our topologies ran across ~10 supervisor nodes, each having about 
> 400-500 executors. 
> I mention the number of executors because I am not sure whether someone 
> mistakenly made each executor start sending heartbeats instead of each 
> worker, which might possibly be the reason for this slowdown.
> Final note: if someone can jot down a few ideas of why this might be 
> happening, I'd be more than happy to dig further into the storm code and 
> submit a PR myself.  But I need some hint or direction of where to go with 
> this...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
