Daniel Schonfeld created STORM-1376:
---------------------------------------

             Summary: ZK Becoming deadlocked with zookeeper_state_factory
                 Key: STORM-1376
                 URL: https://issues.apache.org/jira/browse/STORM-1376
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 0.11.0
            Reporter: Daniel Schonfeld


Since the introduction of blobstore and pacemaker, we've noticed that when 
running nimbus with the new zookeeper_state_factory as the backing cluster 
state module, some of our ZK nodes become unresponsive and show an increasing 
number of outstanding requests (as reported by the "stat" four-letter command).
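
For anyone trying to reproduce this, here is a minimal sketch of how the 
outstanding-request count could be polled from a ZK node. It assumes the 
default client port 2181, that four-letter commands are enabled on the server, 
and that the reply contains an "Outstanding: N" line; the function names are 
made up for illustration and are not part of storm or zookeeper.

```python
import re
import socket


def four_letter_command(host, port=2181, command=b"stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply.

    Assumes the server has four-letter commands enabled on its client port.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_outstanding(stat_output):
    """Return the value of the 'Outstanding:' line, or None if it is absent."""
    match = re.search(r"^Outstanding:\s*(\d+)\s*$", stat_output, re.MULTILINE)
    return int(match.group(1)) if match else None
```

Calling parse_outstanding(four_letter_command(host)) for each ensemble member 
every few seconds should show whether the count keeps climbing on one node 
while the others stay near zero. A hung node may not answer at all, so the 
socket timeout is worth keeping.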

Terminating the storm supervisors and nimbus usually gets zookeeper to realize, 
after a few minutes, that those connections are dead and to become responsive 
again. In some extreme cases we have to kill the affected ZK node and bring it 
back up.

Our topologies ran across ~10 supervisor nodes, each with about 400-500 
executors.

I mention the number of executors because I am not sure whether someone 
mistakenly made each executor send heartbeats, instead of each worker, which 
might be the reason for this slowdown.

Final note: if someone can jot down a few ideas about why this might be 
happening, I'd be more than happy to dig further into the storm code and submit 
a PR myself. But I need some hint or direction on where to go with this...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
