Daniel Schonfeld created STORM-1376:
---------------------------------------
Summary: ZK Becoming deadlocked with zookeeper_state_factory
Key: STORM-1376
URL: https://issues.apache.org/jira/browse/STORM-1376
Project: Apache Storm
Issue Type: Bug
Components: storm-core
Affects Versions: 0.11.0
Reporter: Daniel Schonfeld
Since the introduction of blobstore and pacemaker we've noticed that when using
nimbus with the new zookeeper_state_factory backing cluster state module, some
of our ZK nodes become unresponsive and show an increasing number of
outstanding requests (as reported by the STAT 4-letter command).
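For anyone trying to reproduce this, here is a minimal sketch of how the outstanding-request count could be read from a ZK node. It assumes the node answers the "stat" four-letter command on the client port and that the reply contains a line of the form "Outstanding: N". The function names and the host in the usage note are hypothetical, not part of Storm or ZooKeeper.

```python
import socket


def parse_outstanding(stat_output: str) -> int:
    """Extract the 'Outstanding' request count from the text of a `stat` reply."""
    for line in stat_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Outstanding":
            return int(value.strip())
    raise ValueError("no 'Outstanding' line found in stat output")


def query_stat(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the `stat` four-letter command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            # The server closes the connection once the reply is complete.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling parse_outstanding(query_stat("zk1.example.com")) every few seconds (the host name is a placeholder) would show whether the count keeps climbing on a given node. An unresponsive node may simply time out instead of replying.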
Terminating the storm supervisors and nimbus usually lets zookeeper realize,
after a few minutes, that those connections are dead and become responsive again.
In some extreme cases we have to kill the affected ZK node and bring it back up.
Our topologies ran across ~10 supervisor nodes, each with about 400-500
executors.
I mention the number of executors because I am not sure whether someone, by
mistake, made each executor start sending heartbeats instead of each worker,
which might be the reason for this slowdown.
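To put rough numbers on that suspicion, the sketch below compares the heartbeat write rate under the two scenarios. Only the supervisor count and the executors-per-supervisor figure come from the numbers above. The workers-per-supervisor value and the heartbeat interval are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def heartbeat_writes_per_sec(writers: int, interval_secs: float) -> float:
    """Average heartbeat writes per second if each writer reports once per interval."""
    return writers / interval_secs


supervisors = 10                # from the report (~10 supervisor nodes)
executors_per_supervisor = 450  # midpoint of the reported 400-500
workers_per_supervisor = 4      # ASSUMPTION: not stated in the report
interval_secs = 3.0             # ASSUMPTION: illustrative heartbeat interval

# Scenario A: every executor sends its own heartbeat.
per_executor = heartbeat_writes_per_sec(
    supervisors * executors_per_supervisor, interval_secs
)

# Scenario B: only each worker sends a heartbeat.
per_worker = heartbeat_writes_per_sec(
    supervisors * workers_per_supervisor, interval_secs
)

print(f"per-executor heartbeats: {per_executor:.1f} writes/s")
print(f"per-worker heartbeats:   {per_worker:.1f} writes/s")
```

Under these assumed values, per-executor heartbeats would produce roughly two orders of magnitude more writes than per-worker heartbeats. Whether that is what actually happens would have to be confirmed in the storm code.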
Final note: if someone can jot down a few ideas about why this might be
happening, I'd be more than happy to dig further into the storm code and submit
a PR myself. But I need some hint or direction of where to go with this...