Github user clockfly commented on the pull request:
https://github.com/apache/storm/pull/268#issuecomment-60870953
@tedxia
I got a chance to chat with Ted online. In summary, he is describing the
following case (worker A -> worker B):
1. B dies
2. after zk session timeout, zk knows B is dead
3. A initiates the reconnection process to B. By default, it retries at most
300 times (the total retry time should be longer than 120 seconds, based on
this comment in the config):
```
# Since nimbus.task.launch.secs and supervisor.worker.start.timeout.secs are 120,
# other workers should also wait at least that long before giving up on
# connecting to the other worker.
```
4. zk is under heavy load (consider a zk tree with 100 thousand nodes and
many, many watchers) and may take minutes to notify A that B is dead.
5. A doesn't get the notification from zk in time. After 300 connection
retries the reconnection fails and A throws, which causes the worker to exit.
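
To make the timing in steps 3-5 concrete, here is a rough sketch of the race between A's retry budget and the zk notification. Only the 300 retries and the 120-second floor come from the discussion above; the 100 ms / 1000 ms wait bounds, the doubling backoff, and both function names are assumptions for illustration, not Storm's actual reconnect code. The time spent on each connection attempt itself is ignored.

```python
def retry_budget_ms(max_retries, min_wait_ms, max_wait_ms):
    """Total time A waits between attempts before giving up.

    Assumes a doubling backoff capped at max_wait_ms (hypothetical policy).
    """
    total = 0
    wait = min_wait_ms
    for _ in range(max_retries):
        total += wait
        wait = min(wait * 2, max_wait_ms)
    return total


def worker_a_outcome(budget_ms, zk_notify_delay_ms):
    """What happens to A, given how long zk takes to report that B is dead."""
    if zk_notify_delay_ms > budget_ms:
        # Retries run out first: the reconnect throws and the worker exits.
        return "exits"
    # zk reports B's death while A is still retrying.
    return "notified in time"


budget = retry_budget_ms(max_retries=300, min_wait_ms=100, max_wait_ms=1000)
print(budget)                             # 297500 ms, above the 120 s floor
print(worker_a_outcome(budget, 60_000))   # zk answers within one minute
print(worker_a_outcome(budget, 360_000))  # zk takes six minutes
```

Under these assumed wait bounds A keeps trying for about five minutes, so the failure in step 5 only shows up when zk is slower than that.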
Basically, two questions were asked. First, can we ensure that zookeeper is
responsive (< 1 minute)? Second, if the worker doesn't get an update about B
from zookeeper after 300 reconnection retries, should we exit the worker or
let the worker continue to work?
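
The second question comes down to a policy choice at the moment the retries run out. A minimal sketch of the two options, with every name hypothetical (this is not code from the pull request):

```python
from enum import Enum


class ExhaustedPolicy(Enum):
    EXIT_WORKER = "exit_worker"    # current behaviour described in step 5
    KEEP_RUNNING = "keep_running"  # alternative raised in the question


def on_retries_exhausted(policy, zk_reported_peer_dead):
    """Decide what worker A does once all reconnection retries have failed."""
    if zk_reported_peer_dead:
        # A already knows B is gone, so there is nothing left to reconnect to.
        return "drop connection"
    if policy is ExhaustedPolicy.EXIT_WORKER:
        # No word from zk yet: treat the failure as fatal and exit.
        return "exit worker"
    # No word from zk yet: stay alive and wait for the update.
    return "keep worker running"


print(on_retries_exhausted(ExhaustedPolicy.EXIT_WORKER, False))
print(on_retries_exhausted(ExhaustedPolicy.KEEP_RUNNING, False))
```

The two policies only differ in the case where zk has not yet told A about B, which is exactly the slow-zk case from step 4.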