Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-60870953

@tedxia

I got a chance to chat with Ted online. In summary, he is describing the following case (worker A -> worker B):

1. B dies.
2. After the zk session timeout, zk knows B is dead.
3. A initiates the reconnection process to B. By default, it retries at most 300 times. (The total retry time should be longer than 120 seconds, based on this comment in the config.)
```
# Since nimbus.task.launch.secs and supervisor.worker.start.timeout.secs are 120, other workers should also wait at least that long before giving up on connecting to the other worker.
```
4. zk is under heavy load (consider a zk tree with 100 thousand nodes and many, many watchers) and may take minutes to notify A that B is dead.
5. A does not get the notification from zk in time. After 300 connection retries, the reconnection fails and A throws, which causes the worker to exit.

Basically, two questions are being asked. First, can we ensure that zookeeper is responsive (< 1 minute)? Second, if a worker does not get the update about B from zookeeper after 300 reconnection retries, should the worker exit, or should it continue to work?
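
To make the race in steps 3 to 5 concrete, here is a rough sketch of how long worker A keeps retrying compared with how long zk takes to notify it. It is not Storm's actual Netty client code: the capped exponential backoff, the 100 ms base sleep, the 1000 ms cap, and the function names `retry_budget_seconds` and `worker_a_outcome` are all assumptions for illustration. Only the 300-retry limit comes from the discussion above.

```python
def retry_budget_seconds(max_retries=300, base_sleep_ms=100, max_sleep_ms=1000):
    """Total time A spends sleeping between reconnection attempts.

    Assumes a capped exponential backoff (hypothetical policy):
    sleep = min(base * 2**attempt, cap) for each attempt.
    """
    total_ms = 0
    for attempt in range(max_retries):
        total_ms += min(base_sleep_ms * (2 ** attempt), max_sleep_ms)
    return total_ms / 1000.0


def worker_a_outcome(zk_notify_delay_s, budget_s):
    """What happens to worker A once B has died.

    zk_notify_delay_s: seconds until zk tells A that B is dead.
    budget_s: seconds A keeps retrying before it gives up and throws.
    """
    if zk_notify_delay_s <= budget_s:
        # A learns in time that B is gone and can stop reconnecting.
        return "stops retrying"
    # Retries are exhausted first: A throws and the worker exits.
    return "worker exits"


budget = retry_budget_seconds()
print(budget)                         # 297.5
print(worker_a_outcome(60, budget))   # stops retrying
print(worker_a_outcome(600, budget))  # worker exits
```

Under these assumed values the retry budget is just under five minutes, so a zk notification that arrives within a minute is comfortably in time, while one that takes "minutes" under heavy load can lose the race. That is the situation the second question is about.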