Github user clockfly commented on the pull request:

    https://github.com/apache/storm/pull/268#issuecomment-60870953
  
    @tedxia 
    
    I got a chance to chat with Ted online. In summary, he is describing the 
following case (worker A -> worker B):
    1. B dies.
    2. After the zk session timeout, zk knows B is dead.
    3. A starts the reconnection process to B. By default, it retries at most 
300 times. (The total retry period should be longer than 120 seconds, based on 
this comment in the config: "# Since nimbus.task.launch.secs and 
supervisor.worker.start.timeout.secs are 120, other workers should also wait 
at least that long before giving up on connecting to the other worker.")
    4. zk is under heavy load (consider a zk tree with 100 thousand nodes 
and a great many watchers) and may take minutes to notify A that B is dead.
    5. A does not get the notification from zk before the 300 connection 
retries are used up, so the reconnection fails and A throws, which causes the 
worker to exit.
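
    The race in steps 3 to 5 can be sketched as a small model. This is only 
an illustration, not Storm's actual client code: the class name, the 
simulate() helper, and the fixed 1 second wait between retries are all 
assumptions made for the example. Only the 300 retry limit comes from the 
case above.

```java
import java.time.Duration;

public class ReconnectRace {
    /** Possible outcomes of worker A's reconnection attempt to dead worker B. */
    enum Outcome { ABORTED_BY_ZK_UPDATE, RETRIES_EXHAUSTED }

    /**
     * Hypothetical model: A retries at a fixed interval, up to maxRetries times.
     * If the zk notification arrives before the retry budget is used up, A can
     * stop reconnecting; otherwise the retries run out first and A throws.
     */
    static Outcome simulate(int maxRetries, Duration waitPerRetry, Duration zkNotifyDelay) {
        Duration retryBudget = waitPerRetry.multipliedBy(maxRetries);
        return zkNotifyDelay.compareTo(retryBudget) < 0
                ? Outcome.ABORTED_BY_ZK_UPDATE
                : Outcome.RETRIES_EXHAUSTED;
    }

    public static void main(String[] args) {
        Duration wait = Duration.ofSeconds(1); // assumed interval, not a Storm default

        // zk is responsive: the update reaches A well inside the retry budget.
        System.out.println(simulate(300, wait, Duration.ofSeconds(30)));

        // zk is overloaded: the update takes longer than 300 retries, so A exits.
        System.out.println(simulate(300, wait, Duration.ofMinutes(10)));
    }
}
```

    With these assumed numbers the retry budget is 300 seconds, so a zk delay 
of a few minutes is enough to land in the second case.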
    
    Basically, two questions are being asked. First, can we ensure that 
zookeeper is responsive (< 1 minute)? Second, if the worker does not get an 
update about B from zookeeper after 300 reconnection retries, should we exit 
the worker or let the worker continue to work?
    
    
    
    



