Github user kshri23 commented on the issue:

    https://github.com/apache/trafficserver/pull/872
  
    James, 
    I believe this patch addresses more than a symptom: it fixes a fundamental 
flaw in the startup code, namely a race condition. As I mentioned in the bug 
description, this issue is not caused by the time required for the initial 
message exchange. It is caused by TS-4646, where repeated and unnecessary 
messages are sent at an interval that exactly matches mgmt_read_timeout. 
Agreed, fixing TS-4646 would keep us from hitting this. But the design of 
waiting for 10k messages before yielding is still flawed. We cannot do that 
because traffic_cop expects a few things from traffic_server.
    
    The only reason I addressed the issue this way is the precedent set by 
'timeout' in the same class: initially, timeout is set to '0', and once startup 
is complete, it is set back to the configured value.
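
    To illustrate that precedent, here is a minimal sketch, assuming the same 
pattern is applied to the message limit. The class and member names 
(MessagePump, batch_, startup_complete) are hypothetical and are not taken from 
the trafficserver source. It only shows the idea of a conservative startup 
value that is replaced by the configured one once startup is done.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for the class under discussion; not the real ATS code.
class MessagePump {
public:
  MessagePump(int configured_timeout, std::size_t configured_batch)
    : configured_timeout_(configured_timeout), configured_batch_(configured_batch)
  {
    assert(configured_batch > 0);
  }

  void enqueue(std::string msg) { queue_.push_back(std::move(msg)); }

  // Handle at most batch_ messages, then return so the caller can yield.
  std::size_t drain()
  {
    std::size_t handled = 0;
    while (!queue_.empty() && handled < batch_) {
      queue_.pop_front();
      ++handled;
    }
    return handled;
  }

  // Once startup is complete, switch to the configured values.
  void startup_complete()
  {
    timeout_ = configured_timeout_;
    batch_   = configured_batch_;
  }

  int timeout() const { return timeout_; }

private:
  std::deque<std::string> queue_;
  int configured_timeout_;
  std::size_t configured_batch_;
  int timeout_       = 0; // startup value: do not block on reads
  std::size_t batch_ = 1; // startup value (assumed): yield after every message
};
```

    With this shape, the startup path yields after each message, and the 
larger configured limit only applies after startup_complete() is called.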
    
    I am confused by your reference to TS-4646. As I mentioned there, TS-4646 
should happen all the time, and I think it does. It is easy to verify that by 
enabling debug logs. It's just that TS-4646 does not always result in TS-4735. 
It is a race condition.
    
    However, on our VMs it happened all the time. We used this patch to solve 
the issue, and it seems to have been working for the past few months. Please 
let me know if you have any concerns.


