mikeV02 opened a new issue #6377:
URL: https://github.com/apache/trafficcontrol/issues/6377


   ## This Bug Report affects these Traffic Control components:
   - Traffic Monitor
   
   ## Current behavior:
   In an optimistic quorum formed by three TMs, when a single TM detects an ATS 
server as down, its report at `/publish/CrStates` flaps between available and 
unavailable. This results in HTTP 503 responses from Traffic Router whenever it 
checks that TM at an instant when it reports unavailable. The TM in question 
seems to be disregarding its peers' reports of available.
   
   Looking deeper, I noticed that the flapping of `/publish/CrStates` is just a 
consequence of another failure that occurs when the TM checks its peers. 
`/publish/PeerStates` also flaps between available and 
unavailable for both of its peers. I took some packet captures of the calls to 
`/publish/CrStates?raw` on its peers, and they actually return an 
"available" state for the cache, but somewhere in the TM that detects the ATS 
as down, the local copy of the peers' states is being changed to unavailable.
   
   Following the code, it seems the bug is somewhere in 
`traffic_monitor/peer/peer.go` or `traffic_monitor/manager/manager.go`. I could 
not pinpoint the exact function where it fails, as the variable names are a bit 
cryptic and I don't have much experience reading Go.
   
   ## Expected behavior:
   In an optimistic quorum, a TM that detects an ATS as down should 
always take the optimistic value reported by its peers. If the other two TMs 
report the ATS as available, the TM in question should also report it as available.
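
As a rough illustration of the expected behavior, here is a minimal Go sketch of an optimistic combination of local and peer results. It assumes that "optimistic" means a single available report (local or from any peer) is enough to publish the cache as available. `combineOptimistic` is a hypothetical helper written for this report, not a function from the Traffic Monitor code base.

```go
package main

import "fmt"

// combineOptimistic is a hypothetical helper (not Traffic Monitor code).
// It returns true if the local check or any peer reports the cache as
// available, which is the assumed meaning of an optimistic quorum.
func combineOptimistic(local bool, peers []bool) bool {
	if local {
		return true
	}
	for _, peerAvailable := range peers {
		if peerAvailable {
			return true
		}
	}
	return false
}

func main() {
	// The scenario from this report: the local TM sees the ATS as down,
	// both peers see it as available.
	fmt.Println(combineOptimistic(false, []bool{true, true}))
	// Everyone agrees the cache is down.
	fmt.Println(combineOptimistic(false, []bool{false, false}))
}
```

Under that assumption, the first call yields `true`, so the published state should stay available for as long as the peers keep reporting available.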
   
   ## Steps to reproduce:
   1. Deploy an optimistic quorum of at least 3 TMs
   2. Simulate a connection drop between a single TM and an ATS server (e.g. 
with a firewall)
   3. Look at `/publish/CrStates` for this TM and see the state flap between 
available and unavailable
   4. Look at `/publish/PeerStates` and see the state flap between available and 
unavailable
   5. Make several streaming requests against TR (curl or browser stream)
   6. See TR also flap between successful requests and HTTP 503 errors (this 
propagates from the flaps in the affected TM)
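
To make the flapping in steps 3 and 4 easier to quantify, the sketch below counts state changes across a series of saved `/publish/CrStates` responses. It is only an illustration: the JSON shape (`caches` keyed by cache name, each with an `isAvailable` boolean) is an assumption about the response body, `edge-01` is a made-up cache name, and `cacheAvailable` and `countFlaps` are hypothetical helpers. Fetching the bodies (for example with `net/http` in a polling loop) is left out so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crStates mirrors only the part of the response this sketch needs.
// The "caches" and "isAvailable" field names are assumptions.
type crStates struct {
	Caches map[string]struct {
		IsAvailable bool `json:"isAvailable"`
	} `json:"caches"`
}

// cacheAvailable extracts the availability of one cache from a response body.
func cacheAvailable(body []byte, cache string) (bool, error) {
	var s crStates
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	c, ok := s.Caches[cache]
	if !ok {
		return false, fmt.Errorf("cache %q not in response", cache)
	}
	return c.IsAvailable, nil
}

// countFlaps returns how many times consecutive samples differ.
func countFlaps(samples []bool) int {
	flaps := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			flaps++
		}
	}
	return flaps
}

func main() {
	// Stand-ins for bodies captured from successive polls of one TM.
	bodies := []string{
		`{"caches":{"edge-01":{"isAvailable":true}}}`,
		`{"caches":{"edge-01":{"isAvailable":false}}}`,
		`{"caches":{"edge-01":{"isAvailable":true}}}`,
	}
	var samples []bool
	for _, b := range bodies {
		up, err := cacheAvailable([]byte(b), "edge-01")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		samples = append(samples, up)
	}
	fmt.Println("flaps:", countFlaps(samples))
}
```

A healthy TM in an optimistic quorum would be expected to show zero flaps for a cache its peers consistently report as available.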
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

