rob05c commented on pull request #6017:
URL: https://github.com/apache/trafficcontrol/pull/6017#issuecomment-879532832


   > > In practice, with a 1s cache, it's extremely unlikely
   > I disagree. I think the race condition is extremely likely to occur
   
   > > We could handle that race by adding the cache time to the Update Status
   > > endpoint
   > Different TOs could all have different cache_ms settings, so this doesn't
   > really solve the race.
   
   When we change /update_status to be a timestamp (which is on the short
   roadmap, to solve a much bigger race that occurs frequently today), t3c can
   request the data it needs with a `Cache-Control: max-age=` derived from the
   queue timestamp: the seconds elapsed since the queue time.
   
   By sending max-age, the server's cache_ms becomes irrelevant: no TO will
   serve its cache if it's older than the queue time.
   
   Does that address your concern?
   
   > I appreciate the intent of making TO more scalable by adding things like
   > timed caches and RWR, but I'm not really sure it's worth the risk of
   > sacrificing our data consistency.
   
   We're talking about 100x more requests per second. It's the difference
   between 2000 caches getting their config every 15 minutes -- which frequently
   fails, so config deployment actually takes an hour or more -- and getting
   config in under 1 minute. That's the goal here: making TO able to handle
   fast cache config.
   
   I am sure it's worth being able to deploy cache config in under a minute 
instead of an hour. But with the above, it's not sacrificing any consistency.
   
   > It seems far more safe and scalable to implement something like Cache
   > Config Snapshots
   
   It's been an ATC goal for years to remove logic from the Traffic Ops
   monolith, to make ATC more scalable, more horizontal, and safer, and to make
   it possible to canary-deploy changes. Adding more logic to Traffic Ops
   instead of removing it is fundamentally opposed to that goal.
   
   Not to mention the data consistency problems with snapshot blobs. Putting
   denormalized blobs in the database creates far more consistency problems,
   such as client and server compatibility. Those problems are often either
   unfixable, or else very slow, dangerous, and bug-prone to "fix" by modifying
   the blob before serving it.
   
   > Rather than complicate the entire API by adding a new layer that every
   > endpoint has to go through
   
   Middleware is a fundamental part of most services: things like gzip, auth,
   and adding headers. Traffic Ops has a pretty solid system for doing these
   things easily, which this PR uses. That's pretty standard, and doesn't really
   complicate things.
   
   > We also have the IMS changes you recently added to t3c, and I'm sure that
   > will help improve performance a bit.
   
   IMS lets us avoid unnecessarily transferring data when changes haven't
   occurred. But it actually amplifies the scalability problem when changes
   have occurred, because when a change occurs, thousands of caches will all
   request at once.
   
   Likewise, having only Read-While-Writer without a small cache would put
   constant load on the DB: every time a request finished, another would start,
   continuously saturating the DB. And having only a small cache without RWR
   would result in thousands of caches requesting at once -- especially for a
   large, slow endpoint -- creating thousands of backend DB requests until the
   endpoint could be served and the cache established.
   
   In order to significantly reduce cache config time, we really need all 
three: IMS, RWR, and a very small cache.
   
   > something like Cache Config Snapshots
   
   I'd also note, all of these concerns - theoretical races, consistency
   issues, etc. - are symptoms of the Snapshot-Queue Automation problem (a.k.a.
   "Kill the Chicken"). Adding more snapshot blobs exacerbates the problem
   rather than fixing it.
   
   Once Server Snapshots are timestamps, and caches can request the real,
   non-blob data up to that timestamp, these problems cease to exist. Caches
   will check the server snapshot time, see that it has changed, and request
   the real data up to that new time.
   
   We need to fix the Snapshot-Queue Automation problem. It's the source of 
this and countless other operational problems.
   
   At this point, we've spent more work adding workarounds to the problem - 
Delivery Service Requests, Chicken Locks, more denormalized blobs - than it 
would've taken to just fix the problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

