rob05c commented on pull request #6017: URL: https://github.com/apache/trafficcontrol/pull/6017#issuecomment-879532832
> > In practice, with a 1s cache, it's extremely unlikely
>
> I disagree. I think the race condition is extremely likely to occur
>
> We could handle that race by adding the cache time to the Update Status endpoint

Different TOs could all have different cache_ms settings, so this doesn't really solve the race. When we change /update_status to be a timestamp (which is on the short roadmap, to solve a much bigger race that occurs frequently today), t3c can request the data it needs with a `Cache-Control: max-age=` derived from the queue timestamp. By sending max-age, the server's cache_ms is irrelevant: no TO will serve its cached copy if it's older than the queue time. Does that address your concern?
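For concreteness, here's a rough sketch of what that t3c request could look like. The package and function names are made up for illustration, not t3c's real code; the mechanism is just the standard HTTP `Cache-Control` request directive:

```go
// Sketch only: t3creq and fetchConfig are hypothetical names, not t3c's real code.
package t3creq

import (
	"fmt"
	"net/http"
	"time"
)

// fetchConfig requests config data with a max-age bound derived from the
// queue timestamp. Any cached entry generated before queueTime is older
// than max-age seconds, so no TO may answer from it, regardless of that
// server's own cache_ms setting.
func fetchConfig(client *http.Client, url string, queueTime time.Time) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	maxAge := int(time.Since(queueTime).Seconds())
	if maxAge < 0 {
		maxAge = 0 // clock skew; demand a fresh response
	}
	req.Header.Set("Cache-Control", fmt.Sprintf("max-age=%d", maxAge))
	return client.Do(req)
}
```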
> I appreciate the intent of making TO more scalable by adding things like timed caches and RWR, but I'm not really sure it's worth the risk of sacrificing our data consistency.

We're talking about 100x more requests per second. It's the difference between 2000 caches getting their config every 15 minutes -- which frequently fails, so config deployment actually takes an hour or more -- and getting config in under 1 minute. That's the goal here: making TO able to handle fast cache config. I am sure it's worth being able to deploy cache config in under a minute instead of an hour. And with the above, it's not sacrificing any consistency.

> It seems far more safe and scalable to implement something like Cache Config Snapshots

It's been an ATC goal for years to remove logic from the Traffic Ops monolith, in order to make ATC more scalable, more horizontal, safer, and possible to canary-deploy. Adding more logic to Traffic Ops instead of removing it is fundamentally opposed to that goal. Not to mention the data consistency problems with snapshot blobs: putting denormalized blobs in the database creates far more consistency problems, such as client and server compatibility -- problems which are either unfixable, or very slow, dangerous, and bug-prone to "fix" by modifying the blob before serving it.

> Rather than complicate the entire API by adding a new layer that every endpoint has to go through

Middleware is a fundamental part of most services: things like gzip, auth, and adding headers. Traffic Ops has a pretty solid system to easily do these things, which this PR uses. That's pretty standard, and doesn't really complicate things.

> We also have the IMS changes you recently added to t3c, and I'm sure that will help improve performance a bit.

IMS lets us avoid unnecessarily transferring data when changes haven't occurred. But it actually amplifies the scalability problem when changes have occurred, because when a change occurs, thousands of caches will all be requesting at once. Likewise, having only Read-While-Writer without a small cache would put constant load on the DB: every time a request finishes, another will start, constantly saturating the DB. And having only a small cache without RWR would result in thousands of caches requesting at once, especially for a large, slow endpoint, creating thousands of backend DB requests until the endpoint can be served and the cache established. In order to significantly reduce cache config time, we really need all three: IMS, RWR, and a very small cache.
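Roughly, the cache+RWR combination looks like this -- a minimal sketch using `golang.org/x/sync/singleflight` to stand in for the read-while-writer behavior; the type names are illustrative, not the PR's actual code:

```go
// Sketch: a tiny TTL cache combined with request coalescing. While one
// goroutine regenerates an expired entry, concurrent callers wait for its
// result instead of each hitting the DB (the read-while-writer behavior);
// the short TTL keeps the DB from being re-queried back-to-back.
package cache

import (
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type entry struct {
	val     []byte
	expires time.Time
}

type TinyCache struct {
	mu      sync.RWMutex
	entries map[string]entry
	ttl     time.Duration
	group   singleflight.Group
}

func New(ttl time.Duration) *TinyCache {
	return &TinyCache{entries: map[string]entry{}, ttl: ttl}
}

// Get returns the cached value for key, or coalesces concurrent callers
// onto a single call of fetch (e.g. the DB query) when the entry is stale.
func (c *TinyCache) Get(key string, fetch func() ([]byte, error)) ([]byte, error) {
	c.mu.RLock()
	e, ok := c.entries[key]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expires) {
		return e.val, nil
	}
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		val, err := fetch()
		if err != nil {
			return nil, err
		}
		c.mu.Lock()
		c.entries[key] = entry{val: val, expires: time.Now().Add(c.ttl)}
		c.mu.Unlock()
		return val, nil
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```

With a ~1s TTL, a burst of thousands of simultaneous requests costs at most one DB query per endpoint per second, instead of thousands.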
> something like Cache Config Snapshots

I'd also note, all of these concerns -- theoretical races, consistency issues, etc. -- are symptoms of the Snapshot-Queue Automation problem (a.k.a. "Kill the Chicken"). Adding more snapshot blobs exacerbates the problem rather than fixing it. Once we have Server Snapshots which are timestamps, and can request the real non-blob data up to that timestamp, these problems cease to exist: caches will check the server snapshot time, see that it has changed, and request the real data up to that new time (see the sketch below).

We need to fix the Snapshot-Queue Automation problem. It's the source of this and countless other operational problems. At this point, we've spent more work adding workarounds to it -- Delivery Service Requests, Chicken Locks, more denormalized blobs -- than it would've taken to just fix the problem.
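To make that timestamp-snapshot flow concrete, here's a hypothetical sketch of the cache-side loop. The endpoint path and response shape are invented; the point is that the snapshot is a watermark, not a blob:

```go
// Sketch only: the endpoint path and response shape are hypothetical.
package watcher

import (
	"encoding/json"
	"net/http"
	"time"
)

// watchSnapshot polls a timestamp-valued server snapshot and, when it
// advances, triggers a fetch of the real (non-blob) data at least as new
// as that timestamp.
func watchSnapshot(client *http.Client, toURL string, onChange func(time.Time)) error {
	var lastSeen time.Time
	for {
		resp, err := client.Get(toURL + "/server_snapshot_time") // hypothetical endpoint
		if err != nil {
			return err
		}
		var snapTime time.Time
		err = json.NewDecoder(resp.Body).Decode(&snapTime)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if snapTime.After(lastSeen) {
			// Snapshot advanced: request real data no older than snapTime,
			// e.g. with Cache-Control: max-age as sketched earlier.
			onChange(snapTime)
			lastSeen = snapTime
		}
		time.Sleep(time.Second)
	}
}
```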