rob05c commented on pull request #6017: URL: https://github.com/apache/trafficcontrol/pull/6017#issuecomment-880070899
> The amount of requests going to the DB would be a factor of how many TO instances there are, and if we're running roughly 10 (which is a lot IMO), TODB should be able to handle 10 concurrent requests for the same thing without tipping over. I know t3c requests all the data at once, so let's say 10 concurrent requests per route. That still doesn't seem like something that should overload TODB. What I'm getting at is that RWR without a small cache probably does improve performance quite a bit, but it doesn't have the data consistency issues that a timed cache has. Maybe RWR alone is enough?

We have around 3,000 caches, and `t3c` currently requests 19 endpoints. If we have 10 TO instances, that's 5,700 concurrent requests to each TO, and 57,000 concurrent requests to the database. The numbers I'm seeing suggest that, depending on the endpoint and the hardware, TO has trouble with as few as 10 requests per second, or even fewer. Our production backs this up: our production caches frequently get failures from TO for extended periods, long enough that the retries are exhausted and the `t3c` run fails. This is with a 15-minute `t3c` interval.

> RWR without a small cache probably does improve performance quite a bit

Both the theory I've explained and the practice I've observed suggest RWR alone helps, yes, but all three are necessary: IMS, RWR, and a small cache work together to solve distinct problems. Moreover, RWR is itself a cache; keeping it but removing the small cache doesn't make clients that ignore HTTP caching requirements correct. In fact, with a 1s cache, a great many TO requests take longer than that, so the RWR cache frequently holds data longer than the small cache does, not shorter.

> but it doesn't have the data consistency issues that a timed cache has

For consistency purposes, RWR is a cache. Clients always need to do proper HTTP. This is part of the HTTP standard; clients should be doing it today.
Any client doing a POST and an immediate GET without a `no-cache` or `max-age` is just lucky it works today; they're ignoring large parts of the HTTP spec. Moreover, ATC is a CDN. Our developers and operators understand HTTP caching. It's what we do.

> there is still the problem that if I make a change, I might not be able to read that change back for however long the cache time is.

A script that needs to do that quickly can send a `no-cache` in its request, and furthermore an `Age` header is sent indicating whether the response was cached. Again, this is part of HTTP, and clients that aren't doing it today are wrong.

> I think that is actually why the TO API tests are failing -- they can't read their writes back immediately, which fundamentally breaks the tests unless a 1 second sleep is inserted after all writes (which would be absurd). We definitely want this feature to be tested, so I wouldn't want us to just turn it off for the TO API tests.

Nobody is suggesting we turn off the tests or add sleeps. The tests that immediately POST and GET can send a `no-cache`, and this PR includes tests for the caching itself.

> but it does have it on by default.

> The default case shouldn't contain surprises

> new features like this should generally be disabled by default

I don't agree. Defaults should be sane. This is something that 99% of ATC operators will want enabled, it increases scalability 10-100x, and the concerns are all solved by standard HTTP caching mechanisms, which are well-known and well-understood; HTTP caching has been around for decades. A default that 1% of people might want to disable, and which decreases scalability 100x, is not sane.

I really, really don't like it, but this is something I'd be willing to compromise on. If @rawlinp and @alficles would be willing to not block this, I can live with a default I don't consider sane. Would that be acceptable, if it defaults to disabled?
And I'll fix the tests, of course; I just haven't had time, and the caching is and will be tested. Then it won't be enabled for ATC users unless they intentionally enable it. This is something Comcast needs. I'm willing to compromise, but I'd really like to come to a consensus here and find a way we can get what we need. It's difficult to overstate the operational value of being able to deploy cache config quickly, having the TP "clocks" clear in a minute instead of an hour. It's a massive operational win for the ops cost, human cost, safety, and the numerous other issues with how long deployment takes today and how often it fails.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
