rob05c commented on pull request #6017:
URL: https://github.com/apache/trafficcontrol/pull/6017#issuecomment-880070899


   > The amount of requests going to the DB would be a factor of how many TO 
instances there are, and if we're running roughly 10 (which is a lot IMO), TODB 
should be able to handle 10 concurrent requests for the same thing without 
tipping over. I know t3c requests all the data at once, so let's say 10 
concurrent requests per route. That still doesn't seem like something that 
should overload TODB. What I'm getting at is that RWR without a small cache 
probably does improve performance quite a bit, but it doesn't have the data 
consistency issues that a timed cached has. Maybe RWR alone is enough?
   
   We have around 3,000 caches, and `t3c` currently requests 19 endpoints. 
That's 3,000 × 19 = 57,000 requests in total; with 10 TO instances, that's 
5,700 concurrent requests to each instance, and 57,000 concurrent requests to 
the database.
   
   The numbers I'm seeing suggest that, depending on the endpoint and the 
hardware, TO struggles with as few as 10 requests per second, sometimes fewer.
   
   Our production experience backs this up. Our production caches frequently 
get failures from TO for extended periods, long enough that retries are 
exhausted and the `t3c` run fails outright. This is with a 15-minute `t3c` 
interval.
   
   > RWR without a small cache probably does improve performance quite a bit
   
   Both the theory I've explained and the practice I've observed suggest RWR 
alone helps, yes, but all three are necessary: IMS, RWR, and a small cache 
each solve a distinct problem. Moreover, RWR is itself a cache; keeping it 
while dropping the small cache doesn't make clients that ignore HTTP 
requirements safe. In fact, with a 1-second cache, a great many TO requests 
take longer than that to complete, so the RWR cache frequently holds data 
longer than the small cache would, not shorter.
   
   > but it doesn't have the data consistency issues that a timed cached has
   
   For consistency purposes, RWR is a cache. Clients always need to speak 
proper HTTP; that's part of the HTTP standard, and clients should be doing it 
today. Any client that does a POST and an immediate GET without a `no-cache` 
or `max-age` is just lucky it works today; it's ignoring large parts of the 
HTTP spec.
   
   Moreover, ATC is a CDN. Our developers and operators understand HTTP 
Caching. It's what we do.
   
   > there is still the problem that if I make a change, I might not be able to 
read that change back for however long the cache time is. 
   
   A script that needs to read its write back quickly can send a `no-cache` 
in its request, and furthermore an `Age` header is sent in the response 
indicating whether it was served from a cache. Again, this is part of HTTP, 
and clients are wrong not to be doing it today.
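   A sketch of the `Age` check, for illustration: RFC 7234 §5.1 has caches 
attach an `Age` header giving the response's age in seconds, while origin 
responses normally lack it, so a client can detect a cached answer by the 
header's presence (the helper name here is mine, not an ATC API):

   ```go
   package main

   import (
   	"fmt"
   	"net/http"
   )

   // servedFromCache reports whether a response passed through a cache.
   // Per RFC 7234 §5.1, caches add an Age header (seconds of freshness
   // consumed); responses straight from the origin normally omit it.
   func servedFromCache(h http.Header) bool {
   	return h.Get("Age") != ""
   }

   func main() {
   	cached := http.Header{}
   	cached.Set("Age", "1") // simulated cached response
   	fmt.Println(servedFromCache(cached))        // prints "true"
   	fmt.Println(servedFromCache(http.Header{})) // prints "false": fresh from origin
   }
   ```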
   
   >  I think that is actually why the TO API tests are failing -- they can't 
read their writes back immediately, which fundamentally breaks the tests unless 
a 1 second sleep is inserted after all writes (which would be absurd). We 
definitely want this feature to be tested, so I wouldn't want us to just turn 
it off for the TO API tests.
   
   Nobody is suggesting we turn off the tests or add sleeps. The tests that 
immediately POST and GET can send a `no-cache`, and this PR includes tests for 
the caching itself.
   
   > but it does have it on by default.
   
   > The default case shouldn't contain surprises
   
   > new features like this should generally be disabled by default
   
   I don't agree. Defaults should be sane. This is something that 99% of ATC 
operators will want enabled; it increases scalability 10-100x, and the 
concerns are all addressed by standard HTTP caching mechanisms, which are 
well known, well understood, and decades old. A default that 1% of operators 
might want to disable, and which decreases scalability 100x, is not sane.
   
   I really, really don't like it, but this is something I'd be willing to 
compromise on. If @rawlinp and @alficles would be willing to not block this, I 
can live with a default I don't consider sane.
   
   Would that be acceptable, if it defaults to disabled? I'll fix the tests, 
of course; I just haven't had time, and the caching is and will be tested. The 
feature then won't be enabled for ATC users unless they intentionally enable 
it.
   
   This is something Comcast needs. I'm willing to compromise, but I'd really 
like to reach a consensus here and find a way we can get what we need. It's 
difficult to overstate the operational value of being able to deploy cache 
config quickly, having the TP "clocks" clear in a minute instead of an hour. 
It's a massive operational win in ops cost, human cost, and safety, and it 
addresses numerous other issues with how long deployment takes today and how 
often it fails.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]