On Wed, 2024-11-06 at 09:40 +0100, Felix Moessbauer wrote:
> On Wed, 2024-11-06 at 07:19 +0100, Linus Nordberg wrote:
> > "MOESSBAUER, Felix" <[email protected]> wrote
> > Tue, 5 Nov 2024 09:09:21 +0000:
> > 
> > > Further, the rate limits should be precisely documented so
> > > clients / caching proxies can adapt to them. The limits also
> > > need to match the Retry-After header in the 429 responses.
> > > Currently s.d.o responds with Retry-After: 5 (seconds), which
> > > is far too short to overcome the limit.
> > 
> > Thanks for reporting this.
> > 
> > I've been planning on documenting the rate limiting somewhere
> > once it seems to behave reasonably well. It was added in a bit
> > of haste last weekend.
> > 
> > One thing I would hesitate to put in a document somewhere for a
> > human to read once and then base their implementation on is
> > numbers that are likely to change. What do you think about the
> > server setting X-RateLimit-Remaining in responses?
> 
> Hi,
> 
> that would be even better. In [1] there are more headers regarding
> rate limiting which could be considered. Maybe check these as
> well.
> 
> [1]
> https://www.ietf.org/archive/id/draft-polli-ratelimit-headers-02.html
> 
> > The Retry-After has been fixed in [this patchset][] which I will
> > try to get merged in a day or two.
> 
> Thanks! Currently apt does not obey this, but I'll try to add
> support.

Hi,

I'm currently working on supporting this in ... Feel free to get
involved. Especially the needed (random) request distribution seems
to be problematic from apt's perspective, but it is needed to
support rate limiting.

https://salsa.debian.org/apt-team/apt/-/merge_requests/383
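To make that concrete, this is roughly the client behavior I have in
mind (an illustrative Python sketch only; apt itself is C++, and
fetch_with_ratelimit is a made-up name, not apt code):

    import email.utils
    import random
    import time
    import urllib.request
    from datetime import datetime, timezone
    from urllib.error import HTTPError

    def fetch_with_ratelimit(url, max_attempts=5, base_backoff=1.0):
        """Fetch url, honoring Retry-After on 429 responses and
        adding random jitter so parallel clients do not retry in
        lockstep."""
        for attempt in range(max_attempts):
            try:
                return urllib.request.urlopen(url)
            except HTTPError as err:
                if err.code != 429 or attempt == max_attempts - 1:
                    raise
                header = err.headers.get("Retry-After")
                if header is None:
                    # no server hint: fall back to exponential backoff
                    delay = base_backoff * (2 ** attempt)
                elif header.isdigit():
                    # delta-seconds form, e.g. "Retry-After: 5"
                    delay = float(header)
                else:
                    # HTTP-date form
                    when = email.utils.parsedate_to_datetime(header)
                    remaining = when - datetime.now(timezone.utc)
                    delay = max(remaining.total_seconds(), 0.0)
                # randomize the wait so a fleet of clients spreads
                # out instead of hitting the server at the same
                # instant again
                time.sleep(delay + random.uniform(0.0, delay))

The same loop could also consult X-RateLimit-Remaining (or the
RateLimit headers from [1]) to slow down before the first 429 is
even hit.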
> Currently apt just retries with exponential backoff (1, 2, 4, ...
> seconds) and most people just recommend a maximum of 3 retries.
> Hence, this does not help at all to overcome the limit. But let's
> fix things one by one.
> 
> > [this patchset]:
> > https://salsa.debian.org/linus/dsa-puppet/-/compare/production...linus%2Fsnapshot-ratelimit?from_project_id=2990
> 
> > > If rate limiting were implemented correctly, downstream caches
> > > could properly cache the results and clients like apt could
> > > behave nicely. I further recommend using WAY higher request
> > > limits in combination with a moving-average limit on the
> > > amount of transferred data. That way, the cheap "is my cache
> > > still valid" requests could pass, while the heavier payload
> > > transfers are avoided. Also, clients could hit s.d.o without
> > > reduced transfer rates, hence reducing the number of open
> > > handles on the server.
> > 
> > What do you think would be a reasonable request limit?
> 
> That depends on what your server can handle. It's been 5 years
> since I worked in that field, so take the following with a grain
> of salt:
> 
> In general, as few requests as possible should reach the flask
> application, as it is not suited for high loads. Running it under
> gunicorn or another executor instead of directly improves things a
> bit, as you get multithreading. Anyways, the most important thing
> is to return correct cache headers from the flask app, so Varnish
> can cache the responses.

After re-visiting this, I noticed that probably the ONLY missing
piece in a performant snapshot.d.o is providing meaningful cache
headers for the redirects to static files on /archive. Currently
these redirects are served with a 302 and a 600s max-age. Serving
them with a 301 and either no max-age or a long one should fix
this. The corresponding code is here (a rough sketch of the change
I have in mind follows further down):

https://salsa.debian.org/snapshot-team/snapshot/-/blob/master/web/app/snapshot/views/archive.py?ref_type=heads#L96

As the rate limit is already applied only to cache misses (from
Varnish's perspective), most clients (and especially repeated
builds) are not expected to run into the limits at all.

> These cache hit rates should be monitored and should be close to
> 100% during normal operation (except for the intentionally direct
> calls). In contrast, large files (e.g. large / all debs) should
> be served directly (bypassing the cache and ideally also flask).
> For details, please also see [2].
> Finally, the storage backend needs to be fast - and by that I
> mean not a FUSE mount.

I just noticed that this is already implemented (after finding the
corresponding config file...). That's great news:

https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/production/modules/roles/templates/snapshot/snapshot.debian.org.conf.erb?ref_type=heads#L82
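Coming back to the /archive redirects, this is roughly what I mean
(an illustrative flask sketch, not the actual snapshot code;
resolve_to_farm_url is a made-up stand-in for the real database
lookup, and the exact max-age is up for discussion):

    from flask import Flask, redirect

    app = Flask(__name__)

    def resolve_to_farm_url(path):
        # hypothetical stand-in for the real lookup in the
        # snapshot database
        return "https://snapshot.debian.org/file/" + path

    @app.route("/archive/<path:path>")
    def archive_redirect(path):
        # 301 instead of the current 302: the mapping from a
        # snapshot timestamp to a file never changes
        resp = redirect(resolve_to_farm_url(path), code=301)
        # let Varnish and clients cache the redirect for a long
        # time instead of the current 600s
        resp.headers["Cache-Control"] = "public, max-age=31536000"
        return resp

With that, Varnish can answer repeated "is my cache still valid"
lookups without ever reaching the flask app.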
Best regards,
Felix

> Once all that is implemented, the limits can likely be much
> higher without risking DoS.
> 
> [2]
> https://tedboy.github.io/flask/generated/flask.send_from_directory.html
> 
> > We're currently limiting the number of requests per time unit
> > since that's what killed one of the servers last week. I
> > haven't looked into how to limit the amount of transferred data
> > yet.
> 
> Having this limit applied per IP is unfortunate for big companies
> (like Siemens), as all users are funneled through a couple of
> egress IPs.
> 
> Best regards,
> Felix Moessbauer
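PS: Regarding [2] above, in case the flask app ever has to serve
farm files itself, the pattern in a nutshell (a generic flask
sketch, not snapshot's actual code; FARM_ROOT is a made-up path):

    from flask import Flask, send_from_directory

    app = Flask(__name__)

    # hypothetical location of the file farm on local disk
    FARM_ROOT = "/srv/snapshot/farm"

    @app.route("/file/<path:name>")
    def serve_file(name):
        # streams the file, sets Content-Length and conditional
        # request headers, and guards against path traversal
        resp = send_from_directory(FARM_ROOT, name, conditional=True)
        resp.headers["Cache-Control"] = "public, max-age=31536000"
        return resp

Still, as said above, ideally those requests never reach flask at
all.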
-- 
Siemens AG, Technology
Linux Expert Center