On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgo...@gentoo.org> wrote:
> 
> > Hi,
> > 
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> > 
> > Problem
> > =======
> > Goose currently lacks proper limiting of submitted data.  The only
> > limiter currently in place is based on a unique submitter id that is
> > randomly generated at setup time and is in full control of the
> > submitter.  This only protects against accidental duplicates; it
> > can't protect against deliberate action.
> > 
> > An attacker could easily submit thousands (millions?) of fake
> > entries by issuing a lot of requests with different ids.  Creating
> > them is as trivial as using successive numbers.  The potential
> > damage includes:
> > 
> > - distorting the metrics to the point of being useless (even though
> >   some people consider them useless by design)
> > 
> > - submitting lots of arbitrary data to cause a DoS by growing
> >   the database until no disk space is left
> > 
> > - blocking a large range of valid user ids, making collisions with
> >   legitimate users more likely
> > 
> > I don't think it's worthwhile to discuss the motivation for doing
> > so: whether it would be someone wishing harm to Gentoo, disagreeing
> > with the project or merely wanting to try and see if it would work.
> > The case of the SKS keyservers teaches us that you can't leave holes
> > like this open for long, because someone will eventually abuse them.
> > 
> > 
> > Option 1: IP-based limiting
> > ===========================
> > The original idea was to set a hard limit on submissions per week
> > based on the IP address of the submitter.  This has (at least as far
> > as IPv4 is concerned) the advantages that:
> > 
> > - the submitter has limited control of their IP address (i.e.
> >   they can't just submit stuff using arbitrary addresses)
> > 
> > - the IP address range is naturally limited
> > 
> > - IP addresses have non-zero cost
> > 
> > This method could strongly reduce the number of fake submissions one
> > attacker could make.  However, it has a few problems too:
> > 
> > - a low limit would harm legitimate submitters sharing an IP address
> >   (i.e. behind NAT)
> > 
> > - it actively favors people with access to a large number of IP
> >   addresses
> > 
> > - it doesn't map cleanly to IPv6 (where some people may have just
> >   one IP address, and others may have whole /64 or /48 ranges)
> > 
> > - it may cause problems for users of anonymizing networks (and we
> >   want to encourage Tor usage for privacy)
> > 
> > All this considered, IP address limiting can't be used as the
> > primary method of preventing fake submissions.  However, I suppose
> > it could work as an additional DoS prevention, limiting the number
> > of submissions from a single address over short periods of time.
> > 
> > Example: if we limit to 10 requests an hour, then a single IP can be
> > used to manufacture at most 240 submissions a day.  This might be
> > sufficient to render the metrics unusable but should keep the
> > database reasonably safe.
> > 
> > 
> > Option 2: proof-of-work
> > =======================
> > An alternative of using a proof-of-work algorithm was suggested to
> > me yesterday.  The idea is that every submission has to be
> > accompanied by the result of some cumbersome calculation that can't
> > be trivially run in parallel or offloaded to dedicated hardware.
> > 
> > On the plus side, it would rely more on actual physical hardware
> > than on IP addresses provided by ISPs.  While it would be a waste of
> > CPU time and memory, doing it just once a week wouldn't be much
> > harm.
> > 
> > On the minus side, it would penalize people with weak hardware.
> > For example, 'time hashcash -m -b 28 -r test' gives:
> > 
> > - 34 s (-s estimated 38 s) on a Ryzen 5 3600
> > 
> > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > 
> > At the same time, it would still permit a lot of fake submissions.
> > For example, randomx [1] claims to require 2G of memory in fast
> > mode.  This would still allow me to use 7 threads.  If we adjusted
> > the algorithm to take ~30 seconds, that means 7 submissions every
> > 30 s, i.e. 20k submissions a day.
> > 
> > So in the end, while this is interesting, it doesn't seem like
> > a workable anti-spam measure.
> > 
> > 
> > Option 3: explicit CAPTCHA
> > ==========================
> > A traditional way of dealing with spam -- require every new system
> > identifier to be confirmed by solving a CAPTCHA (or a few
> > identifiers for one CAPTCHA).
> > 
> > The advantage of this method is that it requires real human work to
> > be performed, effectively limiting the ability to submit spam.
> > The disadvantage is that it is cumbersome for users, so many of them
> > will simply resign from participating.
> > 
> > 
> > Other ideas
> > ===========
> > Do you have any other ideas on how we could resolve this?
> > 
> > 
> > [1] https://github.com/tevador/RandomX
> > 
> > 
> > -- 
> > Best regards,
> > Michał Górny
> > 
> 
> Sadly, the problem with IP addresses (in this case) is that they are
> anonymous.  One can easily start an attack with thousands of IPs from
> all around the world.
> 
> One solution would be to introduce user accounts:
> - one needs to register with an email
Problem 1: you can trivially mass-create email addresses.

> - you can rate limit based on the client (not the IP)
> 
> For example, I have 200 servers.  I'd create one account, verify my
> email (maybe with a captcha too) and deploy a config with my token on
> all servers.  Then I'd set up a cron job on every server to submit
> stats.  A token can have some lifetime, and you could create a new one
> when the old one is about to expire.
> 
> If you discover I'm doing false reports, you'd block all my
> submissions.  I can still do fake submissions, but you'd need
> per-host verification to avoid that.

Problem 2: we can't really discover this, because the goal is to
protect users' privacy.  The best we can do is discover that someone is
submitting a lot from a single account (but are they legitimate?).  But
then, we can just block them.

But in the end, this has the same problem as CAPTCHA -- or maybe it's
even worse.  It requires additional effort from the users, making it
less likely that they will participate.  Furthermore, it requires them
to submit e-mail addresses, which they may consider PII.  Even if we
don't store the addresses permanently but only use them for initial
verification, users may still choose not to participate.

-- 
Best regards,
Michał Górny
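[Editorial sketch] To make the rate-limiting arithmetic in Option 1
concrete, here is a minimal fixed-window per-IP limiter in Python.
All names (FixedWindowLimiter, the example address) are made up for
illustration and are not goose's actual implementation; it just shows
how a 10-requests-per-hour cap bounds one address at 10 * 24 = 240
accepted submissions per day.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` submissions per address per time window."""

    def __init__(self, limit=10, window=3600):
        self.limit = limit      # max accepted submissions per window
        self.window = window    # window length in seconds
        # address -> (count in current window, window start timestamp)
        self.counts = defaultdict(lambda: (0, 0.0))

    def allow(self, addr, now=None):
        now = time.time() if now is None else now
        count, start = self.counts[addr]
        if now - start >= self.window:
            # window expired: start a fresh one
            count, start = 0, now
        if count >= self.limit:
            return False
        self.counts[addr] = (count + 1, start)
        return True


# One address hammering the endpoint every minute for a full day:
# 24 hourly windows x 10 accepted each = 240, matching the estimate.
limiter = FixedWindowLimiter(limit=10, window=3600)
accepted = sum(limiter.allow("192.0.2.1", now=t)
               for t in range(0, 86400, 60))
print(accepted)  # 240
```

A sliding window or token bucket would smooth out bursts at window
boundaries, but the hard daily bound stays the same.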