On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgo...@gentoo.org> wrote:
> 
> > Hi,
> > 
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> > 
> > Problem
> > =======
> > Goose currently lacks proper limiting of submitted data.  The only
> > limiter currently in place is based on a unique submitter id that is
> > randomly generated at setup time and fully under the submitter's
> > control.  This only protects against accidental duplicates; it can't
> > protect against deliberate action.
> > 
> > An attacker could easily submit thousands (millions?) of fake entries
> > by issuing a lot of requests with different ids.  Creating those ids is
> > as trivial as using successive numbers.  The potential damage includes:
> > 
> > - distorting the metrics to the point of being useless (even though
> > some people consider them useless by design).
> > 
> > - submitting lots of arbitrary data to cause DoS via growing
> > the database until no disk space is left.
> > 
> > - blocking a large range of valid user ids, making collisions with
> > legitimate users more likely.
> > 
> > I don't think it's worthwhile to discuss the motivation for doing so:
> > whether it would be someone wishing harm to Gentoo, disagreeing with
> > the project or merely wanting to try and see if it would work.  The case
> > of SKS keyservers teaches us that you can't leave holes like this open
> > for a long time because someone will eventually abuse them.
> > 
> > 
> > Option 1: IP-based limiting
> > ===========================
> > The original idea was to set a hard limit on submissions per week based
> > on the IP address of the submitter.  This has (at least as far as IPv4
> > is concerned) the following advantages:
> > 
> > - the submitter has limited control over his IP address (i.e. he can't
> > just submit data from arbitrary addresses)
> > 
> > - IP address range is naturally limited
> > 
> > - IP addresses have non-zero cost
> > 
> > This method could strongly reduce the number of fake submissions one
> > attacker could produce.  However, it has a few problems too:
> > 
> > - a low limit would harm legitimate submitters sharing an IP address
> > (e.g. behind NAT)
> > 
> > - it actively favors people with access to a large number of IP addresses
> > 
> > - it doesn't map cleanly to IPv6 (where some people may have just one IP
> > address, and others may have whole /64 or /48 ranges)
> > 
> > - it may cause problems for anonymizing network users (and we want to
> > encourage Tor usage for privacy)
> > 
> > All this considered, IP address limiting can't be used as the primary
> > method of preventing fake submissions.  However, I suppose it could work
> > as an additional DoS prevention measure, limiting the number of
> > submissions from a single address over short periods of time.
> > 
> > Example: if we limit to 10 requests an hour, then a single IP can be
> > used to manufacture at most 240 submissions a day.  This might be
> > sufficient to render the metrics unusable but should keep the database
> > reasonably safe.
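> >
> > For illustration, a minimal sketch of what such a limiter could look
> > like (in-memory fixed window, all names hypothetical; a real deployment
> > would more likely use expiring keys in Redis or similar):
> >
> >     import time
> >     from collections import defaultdict
> >
> >     WINDOW = 3600        # seconds per counting window
> >     MAX_PER_WINDOW = 10  # submissions allowed per IP per window
> >
> >     # ip -> [count, window_start]; a real implementation would also
> >     # have to prune old entries to bound memory use
> >     _counters = defaultdict(lambda: [0, 0.0])
> >
> >     def allow_submission(ip: str) -> bool:
> >         now = time.time()
> >         count, start = _counters[ip]
> >         if now - start >= WINDOW:
> >             _counters[ip] = [1, now]   # start a fresh window
> >             return True
> >         if count >= MAX_PER_WINDOW:
> >             return False               # over the hourly budget
> >         _counters[ip][0] += 1
> >         return True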
> > 
> > 
> > Option 2: proof-of-work
> > =======================
> > An alternative based on a proof-of-work algorithm was suggested to me
> > yesterday.  The idea is that every submission has to be accompanied by
> > the result of some cumbersome calculation that can't be trivially run
> > in parallel or offloaded to dedicated hardware.
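> >
> > As a rough illustration of the general idea, a naive hashcash-style
> > scheme could look like this (sketch only; note that this sha256 variant
> > parallelizes trivially, which is exactly why a memory-hard function
> > would be preferable):
> >
> >     import hashlib
> >     from itertools import count
> >
> >     def solve(challenge: bytes, bits: int = 28) -> int:
> >         # Find a nonce such that sha256(challenge + nonce) has at
> >         # least `bits` leading zero bits.
> >         target = 1 << (256 - bits)
> >         for nonce in count():
> >             h = hashlib.sha256(challenge + str(nonce).encode()).digest()
> >             if int.from_bytes(h, "big") < target:
> >                 return nonce
> >
> >     def verify(challenge: bytes, nonce: int, bits: int = 28) -> bool:
> >         # Checking a proof is a single hash, i.e. effectively free
> >         # for the server.
> >         h = hashlib.sha256(challenge + str(nonce).encode()).digest()
> >         return int.from_bytes(h, "big") < (1 << (256 - bits))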
> > 
> > On the plus side, it would rely more on actual physical hardware than on
> > IP addresses provided by ISPs.  While it would be a waste of CPU time
> > and memory, doing it just once a week wouldn't do much harm.
> > 
> > On the minus side, it would penalize people with weak hardware.
> > 
> > For example, 'time hashcash -m -b 28 -r test' gives:
> > 
> > - 34 s (38 s estimated via -s) on a Ryzen 5 3600
> > 
> > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > 
> > At the same time, it would still permit a lot of fake submissions.  For
> > example, randomx [1] claims to require 2G of memory in fast mode.  That
> > would still allow me to run 7 threads on my hardware.  If we adjusted
> > the algorithm to take ~30 seconds, that means 7 submissions every 30 s,
> > i.e. roughly 20k submissions a day.
> > 
> > So in the end, while this is interesting, it doesn't seem like
> > a workable anti-spam measure.
> > 
> > 
> > Option 3: explicit CAPTCHA
> > ==========================
> > A traditional way of dealing with spam -- require every new system
> > identifier to be confirmed by solving a CAPTCHA (or allow a few
> > identifiers per CAPTCHA).
> > 
> > The advantage of this method is that it requires real human work to be
> > performed, effectively limiting the ability to submit spam.
> > The disadvantage is that it is cumbersome for users, so many of them
> > will simply opt out of participating.
> > 
> > 
> > Other ideas
> > ===========
> > Do you have any other ideas on how we could resolve this?
> > 
> > 
> > [1] https://github.com/tevador/RandomX
> > 
> > 
> > --
> > Best regards,
> > Michał Górny
> > 
> 
> 
> Sadly, the problem with IP addresses (in this case) is that they are
> anonymous.  One can easily start an attack with thousands of IPs (from
> all around the world).
> 
> One solution would be to introduce user accounts:
> - one needs to register with an email

Problem 1: you can trivially mass-create email addresses.

> - you can rate limit based on the client (not the IP)
> 
> For example, I have 200 servers.  I'd create one account, verify my email
> (maybe with a captcha too) and deploy a config with my token on all
> servers.  Then I'd set up a cron job on every server to submit stats.  A
> token can have some lifetime and you could create a new one when the old
> one is about to expire.
> 
> If you discovered I was doing false reports, you'd block all my
> submissions.  I could still do fake submissions, but you'd need per-host
> verification to avoid that.
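>
> Roughly, such a token could be an expiry stamp signed by the server, for
> example (just a sketch, all names made up):
>
>     import hashlib
>     import hmac
>     import time
>
>     SECRET = b"known only to the goose server"
>     LIFETIME = 90 * 24 * 3600  # ~3 months, then the client renews
>
>     def issue_token(account_id: str) -> str:
>         expiry = int(time.time()) + LIFETIME
>         payload = f"{account_id}:{expiry}"
>         sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
>         return f"{payload}:{sig}"
>
>     def check_token(token: str) -> bool:
>         payload, _, sig = token.rpartition(":")
>         good = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
>         if not hmac.compare_digest(sig, good):
>             return False  # forged or corrupted token
>         expiry = int(payload.rsplit(":", 1)[1])
>         return time.time() < expiry  # expired tokens force re-issue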
> 

Problem 2: we can't really discover this because the goal is to protect
users' privacy.  The best we can do is to discover that someone is
submitting a lot from a single account (but are they legitimate?).
But then, we can just block them.

But in the end, this has the same problem as CAPTCHA -- or maybe it's
even worse.  It requires additional effort from the users, effectively
making it less likely for them to participate.  Furthermore, it requires
them to submit e-mail addresses, which they may consider PII.  Even if we
don't store them permanently but just use them for initial verification,
users could still choose not to participate.

-- 
Best regards,
Michał Górny
