Hi,

On 2020/05/21 11:48, Tomas Mozes wrote:
>
>
> On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgo...@gentoo.org
> <mailto:mgo...@gentoo.org>> wrote:
>
>     Hi,
>
>     TL;DR: I'm looking for opinions on how to protect goose from spam,
>     i.e. mass fake submissions.
>     Option 1: IP-based limiting
>     ===========================
>     The original idea was to set a hard limit of submissions per week
>     based
>     on IP address of the submitter.  This has (at least as far as IPv4 is
>     concerned) the advantages that:
>
>     - submitter has limited control of his IP address (i.e. he can't just
>     submit stuff using arbitrary data)
>
>     - IP address range is naturally limited
>
>     - IP addresses have non-zero cost
>
>     This method could strongly reduce the number of fake submissions one
>     attacker could devise.  However, it has a few problems too:
>
>     - a low limit would harm legitimate submitters sharing IP address
>     (i.e. behind NAT)
>
>     - it actively favors people with access to large number of IP
>     addresses
>
>     - it doesn't map cleanly to IPv6 (where some people may have just
>     one IP
>     address, and others may have whole /64 or /48 ranges)
>
So this gets tricky.  A single host could, as you say, have either a
/128 or possibly a whole /64.  ISPs are "encouraged" to use a single
/64 per connecting user on the access layer (it can technically be
link-local, but that seems to be frowned upon).  Generally you're then
encouraged to delegate a /56 to the router, but at the very least a
/60.  Some recommendations even state to delegate a /48 at this point.
That's outright crazy seeing that a /48 essentially boils down to
65536 individual LANs behind the router; a /56 is 256 LANs, which
frankly I reckon is adequate.  The only advantage of /48 is cleaner
boundary mapping onto the : separators.  This is OPINION.  I also use
"encouraged" since these are recommendations rather than hard
requirements.

Short version:  If you're willing to rate limit on larger blocks it
could work.  /64s are probably OK, but most hosts will typically have
a /128, so you'll effectively be limiting LANs, and switching IPs is
trivial since you'd have access to at least a /64 (roughly 18.45 *
10^18 addresses).

You could have multiple layers ... i.e.:

each /128 gets 1 or 2 submissions per day
each /64 gets 200/day
each /56 gets 400/day
each /48 gets 600/day

But now you need to keep bucketloads of data ... so a DoS on the rate
limiting mechanism itself becomes possible, unless you're happy to
limit the size of the tables and somehow discard entries that are at
low risk of exceeding their limits.
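
A minimal sketch of what such multi-level limiting might look like,
assuming Python, in-memory counters and a capped table so the limiter
itself can't be blown up.  The prefix lengths, limits and names here
are hypothetical, not goose's actual policy:

    import ipaddress
    import time
    from collections import OrderedDict

    # Hypothetical per-prefix daily limits -- not goose's actual policy.
    V6_LIMITS = {128: 2, 64: 200, 56: 400, 48: 600}
    V4_LIMITS = {32: 2}
    MAX_ENTRIES = 100_000    # cap the table so the limiter can't be DoS'd
    WINDOW = 24 * 3600       # seconds

    counters = OrderedDict() # (prefix, plen) -> (window_start, count)

    def allow(addr, now=None):
        """Accept a submission only if every enclosing prefix is under budget."""
        now = now or time.time()
        ip = ipaddress.ip_address(addr)
        limits = V6_LIMITS if ip.version == 6 else V4_LIMITS
        pending = []
        for plen, limit in limits.items():
            net = ipaddress.ip_network((addr, plen), strict=False)
            key = (str(net), plen)
            start, count = counters.get(key, (now, 0))
            if now - start >= WINDOW:
                start, count = now, 0        # daily window rolled over
            if count >= limit:
                return False                 # any level over budget rejects
            pending.append((key, start, count))
        for key, start, count in pending:
            counters[key] = (start, count + 1)
            counters.move_to_end(key)
        while len(counters) > MAX_ENTRIES:   # evict oldest ("low risk") entries
            counters.popitem(last=False)
        return True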

Even for v4, as an attacker ... well, as I'm sitting here right now I've
got direct access to almost a /20 (4096 addresses).  I know a number of
people with larger scopes than that.  Use bot-nets and the scope goes up
even more.

>
>
>     Option 2: proof-of-work
>     =======================
>     An alternative of using a proof-of-work algorithm was suggested to me
>     yesterday.  The idea is that every submission has to be
>     accompanied with
>     the result of some cumbersome calculation that can't be trivially run
>     in parallel or optimized out to dedicated hardware.
>
>     On the plus side, it would rely more on actual physical hardware
>     than IP
>     addresses provided by ISPs.  While it would be a waste of CPU time
>     and memory, doing it just once a week wouldn't be that much harm.
>
>     On the minus side, it would penalize people with weak hardware.
>
>     For example, 'time hashcash -m -b 28 -r test' gives:
>
>     - 34 s (-s estimated 38 s) on Ryzen 5 3600
>
>     - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
>     At the same time, it would still permit a lot of fake
>     submissions.  For
>     example, randomx [1] claims to require 2G of memory in fast mode. 
>     This
>     would still allow me to use 7 threads.  If we adjusted the
>     algorithm to
>     take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
>     submissions a day.
>
>     So in the end, while this is interesting, it doesn't seem like
>     a workable anti-spam measure.
>
Indeed.  This was considered for email spam protection as well, about
two decades back, amongst other proposals.

Perhaps some crazy proof-of-work for registration of a token, but
given how cheap it is to lease CPU cycles you'd need to balance the
effects.  And given botnets ... using other people's hardware for the
proof-of-work doesn't seem inconceivable (bitcoin miners embedded in
web pages being an example of the stuff people pull).
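
If a proof-of-work were used only to *register* a token (rather than
per submission), a hashcash-style partial-preimage check is about the
simplest form it could take.  A rough sketch, assuming Python and
SHA-256; the function names and challenge format are made up for
illustration, and as noted it only raises the cost, it doesn't stop
someone leasing cycles or a botnet:

    import hashlib
    import secrets

    DIFFICULTY_BITS = 28  # comparable to the 'hashcash -b 28' timing quoted above

    def issue_challenge():
        """Server side: hand a random challenge to the would-be registrant."""
        return secrets.token_hex(16)

    def solve(challenge):
        """Client side: brute-force a nonce whose hash has enough leading zero bits."""
        target = 1 << (256 - DIFFICULTY_BITS)
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1

    def verify(challenge, nonce):
        """Server side: a single hash checks what took the client seconds to find."""
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))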

>
>
>     Option 3: explicit CAPTCHA
>     ==========================
>     A traditional way of dealing with spam -- require every new system
>     identifier to be confirmed by solving a CAPTCHA (or a few identifiers
>     for one CAPTCHA).
>
>     The advantage of this method is that it requires a real human work
>     to be
>     performed, effectively limiting the ability to submit spam.
>
Yea, one would think.  CAPTCHAs are massively intrusive and, in my
opinion, more effort than they're worth.

A CAPTCHA may be beneficial to *generate* a token, though.  In other
words: when generating a token, that token needs to be registered by
way of a CAPTCHA.


>     The disadvantage is that it is cumbersome to users, so many of
>     them will
>     just resign from participating.
>
Agreed.

>
>
>     Other ideas
>     ===========
>     Do you have any other ideas on how we could resolve this?
>
Generated token + hardware-based hash.  Rate limit the combination to 1/day.
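
To make that concrete, a hardware-based hash could be as simple as a
salted digest of a couple of local identifiers.  A minimal sketch,
assuming Python; the choice of /etc/machine-id and the DMI product
UUID as inputs is purely my assumption of what such a hash might be
built from:

    import hashlib
    from pathlib import Path

    # Hypothetical inputs -- what actually feeds a "hardware hash" would
    # be up to gander.  product_uuid is typically only readable by root.
    SOURCES = [
        Path("/etc/machine-id"),
        Path("/sys/class/dmi/id/product_uuid"),
    ]

    def hardware_hash(salt="goose-hw-v1"):
        """Derive a stable, non-reversible host identifier from local machine data."""
        h = hashlib.sha256(salt.encode())
        for path in SOURCES:
            try:
                h.update(path.read_bytes().strip())
            except OSError:
                h.update(b"missing")  # keep the hash stable if a source is absent
        return h.hexdigest()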

Don't include submitted results until they've been kept up to date for
a minimum period; say, updated at least 20 times in 30 days.

I note that currently you can submit once every 7 days; I'd change
this approach to something like the following (a rough sketch in code
follows the list):

* Update the results as often as you wish, but at most every 23 hours
(basically aim at submitting daily).
* Expire all results that haven't been updated in X number of days
(I'd use 7 here out of hand).
* Expire the token after 30 days of not being kept up to date and
require going through the initial registration again.
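
Server-side, those three windows could be enforced with something this
small.  A sketch only, assuming Python; the record layout and return
values are invented for illustration:

    from datetime import datetime, timedelta

    # Windows taken from the rules above.
    MIN_INTERVAL = timedelta(hours=23)  # at most one submission per 23 hours
    RESULT_TTL = timedelta(days=7)      # results expire if not refreshed
    TOKEN_TTL = timedelta(days=30)      # token expires, redo the initial work

    def may_submit(last_seen, now=None):
        """True if this (token, hw hash) pair is allowed to submit again."""
        now = now or datetime.utcnow()
        return last_seen is None or now - last_seen >= MIN_INTERVAL

    def classify(last_seen, now=None):
        """Decide what happens to a record last updated at `last_seen`."""
        now = now or datetime.utcnow()
        age = now - last_seen
        if age >= TOKEN_TTL:
            return "expire-token"       # require the initial registration again
        if age >= RESULT_TTL:
            return "drop-results"       # keep the token, stop counting its data
        return "fresh"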

The downside here is that many machines are not powered up at least once
a day to be able to perform that initial submission sequence.  So
perhaps it's a bit stringent.

So a single token can submit for multiple hosts (cloned machines).

Both the token and the hardware hash can of course be tainted and are
under "attacker control".

>
> Sadly, the problem with IP addresses (in this case) is that they are
> anonymous. One can easily start an attack with thousands of IPs (all
> around the world).
>
> One solution would be to introduce user accounts:
> - one needs to register with an email
> - you can rate limit based on the client (not the IP)
>
> For example I've 200 servers, I'd create one account, verify my email
> (maybe captcha too) and deploy a config with my token on all servers.
> Then I'd setup a cron job on every server to submit stats. A token can
> have some lifetime and you could create a new one when the old is
> about to expire.

I, sadly so, agree with this.  I'm quite happy to register an account,
and on each machine during gander --init enter a username + password to
link my email-based token to the host.

If machine gets cloned, the hardware hash takes care of conflicts.

Else, during gander --init some proof-of-work may be OK to generate an
anonymous token.

Rate limit *token generation* against IP address for anonymous tokens.

For anonymous tokens, only a single hardware hash is allowed; if the
hardware hash changes, re-require proof-of-work and discard the data.

Final summary:

# gander --init --account j...@uls.co.za [--mayclone]
Password: ?????
... generate HW hash if not --mayclone
... submit credentials (via https please).
... get token.
#

Or:

# gander --init --anonymous
... contact server, get work
... do work
... generate HW hash
... submit HW hash + proof of work
... get token

Now each (token + hash) pair can submit at most once every 23 hours;
discard the data after 7 days if it's not kept up to date.

Anonymous tokens are linked to a HW hash.  User accounts get to issue
tokens as needed, and each token has a flag that controls whether or
not the HW hash may change.  This is more for the benefit of those of
us who do make use of clones, and the explicit --mayclone requirement
is there to prevent accidental error.
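
Pulling the two flows together, the server-side issuance and HW-hash
check might look roughly like this.  Again just a sketch in Python
under the assumptions above; the class and return values are invented:

    import secrets
    from dataclasses import dataclass

    @dataclass
    class Token:
        value: str
        anonymous: bool
        hw_hash: str              # hash bound at gander --init time
        may_clone: bool = False   # only account tokens may opt in via --mayclone

    def issue_token(anonymous, hw_hash, may_clone=False):
        """Issue a token once the account login or proof-of-work step has passed."""
        if anonymous and may_clone:
            raise ValueError("anonymous tokens are bound to one hardware hash")
        return Token(secrets.token_urlsafe(32), anonymous, hw_hash, may_clone)

    def check_submission(token, hw_hash):
        """Map an incoming (token, hw hash) pair onto the rules above."""
        if hw_hash == token.hw_hash:
            return "accept"
        if token.anonymous:
            return "redo-proof-of-work"  # new hardware: discard data, start over
        if token.may_clone:
            return "accept"              # cloned machines explicitly allowed
        return "reject"                  # likely an accidental clone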

Kind Regards,
Jaco
