Hi all,

I believe we have all forgotten Donald Knuth: premature optimisation is the root of all evil.
We don't have any "spam" yet, but we are already trying to protect against it. Some systems may post stats more often than we want, but that probably will not harm us. Or the posts may come from our main users who run 1kk (a million) Gentoo installations, in which case this "spam" is actually valuable.

Moreover, nobody forces us to treat the info from 'goose' as first priority, so we are still free to choose which packages to work on.

In short: I think this topic is not that important yet.

Viktar

On Thu, May 21, 2020, 16:28 Jaco Kroon <j...@uls.co.za> wrote:

> Hi Michał,
>
> On 2020/05/21 13:02, Michał Górny wrote:
> > On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
> >> Even for v4, as an attacker ... well, as I'm sitting here right now I've
> >> got direct access to almost a /20 (4096 addresses). I know a number of
> >> people with larger scopes than that. Use bot-nets and the scope goes up
> >> even more.
> > See how unfair the world is! You are filling your bathtub with IP
> > addresses, and my ISP has taken mine only recently.
> I must admit, I work for an ISP :$
> >>> Option 3: explicit CAPTCHA
> >>> ==========================
> >>> A traditional way of dealing with spam -- require every new system
> >>> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> >>> for one CAPTCHA).
> >>>
> >>> The advantage of this method is that it requires a real human work
> >>> to be performed, effectively limiting the ability to submit spam.
> >>>
> >> Yea. One would think. CAPTCHAs are massively intrusive and in my
> >> opinion more effort than they're worth.
> >>
> >> This may be beneficial to *generate* a token. In other words - when
> >> generating a token, that token needs to be registered by way of captcha.
> >>
> >>> Other ideas
> >>> ===========
> >>> Do you have any other ideas on how we could resolve this?
> >>>
> >> Generated token + hardware based hash.
> > How are you going to verify that the hardware-based hash is real,
> > and not just a random value created to circumvent the protection?
>
> So the generation of the hash is more to validate that it's still on the
> same installation (ie, not a cloned token). Sorry if that wasn't clear,
> so trying to solve two possible problems in one go.
>
> >> Rate limit the combination to 1/day.
> >>
> >> Don't use included results until it's been kept up to date for a minimum
> >> period. Say updated at least 20 times 30 days.
> > For privacy reasons, we don't correlate the results. So this is
> > impossible to implement.
>
> Ok, but a token cannot (unless we issue it based on an email based
> account) be linked back to a specific user, so does it matter if we
> associate uploads with a token?
>
> >> The downside here is that many machines are not powered up at least once
> >> a day to be able to perform that initial submission sequence. So
> >> perhaps it's a bit stringent.
> > Exactly. Even once a week is a bit risky but once a day is too narrow
> > a period.
> >
> > To some degree, we could decide we don't care about exact numbers
> > as much as some degree of weighed proportions. This would mean that,
> > say, people who submit daily get the count of 7, at the loss of people
> > who don't run their machines that much. It would effectively put more
> > emphasis on more active users. It's debatable whether this is desirable
> > or not.
> Decaying averages. Simple to implement, don't need all historic data.
>
> >> Both the token and hardware hash can of course be tainted and is under
> >> "attacker control".
> > Exactly. So it really looks like exercise for the sake of exercise.
>
> Unless tokens are *issued* as per the rest of my email you snipped
> away. Wherein I proposed an issuing of both anonymous and non-anonymous
> tokens.
>
> Kind Regards,
> Jaco
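
For illustration only, here is a minimal sketch of the "decaying averages" idea quoted above, to show why it does not need all historic data. Nothing in it comes from the actual stats service: the class name `DecayingCounter`, the 30-day half-life, and the use of day numbers as timestamps are all assumptions made up for this example. Each package would keep just one running value and the time of its last update.

```python
import math


class DecayingCounter:
    """Hypothetical per-package counter with exponential decay.

    Only two numbers are stored (current value and last update time),
    so no history of individual submissions has to be kept.
    """

    def __init__(self, half_life_days: float = 30.0) -> None:
        # Assumed half-life: a submission counts half as much after 30 days.
        self.decay_rate = math.log(2) / half_life_days
        self.value = 0.0
        self.last_day = 0.0

    def _decayed(self, day: float) -> float:
        # Value as seen at `day`; timestamps in the past cause no decay.
        elapsed = max(0.0, day - self.last_day)
        return self.value * math.exp(-self.decay_rate * elapsed)

    def add(self, day: float, weight: float = 1.0) -> None:
        # Decay the stored value up to `day`, then add the new submission.
        self.value = self._decayed(day) + weight
        self.last_day = max(self.last_day, day)

    def read(self, day: float) -> float:
        # Read without modifying the stored state.
        return self._decayed(day)


counter = DecayingCounter(half_life_days=30.0)
counter.add(day=0)          # one submission on day 0
counter.add(day=30)         # another one a month later
print(counter.read(day=30))
```

With this scheme a machine that submits daily still ends up weighted more heavily than one that submits weekly, so it does not by itself settle the question of whether emphasising more active users is desirable.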