The topic of rebooting gentoostats comes up here from time to time.
Unless I'm mistaken, all the efforts so far have been superficial,
lacking a clear plan and unwilling to research the problems.  I'd like
to start a serious discussion focused on the issues we need to solve,
and propose some ideas for how we could solve them.

I can't promise I'll find time to implement it.  However, I'd like to
get a clear plan on how it should be done if someone actually does it.

The big questions
The way I see it, the primary goal of the project would be to gather
statistics on popularity of packages, in order to help us prioritize our
attention and make decisions on what to keep and what to remove.  Unlike
Debian's popcon, I don't think we really want to try to investigate
which files are actually used but focus on what's installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

   a. list of installed packages?
   b. versions (or slots?) of installed packages?
   c. USE flags on installed packages?
   d. world and world_sets files
   e. system profile?
   f. enabled repositories? (possibly filtered to official list)
   g. distribution?

I think d. is most important as it gives us information on what users
really want.  a. alone is largely redundant if we have d.  c. might
have some value when deciding whether to mask a particular flag (and
implies a.).

e. would be valuable if we wanted to determine the future of particular
profiles, as well as e.g. estimate the transition to new versions.

f. would be valuable to determine which repositories are used but we
need to filter private repos from the output for privacy reasons.

g. could be valuable in correlation with other data but not sure if
there's much direct value alone.
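To make the discussion concrete, here is a rough Python sketch of how
the data points above could be assembled into a submission payload.
Everything here (`build_payload`, the input format) is hypothetical,
not an existing tool:

```python
import json

def build_payload(world_text, profile, repos, official_repos):
    """Assemble a hypothetical submission payload from raw inputs.

    world_text: contents of /var/lib/portage/world
    profile: the system profile name (e.g. the target of
             /etc/portage/make.profile)
    repos: names of enabled ebuild repositories
    official_repos: repositories we are willing to report
                    (the privacy filter from point f.)
    """
    world = sorted(line.strip() for line in world_text.splitlines()
                   if line.strip())
    return {
        "world": world,                                     # d.
        "profile": profile,                                 # e.
        # f., filtered to the official list for privacy
        "repos": sorted(set(repos) & set(official_repos)),
    }

payload = build_payload(
    "app-editors/vim\nwww-client/firefox\n",
    "default/linux/amd64/17.1",
    ["gentoo", "my-private-overlay"],
    ["gentoo", "guru"],
)
print(json.dumps(payload, indent=2))
```

Note that the private overlay never appears in the output; only
repositories on the official list survive the intersection.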

2. How to handle Gentoo derivatives?  Some of them could provide
meaningful data but some could provide false data (e.g. when derivatives
override Gentoo packages).  One possible option would be to filter a.-e. 
to stuff coming from ::gentoo.

3. How to keep the data up-to-date?  After all, if we just stack a lot
of old data, we will soon stop getting meaningful results.  I suppose
we'll need to timestamp all data and remove old entries.

4. How to avoid duplication?  If some users submit their results more
often than others, they would bias the results.  3. might be related.

5. How to handle clusters?  Things are simple if we can assume that
people will submit data for a few distinct systems.  But what about
companies that run 50 Gentoo machines with the same or similar setup? 
What about clusters of 1000 almost identical containers?  Big entities
could easily bias the results but we should also make it possible for
them to participate somehow.

6. Security.  We don't want to expose information that could be
correlated to specific systems, as it could disclose their potential
vulnerabilities.

7. Privacy.  Besides the above, our sysadmins would appreciate if
the data they submitted couldn't be easily correlated to them.  If we
don't respect privacy of our users, we won't get them to submit data.

8. Spam protection.  Finally, the service needs to be resilient to being
spammed with fake data.  Both to users who want to make their packages
look more important, and to script kiddies that want to prove a point.

My (partial) implementation idea
I think our approach should be oriented on privacy/security first,
and attempt to make the best of the data we can get while respecting
this principle.  This means no correlation and no tracking.

Once the tool is installed, the user needs to opt-in to using it.  This
involves accepting a privacy policy and setting up a cronjob.  The tool
would suggest a (random?) time for submission to take place periodically
(say, every week).
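A minimal sketch of how the tool could suggest such a randomized
weekly slot, assuming a hypothetical `gentoostats-submit` command:

```python
import random

def suggest_cron_line(command="gentoostats-submit"):
    """Suggest a weekly crontab entry (minute hour * * weekday) at
    a random time, so submissions are spread across the week rather
    than all arriving at once."""
    rng = random.Random()
    minute = rng.randrange(60)
    hour = rng.randrange(24)
    weekday = rng.randrange(7)
    return f"{minute} {hour} * * {weekday} {command}"

print(suggest_cron_line())
```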

The submission would contain only raw data, without any identification
information.  It would be encrypted using our public key.  Once
uploaded, it would be put into our input queue as-is.

Periodically the input queue would be processed in bulk.  The individual
statistics would be updated and the input would be discarded.  This
should prevent people trying to correlate changes in statistics with
individual uploads.

Each counted item would have a timestamp associated, and we'd discard
old items per resubmission period.  This should ensure that we keep
fresh data and people can update their earlier submissions without
storing identification data.

For example, N users submit their data containing a list of packages
every week.  This data is used in bulk to update counts of individual
packages (technically, to append timestamps to list corresponding to
these packages).  Data older than one week is discarded, so we have
rough counts of package use during the last week.

I think this addresses problems 3./6./7.
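A rough sketch of the bulk processing and expiry described above; the
class and storage layout are hypothetical:

```python
import time
from collections import defaultdict

ONE_WEEK = 7 * 24 * 3600

class PackageStats:
    """Per-package submission timestamps; old entries expire so the
    counts reflect roughly the last week."""

    def __init__(self):
        self.timestamps = defaultdict(list)  # package -> [times]

    def process_queue(self, submissions, now=None):
        """Ingest a batch of raw submissions (each a list of
        packages) in bulk; the raw inputs are then discarded, so
        only the aggregates remain."""
        now = time.time() if now is None else now
        for packages in submissions:
            for pkg in set(packages):  # each package once per entry
                self.timestamps[pkg].append(now)
        self.prune(now)

    def prune(self, now):
        """Drop timestamps older than the resubmission period."""
        cutoff = now - ONE_WEEK
        for pkg in list(self.timestamps):
            fresh = [t for t in self.timestamps[pkg] if t >= cutoff]
            if fresh:
                self.timestamps[pkg] = fresh
            else:
                del self.timestamps[pkg]

    def count(self, pkg):
        return len(self.timestamps.get(pkg, []))
```

Since only timestamps are appended and the raw queue is dropped after
each batch, nothing in the stored data points back to an individual
submission.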

The other major problem is spam protection.  The best semi-anonymous
way I can see is to use submitters' IPv4 addresses (can we support
IPv6 then?).  We could set a limit of, say, 10 submissions per IPv4
address per week.  If an address exceeded that limit, we could require
solving a CAPTCHA before accepting further submissions.

I think this would make spamming a bit harder while keeping
submissions easy for most users, and a little harder but still
possible for those of us behind ISP NATs.

This should address problems 4./8. and maybe 5. to some degree.
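The per-address limit could look roughly like this;
`SubmissionLimiter` is a hypothetical sketch, not a concrete design:

```python
class SubmissionLimiter:
    """Allow up to `limit` submissions per address per rolling
    window; past the limit, signal that a CAPTCHA should be
    required before accepting more."""

    def __init__(self, limit=10, window=7 * 24 * 3600):
        self.limit = limit
        self.window = window
        self.seen = {}  # address -> [submission times]

    def check(self, address, now):
        """Return True if the submission is accepted outright,
        False if a CAPTCHA challenge should be demanded."""
        times = [t for t in self.seen.get(address, [])
                 if t > now - self.window]
        accepted = len(times) < self.limit
        if accepted:
            times.append(now)
        self.seen[address] = times
        return accepted
```

Entries age out of the window on their own, so a NATed address that
hit the limit one week regains its quota the next.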

A proper solution to the cluster problem would probably involve some
way to internally collect and combine data before submission.  If you
have large clusters of similar systems, I think you'd want to have all
packages used across different systems reported as one entry.
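Such client-side combining could be as simple as deduplicating the
package lists across hosts before submitting; a hypothetical sketch:

```python
def merge_cluster(host_package_lists):
    """Combine per-host package lists from a cluster into a single
    deduplicated submission, so 1000 near-identical containers
    count as one entry rather than 1000."""
    merged = set()
    for packages in host_package_lists:
        merged.update(packages)
    return sorted(merged)
```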

I think we should collect data from users running all Gentoo
derivatives, as long as they are using Gentoo packages.  The simplest
solution I can think of would be to filter the results to packages (or
profiles) installed from ::gentoo.  This will work only for distros
that expose ::gentoo explicitly (vs. copying our ebuilds into their
own repositories), though.
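A sketch of that filtering, assuming we can map each installed package
to the repository it came from (on a real system that information
lives in the installed-package database, e.g. the repository recorded
under /var/db/pkg; the input format here is made up):

```python
def filter_gentoo_only(installed):
    """Keep only packages installed from ::gentoo, dropping entries
    that a derivative provides from its own repository.

    `installed` maps a package atom to the name of the repository
    it was installed from."""
    return sorted(pkg for pkg, repo in installed.items()
                  if repo == "gentoo")
```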

What do you think?  Do you foresee other problems?  Do you have other
needs?  Can you think of better solutions?

Best regards,
Michał Górny
