Hi everyone,

gentoostats is new to me and I'm not aware of previous discussions or
implementations. But from what I could understand from the comments and
Michał Górny's explanation, I'd like to draw your attention to the
Octoverse[1] initiative.

Perhaps the collected statistics could become a platform for gathering
additional metadata from user contributions. What I mean is a broker
that collects all statistics within an organization internally and then
publishes them at the end. Such a solution would add value for
enterprise statistics and would also contribute back to Gentoo.

Each broker could then use git authentication to publish its results as
a merge request that runs the necessary hooks on the Gentoo side. All we
would need is a specification document for data parsing.
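To make the idea concrete, here is a minimal sketch of what such a
specification document might define, assuming a JSON payload. All field
names here are my own invention, not any existing format:

```python
import json

# Hypothetical aggregated payload a broker might publish upstream.
# The field names are illustrative only; a real specification would
# need to be agreed on the Gentoo side first.
payload = {
    "format": "gentoostats-broker/1",
    "machines": 42,                 # systems aggregated by the broker
    "packages": {
        "app-editors/vim": 40,      # package -> install count
        "dev-lang/python": 42,
    },
}

serialized = json.dumps(payload, sort_keys=True)
print(serialized)
```

The point is only that a broker submits one aggregate entry instead of
42 near-identical individual uploads.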

Sorry if my comment is completely out of context, but such an Octoverse
for Gentoo would be very interesting from my perspective.



[1] https://octoverse.github.com/

On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
> There are a few important questions that need to be answered first:
> 1. Which data do we need to collect?
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)
>    g. distribution?
> I think d. is most important as it gives us information on what users
> really want.  a. alone is kinda redundant if we have d.  c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
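As a rough illustration of how little code the collection side needs,
here is a sketch gathering items a. and d. from a Portage-style layout.
The paths are parameterized so it can run anywhere; on a real system
they would be /var/lib/portage/world and /var/db/pkg:

```python
import os

def collect(world_file, vdb_root):
    """Gather the world set (item d.) and the list of installed
    packages (item a.) from a Portage-style on-disk layout."""
    with open(world_file) as f:
        world = sorted(line.strip() for line in f if line.strip())
    # /var/db/pkg is laid out as <category>/<package>-<version>/
    installed = sorted(
        f"{cat}/{pkg}"
        for cat in os.listdir(vdb_root)
        for pkg in os.listdir(os.path.join(vdb_root, cat))
    )
    return {"world": world, "installed": installed}
```

This deliberately reads the filesystem directly; a real tool would more
likely go through the Portage API.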
> 2. How to handle Gentoo derivatives?  Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages).  One possible option would be to filter a.-e. 
> to stuff coming from ::gentoo.
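The ::gentoo filter mentioned above could be as simple as checking each
installed package's originating repository (a sketch; on a real system
that information comes from the VDB entry for each package):

```python
def filter_official(packages, official="gentoo"):
    """Keep only packages whose originating repository is ::gentoo,
    dropping anything a derivative installed from its own overlay.

    `packages` is an iterable of (package, repository) pairs."""
    return [pkg for pkg, repo in packages if repo == official]
```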
> 3. How to keep the data up-to-date?  After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results.  I suppose
> we'll need to timestamp all data and remove old entries.
> 4. How to avoid duplication?  If some users submit their results more
> often than others, they would bias the results.  3. might be related.
> 5. How to handle clusters?  Things are simple if we can assume that
> people will submit data for a few distinct systems.  But what about
> companies that run 50 Gentoo machines with the same or similar setup? 
> What about clusters of 1000 almost identical containers?  Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
> 6. Security.  We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
> 7. Privacy.  Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them.  If we
> don't respect privacy of our users, we won't get them to submit data.
> 8. Spam protection.  Finally, the service needs to be resilient to being
> spammed with fake data.  Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.
> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle.  This means no correlation and no tracking.
> Once the tool is installed, the user needs to opt-in to using it.  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).
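The opt-in step could boil down to the tool writing a crontab entry
with randomized time fields, e.g. (the gentoostats-submit command name
is hypothetical, and the numbers are example values):

```shell
# Example crontab line the tool might generate: submit once a week,
# at a day/hour/minute picked at random during opt-in.
17 4 * * 3  /usr/bin/gentoostats-submit
```

Randomizing the slot per machine also spreads the load on the server
instead of everyone submitting at the same moment.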
> The submission would contain only raw data, without any identification
> information.  It would be encrypted using our public key.  Once
> uploaded, it would be put into our input queue as-is.
> Periodically the input queue would be processed in bulk.  The individual
> statistics would be updated and the input would be discarded.  This
> should prevent people trying to correlate changes in statistics with
> individual uploads.
> Each counted item would have a timestamp associated, and we'd discard
> old items per resubmission period.  This should ensure that we keep
> fresh data and people can update their earlier submissions without
> storing identification data.
> For example, N users submit their data containing a list of packages
> every week.  This data is used in bulk to update counts of individual
> packages (technically, to append timestamps to list corresponding to
> these packages).  Data older than one week is discarded, so we have
> rough counts of package use during the last week.
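The bulk-update step described above fits in a few lines: each package
maps to a list of submission timestamps, and counting means dropping
everything older than the window. A sketch, assuming a one-week window:

```python
import time

WEEK = 7 * 24 * 3600  # resubmission period, in seconds

class PackageStats:
    """Per-package timestamp lists, pruned to a rolling one-week window."""

    def __init__(self):
        self.seen = {}  # package -> list of submission timestamps

    def ingest(self, packages, now=None):
        """Process one anonymous submission: append a timestamp per
        package, then discard the submission itself."""
        now = time.time() if now is None else now
        for pkg in packages:
            self.seen.setdefault(pkg, []).append(now)

    def counts(self, now=None):
        """Rough per-package use counts over the last week; prunes
        entries older than the window as a side effect."""
        now = time.time() if now is None else now
        cutoff = now - WEEK
        for pkg in list(self.seen):
            self.seen[pkg] = [t for t in self.seen[pkg] if t >= cutoff]
            if not self.seen[pkg]:
                del self.seen[pkg]
        return {pkg: len(ts) for pkg, ts in self.seen.items()}
```

Because only timestamps are stored, nothing ties a count back to the
upload it came from.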
> I think this addresses problems 3./6./7.
> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitters' IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
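A sketch of that per-address limit (purely illustrative; a real service
would persist the counters and reset them on the resubmission boundary):

```python
from collections import Counter

LIMIT = 10  # submissions allowed per address per period

class RateLimiter:
    """Count submissions per source address; past the limit, the
    service would demand CAPTCHA authorization instead of rejecting."""

    def __init__(self, limit=LIMIT):
        self.limit = limit
        self.counts = Counter()

    def allow(self, addr):
        """True if this submission may proceed without a CAPTCHA."""
        self.counts[addr] += 1
        return self.counts[addr] <= self.limit

    def reset(self):
        """Called once per resubmission period (e.g. weekly)."""
        self.counts.clear()
```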
> I think this would make spamming a bit harder while keeping submissions
> easy for most, and a little harder but possible for those of us
> behind ISP NATs.
> This should address problems 4./8. and maybe 5. to some degree.
> A proper solution to cluster problem would probably involve some way to
> internally collect and combine data before submission.  If you have
> large clusters of similar systems, I think you'd want to have all
> packages used on different systems reported as one entry.
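Combining before submission could look like this on the aggregating
side: merge the package lists of many similar hosts into one entry,
keeping a machine count instead of N near-identical uploads (a sketch):

```python
from collections import Counter

def combine(host_package_lists):
    """Merge per-host package lists into a single cluster-level entry:
    a machine count plus, per package, how many hosts have it installed."""
    merged = Counter()
    for packages in host_package_lists:
        merged.update(set(packages))  # count each host at most once per package
    return {"machines": len(host_package_lists), "packages": dict(merged)}
```

The server could then weight such an entry however it decides to handle
clusters, rather than guessing from duplicate-looking uploads.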
> I think we should collect data from users running all Gentoo
> derivatives, as long as they are using Gentoo packages.  The simplest
> solution I can think of would be to filter the results on packages (or
> profiles) installed from ::gentoo.  This will work only for distros that
> expose ::gentoo explicitly (vs copying our ebuilds to their
> repositories) though.
> What do you think?  Do you foresee other problems?  Do you have other
> needs?  Can you think of better solutions?
