Collecting data from Fedora user community

przemek klosowski via devel Wed, 07 Jul 2021 13:26:05 -0700

We had several discussions recently that could use some real-world data on e.g.:


- x86_64-v2 prevalence


- GUILE usage in make/gdb

- count of systems with UEFI/GPT vs BIOS/MBR

- debugd server usage

- etc

The common thread is that some sort of measurement would help in figuring out the best technical solution for the Fedora community, but such measurement would require transmitting and collecting data in the Fedora infrastructure. Such telemetry is usually criticized from two related angles: a general objection to online telemetry, and a practical argument about the complicated legal ramifications of online data collection in multiple jurisdictions. Stephen recently responded with an eloquent argument why it's very hard to come up with an acceptable collection scheme (see below).

This problem is of course not unique to Fedora: everyone is in the same boat of finding a middle ground between anonymity and indiscriminate data collection. I think most people would agree that data-driven decisions are better than gut feeling-based ones, so it is to our benefit to control and possibly allow _some_ data to be used for that purpose.

A few days ago I attended a talk by a practicing data scientist working with social data, where the privacy issues are even more important: some of the data can literally have life-altering consequences for the people covered by the collection. I asked him about best practices and guidelines for responsible data collection, and he directed me to this presentation:


https://the-engine-room.github.io/responsible-data-handbook/pages/slides.html

My personal take-aways from reading this material are:

- we're not alone in this: other people have thought through these issues and come up with workable ideas

- it helps to keep things in perspective: the data about people's computers is less consequential than e.g. data about their ethnicity or politics

- there are legal requirements for consent (notice/disclosure), but they are workable

- it matters how the data is used: publishing full logs vs. using the aggregated data for internal improvement


Anyway, this is my personal, uneducated take on it. Hope it is helpful.


On 6/18/21 9:39 AM, Stephen John Smoogen wrote:

On Fri, 18 Jun 2021 at 01:51, Gerd Hoffmann <kra...@redhat.com> wrote:

Hi,

The problem with this is that we are taking a fairly fuzzy data set
and making it much easier to track individual users in ways seen as
problematic by various laws and regulations.

Well, depends on how you store the data.  You can store one record per
machine (with all properties in there), or you can store one record per
property per machine.

With the latter you basically kill queries on subgroups (like "how many
x86_64-v3 machines use UEFI?") because that grouping information is gone
if you store each and every little piece of information in its own
record.  But it'll also be much harder to do fingerprinting on such a
database ...
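
To make the two storage layouts concrete, here is a minimal Python sketch.
It is only an illustration of the trade-off described above, not a proposed
collection scheme: the sample reports, the field names (arch_level,
firmware) and the helper count_subgroup are all made up for the example.

```python
from collections import Counter

# Hypothetical reports from three machines; field names are illustrative.
reports = [
    {"arch_level": "x86_64-v3", "firmware": "UEFI"},
    {"arch_level": "x86_64-v3", "firmware": "BIOS"},
    {"arch_level": "x86_64-v2", "firmware": "UEFI"},
]

# Layout A: one record per machine. Properties stay linked, so subgroup
# queries work, but each full record is a potential fingerprint.
per_machine = list(reports)


def count_subgroup(records, **wanted):
    """Count records whose properties match every key/value in `wanted`."""
    return sum(all(r.get(k) == v for k, v in wanted.items()) for r in records)


# Layout B: one record per property per machine, with no machine identifier.
# Only per-property totals survive; the link between properties is gone.
per_property = Counter(
    (name, value) for r in reports for name, value in r.items()
)

# Layout A can answer "how many x86_64-v3 machines use UEFI?"
v3_with_uefi = count_subgroup(per_machine, arch_level="x86_64-v3", firmware="UEFI")

# Layout B can only answer each question separately.
v3_total = per_property[("arch_level", "x86_64-v3")]
uefi_total = per_property[("firmware", "UEFI")]
```

With this sample, Layout B reports two x86_64-v3 machines and two UEFI
machines, but it cannot say whether the overlap is one machine or two. Only
Layout A can, and it does so by keeping each machine's properties together.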

Standard disclaimer: IANAL.

The problem with IANAL, is that we all come up with great solutions
which seem to match the single document we read. However the law is an
interpreted language where every court is a slightly different
architecture and has different libraries which have to be slowly
interpreted and patched at a top level. This means that you end up
with finding out that the document and 2500 years of law rulings have
to be interpreted together.

The way things are interpreted currently, it doesn't matter that you
stored it differently.. it matters that you collected it... mainly
because there is a long history of people finding ways to de-anonymize
data, people lying about anonymizing it, and people somehow collecting
the data in the middle. Because of that you end up having to delete
all the data when someone asks to be deleted because you can't prove
this record/count was their system or not.

In general we computer people like to dive in and just collect data
and go about doing analysis. The various privacy laws are written to
make us do a LOT of hard work before we start doing that. You end up
spending a lot of time with lawyers versed in European, Brazilian, and
various other countries' laws/regulations/past history to figure out
what you can collect, how you can collect it, how you are going to
delete it, how you are going to inform people that things are
happening, and having clear processes that are followed. Then you can
start writing the code.. while doing that you have to review the code
to make sure it is still meeting current rulings.  [Doing it another
way ends up with you writing code and either finding you have to
delete it all or waiting months for an approval before rolling it
out.]
