Hey Yuvi, this sounds like very interesting data to look at. Here are my thoughts:
- the Anonymization scheme sounds reasonable, and I'd like to hear from
  someone else @ wikimedia who has similar experience anonymizing data sets
- you were probably already thinking about it, but we need documentation
  too: a wikipage with the name of the table, data dictionary, etc... and
  even a blog post to announce the newly available data.

On Sun, Aug 24, 2014 at 5:21 PM, Yuvi Panda <[email protected]> wrote:
> Hello!
>
> I've been working for the last few days on
> https://github.com/Ironholds/WPDMZ, which currently generates raw data
> on 'number of non-bot edits per country', and I'd like to run some
> stats / make some graphs based on it. Since I'd like all my
> 'research' to be completely repeatable, I'd love it if we can make the
> 'raw data' (edits per country) publicly available on labsdb. I have
> most of the code written for it, *but* it needs anonymization.
>
> The biggest de-anonymization threats involve identifying which editors
> come from which countries, and can be executed in the following case:
>
> An editor is the only person editing from a country in a project where
> the country has low edit volume, and by a process of elimination /
> counting edits from a public source (like recentchanges), the
> individual editor can be connected to a particular country
>
> I propose the following Anonymization scheme:
>
> 1. No data for projects with less than a threshold of total
> *individual editors* in the time period for which the data is
> released.
> 2. For countries that have less than a threshold % of 'individual
> editors' in the time period, we just simply lump them in as 'other'.
>
> This removes most anonymization attacks I can think of. Thoughts? I
> can easily write up the code to generate these on a monthly basis and
> puppetize those to make the data publicly available. I think not just
> me, but lots of external researchers would benefit from such data.
>
> Thanks!
>
> --
> Yuvi Panda T
> http://yuvi.in/blog
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
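
To make the two-step scheme quoted above concrete, here is a rough sketch of how I read it. This is not the WPDMZ code; the function name, the input shape ({project: {country: (individual_editors, edits)}}) and both threshold parameters are hypothetical, and it assumes each editor is counted under exactly one country so that per-country editor counts can be summed.

```python
from typing import Dict, Tuple

# Hypothetical input shape: {project: {country: (individual_editors, edits)}}
ProjectStats = Dict[str, Dict[str, Tuple[int, int]]]


def anonymize(stats: ProjectStats,
              min_editors: int,
              min_share: float) -> Dict[str, Dict[str, int]]:
    """Return edits per country, with both thresholds from the proposal applied.

    min_editors: projects with fewer total individual editors are dropped (step 1).
    min_share:   countries whose share of a project's editors is below this
                 fraction are merged into 'other' (step 2).
    """
    released: Dict[str, Dict[str, int]] = {}
    for project, by_country in stats.items():
        # Assumption: editors are not double-counted across countries.
        total_editors = sum(editors for editors, _ in by_country.values())

        # Step 1: release nothing for projects below the editor threshold.
        if total_editors == 0 or total_editors < min_editors:
            continue

        # Step 2: lump low-share countries together as 'other'.
        edits_out: Dict[str, int] = {}
        for country, (editors, edits) in by_country.items():
            if editors / total_editors < min_share:
                country = "other"
            edits_out[country] = edits_out.get(country, 0) + edits
        released[project] = edits_out
    return released


# Made-up numbers, purely for illustration.
example = {
    "enwiki": {"US": (90, 900), "IN": (9, 50), "TV": (1, 3)},
    "tinywiki": {"TV": (2, 5)},
}
result = anonymize(example, min_editors=10, min_share=0.05)
```

With these made-up numbers, tinywiki is dropped entirely and TV's edits on enwiki end up under 'other'. The actual threshold values are exactly what still needs to be agreed on.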
