Just an FYI that Legal have approved this release under the anonymisation procedures we've set out (thanks Michelle!) on the condition that Dario, too, is comfortable with them. Dario?

On 4 March 2015 at 17:16, Oliver Keyes <[email protected]> wrote:
> So it's distinct people, globally - and I deliberately made it wooly
> by operating over username, which means the threshold is fuzzy
> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>
> It's very deliberately dimension-free: user_agent,
> edit_count_in_non_specified_90_day_period, and that's it.
>
> On 4 March 2015 at 17:12, Aaron Halfaker <[email protected]> wrote:
>> Assuming this was public, I could use this data on seldom edited Wikis to
>> find out which editors likely have old browser/OS versions with
>> vulnerabilities that I could attack[1]. This would be easier and easier the
>> more dimensions you add to the data.
>>
>> <re-reads>
>>
>> OK. The anonymization strategy for dropping records that represent < 50
>> distinct editors seems to address this concern. 50 edits is a lot. So
>> this data wouldn't be too terribly useful for under-active wikis. Then
>> again, if you just want to get a sense for what the dominant browser/OS
>> pairs are, then they will likely represent > 50 unique editors on most
>> projects.
>>
>> 1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
>> implications of that one.
>>
>> On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes <[email protected]> wrote:
>>>
>>> Yeah, makes sense.
>>>
>>> On 3 March 2015 at 20:38, Nuria Ruiz <[email protected]> wrote:
>>> >>Agreed. Do we have a way of syncing files to Labs yet?
>>> > No need to sync if file is available in an endpoint like
>>> > http://some-data-here
>>> >
>>> > On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes <[email protected]>
>>> > wrote:
>>> >>
>>> >> On 3 March 2015 at 19:35, Nuria Ruiz <[email protected]> wrote:
>>> >> >>Erik has asked me to write an exploratory app for user-agent data. The
>>> >> >>idea is to enable Product Managers and engineers to easily explore
>>> >> >>what users use so they know what to support. I've thrown up an example
>>> >> >>screenshot at http://ironholds.org/agents_example_screen.png
>>> >> >
>>> >> > I cannot speak as to the interest of community about this data but for
>>> >> > developers and PM we should make sure we have a solid way to update
>>> >> > any data we put up. User Agent data is outdated as soon as a new
>>> >> > version of Android or iOS is released, a new popular phone comes along
>>> >> > or a new autoupdate for popular browsers. Not only that, if we make
>>> >> > changes to, say, redirect all iPad users to the desktop site we want
>>> >> > to assess effect of those changes as soon as possible. A monthly
>>> >> > update will be a must. Also distinguishing between browser percentages
>>> >> > on desktop site versus mobile site versus apps is a must for this data
>>> >> > to be real useful for PMs and developers (specially for bug triage).
>>> >> >
>>> >>
>>> >> Yes! However, I am addressing a specific ad-hoc request. If there is a
>>> >> need for this (I agree there is) I hope Toby and Kevin can eke out the
>>> >> time on the Analytics Engineering schedule to work on it; y'all are a
>>> >> lot better at infrastructure work than me :).
>>> >>
>>> >> >
>>> >> > We have couple backlog items to make monthly reports on this regard.
>>> >> > A UI on top of them will be superb.
>>> >> >
>>> >>
>>> >> Agreed. Do we have a way of syncing files to Labs yet? That's the
>>> >> biggest blocker. The UI doesn't care what the file contains as long as
>>> >> it's a TSV with a header row - I've deliberately built it so that
>>> >> things like the download links are dynamic and can change.
>>> >>
>>> >> >
>>> >> > On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes <[email protected]>
>>> >> > wrote:
>>> >> >>
>>> >> >> Hey all,
>>> >> >>
>>> >> >> (Sending this to the public list because it's more transparent and
>>> >> >> I'd like people who think this data is useful to be able to shout out)
>>> >> >>
>>> >> >> Erik has asked me to write an exploratory app for user-agent data.
>>> >> >> The idea is to enable Product Managers and engineers to easily explore
>>> >> >> what users use so they know what to support. I've thrown up an
>>> >> >> example screenshot at http://ironholds.org/agents_example_screen.png
>>> >> >> (I'd host it on Commons, inb4Dario, but I'm not sure the copyright
>>> >> >> status of the UI)
>>> >> >>
>>> >> >> One side-effect of this is that we end up with files of common user
>>> >> >> agents, split between {readers,editors} and {mobile, desktop}, parsed
>>> >> >> and unparsed. I'd like to release these files. The reuse potential is
>>> >> >> twofold; researchers and engineers can use the parsed files to see
>>> >> >> what browser penetration looks like globally and what browsers should
>>> >> >> be supported at a top-10, and software engineers can use the unparsed
>>> >> >> files to improve detection rates.
>>> >> >>
>>> >> >> The privacy implications /should/ be minimal, because of how this
>>> >> >> data is gathered. The editor data is gathered from the checkuser
>>> >> >> table, globally, and automatically excludes any user agent used by
>>> >> >> fewer than 50 distinct usernames. The reader data is gathered from a
>>> >> >> month of 1:1000 sampled log files, and excludes any agent responsible
>>> >> >> for fewer than 500 pageviews in a 24 hour period (except, sampled. So,
>>> >> >> practically speaking, that's 500,000 pageviews)
>>> >> >>
>>> >> >> What do people think about making this a data release? Would people
>>> >> >> get value from the data, as well as the tool?
>>> >> >>
>>> >> >> --
>>> >> >> Oliver Keyes
>>> >> >> Research Analyst
>>> >> >> Wikimedia Foundation
>>> >> >>
>>> >> >> _______________________________________________
>>> >> >> Analytics mailing list
>>> >> >> [email protected]
>>> >> >> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
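
For anyone who wants to see the editor-side threshold from the thread in concrete terms, here is a minimal sketch of the "drop any user agent used by fewer than 50 distinct usernames" rule. It is an illustration only, not the code used for the release: the function name `filter_user_agents`, the `(username, user_agent)` input shape, and the `min_users` parameter are all hypothetical, and reading from the checkuser table is not shown.

```python
from collections import defaultdict


def filter_user_agents(records, min_users=50):
    """Return {user_agent: distinct_username_count} for agents that meet the threshold.

    `records` is assumed to be an iterable of (username, user_agent) pairs.
    Counting distinct usernames (not rows) means repeat edits by one account
    do not push an agent over the threshold.
    """
    users_by_agent = defaultdict(set)
    for username, user_agent in records:
        users_by_agent[user_agent].add(username)

    # Keep only agents seen with at least `min_users` distinct usernames.
    return {
        agent: len(users)
        for agent, users in users_by_agent.items()
        if len(users) >= min_users
    }
```

Because the count is over usernames rather than people, the same person editing under the same name on several wikis may or may not be merged depending on how usernames are collected, which is the fuzziness Oliver describes in the thread.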
