Heads up: after a review with Legal, we decided that we should not release the sampled raw dataset. Oliver is now working on making parsed UA data available.
> On Mar 5, 2015, at 10:52 AM, Oliver Keyes <[email protected]> wrote:
>
> Just a clarifying note: Dario still needs to review the actual
> methodology. While Legal have approved it from their end, they've also
> made clear that this is contingent on the anonymisation methodology
> passing muster from an R&D point of view.
>
> On 5 March 2015 at 12:39, Oliver Keyes <[email protected]> wrote:
>> Just an FYI that Legal have approved this release under the
>> anonymisation procedures we've set out (thanks Michelle!) on the
>> condition that Dario, too, is comfortable with them. Dario?
>>
>> On 4 March 2015 at 17:16, Oliver Keyes <[email protected]> wrote:
>>> So it's distinct people, globally - and I deliberately made it
>>> woolly by operating over username, which means the threshold is
>>> fuzzy (i.e., at a minimum it's 50. At a maximum it's
>>> 50x[number of wikis]).
>>>
>>> It's very deliberately dimension-free: user_agent,
>>> edit_count_in_non_specified_90_day_period, and that's it.
>>>
>>> On 4 March 2015 at 17:12, Aaron Halfaker <[email protected]> wrote:
>>>> Assuming this was public, I could use this data on seldom edited
>>>> wikis to find out which editors likely have old browser/OS
>>>> versions with vulnerabilities that I could attack[1]. This would
>>>> be easier and easier the more dimensions you add to the data.
>>>>
>>>> <re-reads>
>>>>
>>>> OK. The anonymization strategy for dropping records that represent
>>>> < 50 distinct editors seems to address this concern. 50 edits is a
>>>> lot. So this data wouldn't be too terribly useful for under-active
>>>> wikis. Then again, if you just want to get a sense of what the
>>>> dominant browser/OS pairs are, then they will likely represent
>>>> > 50 unique editors on most projects.
>>>>
>>>> 1. Props to Matt Flaschen and Dan Andreescu for helping me work
>>>> through the implications of that one.
>>>>
>>>> On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes <[email protected]> wrote:
>>>>>
>>>>> Yeah, makes sense.
>>>>>
>>>>> On 3 March 2015 at 20:38, Nuria Ruiz <[email protected]> wrote:
>>>>>>> Agreed. Do we have a way of syncing files to Labs yet?
>>>>>> No need to sync if the file is available at an endpoint like
>>>>>> http://some-data-here
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes <[email protected]> wrote:
>>>>>>>
>>>>>>> On 3 March 2015 at 19:35, Nuria Ruiz <[email protected]> wrote:
>>>>>>>>> Erik has asked me to write an exploratory app for user-agent
>>>>>>>>> data. The idea is to enable Product Managers and engineers to
>>>>>>>>> easily explore what users use so they know what to support.
>>>>>>>>> I've thrown up an example screenshot at
>>>>>>>>> http://ironholds.org/agents_example_screen.png
>>>>>>>>
>>>>>>>> I cannot speak to the community's interest in this data, but
>>>>>>>> for developers and PMs we should make sure we have a solid way
>>>>>>>> to update any data we put up. User agent data is outdated as
>>>>>>>> soon as a new version of Android or iOS is released, a new
>>>>>>>> popular phone comes along, or a new auto-update for a popular
>>>>>>>> browser ships. Not only that, if we make changes to, say,
>>>>>>>> redirect all iPad users to the desktop site, we want to assess
>>>>>>>> the effect of those changes as soon as possible. A monthly
>>>>>>>> update will be a must. Also, distinguishing between browser
>>>>>>>> percentages on the desktop site versus the mobile site versus
>>>>>>>> apps is a must for this data to be really useful for PMs and
>>>>>>>> developers (especially for bug triage).
>>>>>>>>
>>>>>>>
>>>>>>> Yes! However, I am addressing a specific ad-hoc request. If
>>>>>>> there is a need for this (I agree there is) I hope Toby and
>>>>>>> Kevin can eke out the time on the Analytics Engineering schedule
>>>>>>> to work on it; y'all are a lot better at infrastructure work
>>>>>>> than me :).
>>>>>>>
>>>>>>>> We have a couple of backlog items to make monthly reports in
>>>>>>>> this regard. A UI on top of them would be superb.
>>>>>>>>
>>>>>>>
>>>>>>> Agreed. Do we have a way of syncing files to Labs yet? That's
>>>>>>> the biggest blocker. The UI doesn't care what the file contains
>>>>>>> as long as it's a TSV with a header row - I've deliberately
>>>>>>> built it so that things like the download links are dynamic and
>>>>>>> can change.
>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> (Sending this to the public list because it's more transparent
>>>>>>>>> and I'd like people who think this data is useful to be able
>>>>>>>>> to shout out)
>>>>>>>>>
>>>>>>>>> Erik has asked me to write an exploratory app for user-agent
>>>>>>>>> data. The idea is to enable Product Managers and engineers to
>>>>>>>>> easily explore what users use so they know what to support.
>>>>>>>>> I've thrown up an example screenshot at
>>>>>>>>> http://ironholds.org/agents_example_screen.png (I'd host it on
>>>>>>>>> Commons, inb4Dario, but I'm not sure of the copyright status
>>>>>>>>> of the UI)
>>>>>>>>>
>>>>>>>>> One side-effect of this is that we end up with files of common
>>>>>>>>> user agents, split between {readers,editors} and
>>>>>>>>> {mobile, desktop}, parsed and unparsed. I'd like to release
>>>>>>>>> these files. The reuse potential is twofold: researchers and
>>>>>>>>> engineers can use the parsed files to see what browser
>>>>>>>>> penetration looks like globally and what browsers should be
>>>>>>>>> supported at a top-10, and software engineers can use the
>>>>>>>>> unparsed files to improve detection rates.
>>>>>>>>>
>>>>>>>>> The privacy implications /should/ be minimal, because of how
>>>>>>>>> this data is gathered. The editor data is gathered from the
>>>>>>>>> checkuser table, globally, and automatically excludes any user
>>>>>>>>> agent used by fewer than 50 distinct usernames. The reader
>>>>>>>>> data is gathered from a month of 1:1000 sampled log files, and
>>>>>>>>> excludes any agent responsible for fewer than 500 pageviews in
>>>>>>>>> a 24-hour period (except, sampled. So, practically speaking,
>>>>>>>>> that's 500,000 pageviews)
>>>>>>>>>
>>>>>>>>> What do people think about making this a data release? Would
>>>>>>>>> people get value from the data, as well as the tool?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Oliver Keyes
>>>>>>>>> Research Analyst
>>>>>>>>> Wikimedia Foundation
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
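
For anyone trying to picture the thresholds discussed in the thread, here is a rough Python sketch of the editor-side rule (drop any user agent used by fewer than 50 distinct usernames) and of the 1:1000 scaling behind the "500 sampled pageviews is roughly 500,000 pageviews" remark. It is an illustration only, not the code that was actually run: the function names, the (username, user_agent) input shape, and the example rows are all assumptions made up for this sketch.

```python
from collections import defaultdict

MIN_DISTINCT_USERNAMES = 50   # editor threshold quoted in the thread
MIN_SAMPLED_PAGEVIEWS = 500   # reader threshold, counted in sampled log lines
SAMPLING_RATE = 1000          # reader logs are sampled 1:1000


def filter_editor_agents(rows, min_users=MIN_DISTINCT_USERNAMES):
    """Return {user_agent: distinct_username_count} for agents meeting the threshold.

    `rows` is an iterable of (username, user_agent) pairs; this input shape is
    hypothetical and not taken from the checkuser table schema.
    """
    users_by_agent = defaultdict(set)
    for username, user_agent in rows:
        # A set means repeat edits by the same username are counted once.
        users_by_agent[user_agent].add(username)
    return {
        agent: len(users)
        for agent, users in users_by_agent.items()
        if len(users) >= min_users  # "fewer than 50" is excluded, so 50 is kept
    }


def estimated_pageviews(sampled_count, sampling_rate=SAMPLING_RATE):
    """Scale a count from sampled logs up to an estimate of real pageviews."""
    return sampled_count * sampling_rate


if __name__ == "__main__":
    # Made-up example data: one common agent, one rare agent.
    rows = [(f"user{i}", "CommonBrowser/1.0") for i in range(60)]
    rows += [(f"user{i}", "RareBrowser/0.1") for i in range(3)]
    rows += [("user0", "CommonBrowser/1.0")] * 10  # same user, many edits
    print(filter_editor_agents(rows))
    print(estimated_pageviews(MIN_SAMPLED_PAGEVIEWS))
```

Counting distinct usernames rather than rows is what keeps one very active editor from pushing a rare agent over the line; as noted in the thread, operating over usernames globally also makes the real number of people behind a count fuzzy.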
