Re: [Analytics] [Discussion] User agent data releases

Dario Taraborelli Thu, 05 Mar 2015 15:34:15 -0800

heads up that after a review with Legal we decided that we should not release 
the sampled raw dataset. Oliver is now working on making parsed UA data 
available.


> On Mar 5, 2015, at 10:52 AM, Oliver Keyes <[email protected]> wrote:
> 
> Just a clarifying note: Dario still needs to review the actual
> methodology. While Legal have approved it from their end, they've also
> made clear that this is contingent on the anonymisation methodology
> passing muster from an R&D point of view.
> 
> On 5 March 2015 at 12:39, Oliver Keyes <[email protected]> wrote:
>> Just an FYI that Legal have approved this release under the
>> anonymisation procedures we've set out (thanks Michelle!) on the
>> condition that Dario, too, is comfortable with them. Dario?
>> 
>> On 4 March 2015 at 17:16, Oliver Keyes <[email protected]> wrote:
>>> So it's distinct people, globally - and I deliberately made it wooly
>>> by operating over username, which means the threshold is fuzzy
>>> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>>> 
>>> It's very deliberately dimension-free: user_agent,
>>> edit_count_in_non_specified_90_day_period, and that's it.
>>> 
>>> On 4 March 2015 at 17:12, Aaron Halfaker <[email protected]> wrote:
>>>> Assuming this was public, I could use this data on seldom edited Wikis to
>>>> find out which editors likely have old browser/OS versions with
>>>> vulnerabilities that I could attack[1].  This would be easier and easier
>>>> the more dimensions you add to the data.
>>>> 
>>>> <re-reads>
>>>> 
>>>> OK.  The anonymization strategy for dropping records that represent < 50
>>>> distinct editors seems to address this concern.   50 edits is a lot.  So
>>>> this data wouldn't be too terribly useful for under-active wikis.  Then
>>>> again, if you just want a sense for what the dominant browser/OS pairs
>>>> are, then they will likely represent > 50 unique editors on most projects.
>>>> 
>>>> 1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
>>>> implications of that one.
>>>> 
>>>> On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes <[email protected]> wrote:
>>>>> 
>>>>> Yeah, makes sense.
>>>>> 
>>>>> On 3 March 2015 at 20:38, Nuria Ruiz <[email protected]> wrote:
>>>>>>> Agreed. Do we have a way of syncing files to Labs yet?
>>>>>> No need to sync if file is available in an endpoint like
>>>>>> http://some-data-here
>>>>>> 
>>>>>> On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes <[email protected]>
>>>>>> wrote:
>>>>>>> 
>>>>>>> On 3 March 2015 at 19:35, Nuria Ruiz <[email protected]> wrote:
>>>>>>>>> Erik has asked me to write an exploratory app for user-agent data. The
>>>>>>>>> idea is to enable Product Managers and engineers to easily explore
>>>>>>>>> what users use so they know what to support. I've thrown up an example
>>>>>>>>> screenshot at http://ironholds.org/agents_example_screen.png
>>>>>>>> 
>>>>>>>> I cannot speak to the community's interest in this data, but for
>>>>>>>> developers and PMs we should make sure we have a solid way to update any
>>>>>>>> data we put up. User agent data is outdated as soon as a new version of
>>>>>>>> Android or iOS is released, a new popular phone comes along, or a
>>>>>>>> popular browser ships a new autoupdate. Not only that, if we make
>>>>>>>> changes to, say, redirect all iPad users to the desktop site, we want to
>>>>>>>> assess the effect of those changes as soon as possible. A monthly update
>>>>>>>> will be a must. Also, distinguishing between browser percentages on the
>>>>>>>> desktop site versus the mobile site versus apps is a must for this data
>>>>>>>> to be really useful for PMs and developers (especially for bug triage).
>>>>>>>> 
>>>>>>> 
>>>>>>> Yes! However, I am addressing a specific ad-hoc request. If there is a
>>>>>>> need for this (I agree there is) I hope Toby and Kevin can eke out the
>>>>>>> time on the Analytics Engineering schedule to work on it; y'all are a
>>>>>>> lot better at infrastructure work than me :).
>>>>>>> 
>>>>>>>> 
>>>>>>>> We have a couple of backlog items to make monthly reports in this regard.
>>>>>>>> A UI on top of them will be superb.
>>>>>>>> 
>>>>>>> 
>>>>>>> Agreed. Do we have a way of syncing files to Labs yet? That's the
>>>>>>> biggest blocker. The UI doesn't care what the file contains as long as
>>>>>>> it's a TSV with a header row - I've deliberately built it so that
>>>>>>> things like the download links are dynamic and can change.
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes <[email protected]>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hey all,
>>>>>>>>> 
>>>>>>>>> (Sending this to the public list because it's more transparent and I'd
>>>>>>>>> like people who think this data is useful to be able to shout out)
>>>>>>>>> 
>>>>>>>>> Erik has asked me to write an exploratory app for user-agent data. The
>>>>>>>>> idea is to enable Product Managers and engineers to easily explore
>>>>>>>>> what users use so they know what to support. I've thrown up an example
>>>>>>>>> screenshot at http://ironholds.org/agents_example_screen.png (I'd
>>>>>>>>> host it on Commons, inb4Dario, but I'm not sure of the copyright status
>>>>>>>>> of the UI)
>>>>>>>>> 
>>>>>>>>> One side-effect of this is that we end up with files of common user
>>>>>>>>> agents, split between {readers, editors} and {mobile, desktop}, parsed
>>>>>>>>> and unparsed. I'd like to release these files. The reuse potential is
>>>>>>>>> twofold: researchers and engineers can use the parsed files to see
>>>>>>>>> what browser penetration looks like globally and what browsers should
>>>>>>>>> be supported at a top-10 level, and software engineers can use the
>>>>>>>>> unparsed files to improve detection rates.
>>>>>>>>> 
>>>>>>>>> The privacy implications /should/ be minimal, because of how this data
>>>>>>>>> is gathered. The editor data is gathered from the checkuser table,
>>>>>>>>> globally, and automatically excludes any user agent used by fewer than
>>>>>>>>> 50 distinct usernames. The reader data is gathered from a month of
>>>>>>>>> 1:1000 sampled log files, and excludes any agent responsible for fewer
>>>>>>>>> than 500 pageviews in a 24 hour period (in the sampled logs, that is.
>>>>>>>>> So, practically speaking, that's 500,000 pageviews)
>>>>>>>>> 
>>>>>>>>> What do people think about making this a data release? Would people
>>>>>>>>> get value from the data, as well as the tool?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Oliver Keyes
>>>>>>>>> Research Analyst
>>>>>>>>> Wikimedia Foundation
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
