Hi Dario,

I confess I'm not personally very concerned; I just fear the nebulous community might be ;-). We've been very good at seeing "Wikimedia doesn't have the resources to record/analyse this information" and hearing "Wikimedia is so privacy-driven it doesn't *want* to record/analyse this information"...

The proposed safeguards do look sensible. Will you be releasing all pages, or just ns0->ns0? One datapoint that would also be worth generating is the number of clickthroughs from xx.wiki to yy.wiki - even without recording the page titles, this could be very interesting.

Andrew.

On 13 January 2015 at 00:42, Dario Taraborelli <[email protected]> wrote:
> Hey Andrew,
>
> that’s a great question. I asked Legal to review the implications of
> publicly releasing a snapshot of this data and I’ll post the outcome of the
> audit on this list. FWIW the data in question will be aggregated from the
> logs of raw HTTP requests that WMF passively receives. This is the same type
> of data we previously used for the presentation on readership trends the
> Analytics Team gave at Monthly Metrics in December [1]. The format of the
> logs and the data they contain is described here [2].
>
> Personally identifiable information (such as IP addresses or User Agents)
> will not be used other than for the purpose of filtering bots and automated
> requests: clickthrough data will be obtained by parsing and counting
> specific string occurrences (such as an article title) in the referer string
> of an HTTP request. In other words, we will be counting and aggregating
> occurrences of requests for article B having article A as a string in the
> referer. I’ll work with Ellery to release the code of the log parsing
> script so it can be publicly reviewed before we move forward.
>
> Hope this addresses your concerns,
>
> Dario
>
> [1]
> https://meta.wikimedia.org/w/index.php?title=File:2014_Readership_Update,_WMF_Metrics_Meeting,_December.pdf&page=10
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive
>
> On Jan 12, 2015, at 1:27 PM, Andrew Gray <[email protected]> wrote:
>
> Hi all,
>
> I'm curious about the privacy implications as well. I can't think of
> specific problems with this data, *but* it's information that I didn't
> think we'd ever been logging. We've historically been quite hands-off
> with any kind of reader information, other than raw hit counts, and
> there might well be some community discomfort at discovering it's been
> both tracked and released, even if completely anonymised.
>
> Andrew.
>
> On 12 January 2015 at 20:08, Toby Negrin <[email protected]> wrote:
>
> Thanks Amir -- feel free to have your friend reach out to this list
> directly.
>
> As Ellery said, we're figuring out if there are any privacy implications in
> releasing this dataset.
>
> -Toby
>
> On Mon, Jan 12, 2015 at 12:05 PM, Amir E. Aharoni
> <[email protected]> wrote:
>
> I am asking for a real-life friend who is doing some research. It's not
> for any particular project of mine, but I can easily imagine that it can be
> useful for a lot of editors and product managers as I wrote in the opening
> post.
>
> (And I cannot think of any privacy problems if the data is not tied to any
> particular people, but maybe I'm naive.)
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> “We're living in pieces,
> I want to live in peace.” – T. Moore
>
> 2015-01-12 22:00 GMT+02:00 Toby Negrin <[email protected]>:
>
> Hi Amir --
>
> Would you like to see these datasets released publicly or was there a
> specific project you were interested in using them for?
>
> thanks,
>
> -Toby
>
> On Mon, Jan 12, 2015 at 5:44 AM, Amir E. Aharoni
> <[email protected]> wrote:
>
> Hi,
>
> Are there metrics about which links in each article are the most
> clicked?
>
> I can think there's a lot to be learned from it:
> * Data-driven suggestions for manual of style about linking (too many
> and too few links are a perennial topic of argument)
> * How do people traverse between topics.
> * Which terms in the article may need a short explanation in parentheses
> rather than just a link.
> * How far down into the article do people bother to read.
>
> Anyway, I can think that accessibility to such data can optimize both
> readership and editing.
>
> And maybe this can be just taken right from the logs, without any
> additional EventLogging.
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> “We're living in pieces,
> I want to live in peace.” – T. Moore
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> - Andrew Gray
> [email protected]

--
- Andrew Gray
[email protected]
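
For readers following the thread, the counting approach Dario describes (requests for article B whose referer contains article A) might look roughly like the sketch below. It is only an illustration, not the WMF log parsing script: the input shape (pairs of requested URL and referer URL), the `/wiki/` URL pattern, and the names `article_of` and `count_clickthroughs` are all assumptions made for this example. It skips bot filtering and namespace handling, and it also tallies the wiki-to-wiki counts suggested above, which need no page titles.

```python
from collections import Counter
from urllib.parse import urlsplit, unquote

# Assumption: article URLs look like https://xx.wikipedia.org/wiki/Title.
WIKI_PATH_PREFIX = "/wiki/"


def article_of(url):
    """Return (host, title) for an article URL, or None if it is not one."""
    parts = urlsplit(url)
    if not parts.hostname or not parts.path.startswith(WIKI_PATH_PREFIX):
        return None
    title = unquote(parts.path[len(WIKI_PATH_PREFIX):])
    if not title:
        return None
    return parts.hostname, title


def count_clickthroughs(requests):
    """Aggregate clickthrough counts from (requested_url, referer_url) pairs.

    Hypothetical input format; the real request logs are structured
    differently. No IP address or User Agent is read or stored here.
    """
    article_pairs = Counter()  # ((host, A), (host, B)) -> count
    wiki_pairs = Counter()     # (source host, target host) -> count
    for requested_url, referer_url in requests:
        target = article_of(requested_url)
        source = article_of(referer_url or "")
        if target is None or source is None:
            continue  # missing, external, or non-article referer
        article_pairs[(source, target)] += 1
        wiki_pairs[(source[0], target[0])] += 1
    return article_pairs, wiki_pairs


if __name__ == "__main__":
    sample = [
        ("https://en.wikipedia.org/wiki/B", "https://en.wikipedia.org/wiki/A"),
        ("https://en.wikipedia.org/wiki/B", "https://en.wikipedia.org/wiki/A"),
        ("https://de.wikipedia.org/wiki/C", "https://en.wikipedia.org/wiki/A"),
        ("https://en.wikipedia.org/wiki/B", "-"),
    ]
    article_pairs, wiki_pairs = count_clickthroughs(sample)
    for pair, n in article_pairs.most_common():
        print(pair, n)
    for pair, n in wiki_pairs.most_common():
        print(pair, n)
```

Only aggregated pair counts leave the function, which is the property the privacy discussion in this thread turns on. Whether low-count pairs would need to be suppressed before release is a separate question this sketch does not address.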
