Hi Dario,

I confess I'm not personally very concerned; I just fear the nebulous
community might be ;-). We've been very good at seeing "Wikimedia
doesn't have the resources to record/analyse this information" and
hearing "Wikimedia is so privacy-driven it doesn't *want* to
record/analyse this information"...

The proposed safeguards do look sensible. Will you be releasing all
pages, or just ns0->ns0?

One data point that would also be worth generating is the number of
clickthroughs from xx.wiki to yy.wiki. Even without recording the
page titles, this could be very interesting.
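
To make the idea concrete, here is a rough sketch in Python of how such
cross-wiki counts might be derived from the referer of each request. It
is only an illustration under assumptions of mine: the input shape (pairs
of referer and requested URL), the function names, and the
".wikipedia.org" suffix filter are all hypothetical and are not the
actual log format or the parsing script mentioned below.

```python
from collections import Counter
from urllib.parse import urlsplit


def wiki_host(url, suffix=".wikipedia.org"):
    """Return the hostname of `url` if it ends with `suffix`, else None.

    The suffix filter is an assumption for illustration; it drops
    external referers such as search engines.
    """
    host = urlsplit(url).hostname
    if host and host.endswith(suffix):
        return host
    return None


def count_cross_wiki_clicks(records):
    """Count (source wiki, target wiki) pairs without keeping page titles.

    `records` is an iterable of (referer, requested_url) pairs; this
    shape is hypothetical and chosen only for the example.
    """
    counts = Counter()
    for referer, requested_url in records:
        source = wiki_host(referer)
        target = wiki_host(requested_url)
        # Skip requests with no usable referer and same-wiki navigation.
        if source is None or target is None or source == target:
            continue
        counts[(source, target)] += 1
    return counts


# Made-up sample records, not real log data.
sample = [
    ("https://en.wikipedia.org/wiki/Paris", "https://fr.wikipedia.org/wiki/Paris"),
    ("https://en.wikipedia.org/wiki/London", "https://fr.wikipedia.org/wiki/Londres"),
    ("https://en.wikipedia.org/wiki/Paris", "https://en.wikipedia.org/wiki/France"),
    ("", "https://de.wikipedia.org/wiki/Berlin"),
]
print(count_cross_wiki_clicks(sample))
```

Only the two hostnames are kept per request, so the result contains no
page titles, IP addresses or User Agents.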

Andrew.


On 13 January 2015 at 00:42, Dario Taraborelli
<[email protected]> wrote:
> Hey Andrew,
>
> that’s a great question. I asked Legal to review the implications of
> publicly releasing a snapshot of this data and I’ll post the outcome of the
> audit on this list. FWIW the data in question will be aggregated from the
> logs of raw HTTP requests that WMF passively receives. This is the same type
> of data we previously used for the presentation on readership trends the
> Analytics Team gave at Monthly Metrics in December [1]. The format of the
> logs and the data they contain is described here [2].
>
> Personally identifiable information (such as IP addresses or User Agents)
> will not be used other than for the purpose of filtering bots and automated
> requests: clickthrough data will be obtained by parsing and counting
> specific string occurrences (such as an article title) in the referer string
> of an HTTP request. In other words, we will be counting and aggregating
> occurrences of requests for article B having article A as a string in the
> referer. I’ll work with Ellery to release the code of the log parsing
> script so it can be publicly reviewed before we move forward.
>
> Hope this addresses your concerns,
>
> Dario
>
> [1]
> https://meta.wikimedia.org/w/index.php?title=File:2014_Readership_Update,_WMF_Metrics_Meeting,_December.pdf&page=10
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive
>
> On Jan 12, 2015, at 1:27 PM, Andrew Gray <[email protected]> wrote:
>
> Hi all,
>
> I'm curious about the privacy implications as well. I can't think of
> specific problems with this data, *but* it's information that I didn't
> think we'd ever been logging. We've historically been quite hands-off
> with any kind of reader information, other than raw hit counts, and
> there might well be some community discomfort at discovering it's been
> both tracked and released, even if completely anonymised.
>
> Andrew.
>
> On 12 January 2015 at 20:08, Toby Negrin <[email protected]> wrote:
>
> Thanks Amir -- feel free to have your friend reach out to this list
> directly.
>
> As Ellery said, we're figuring out if there are any privacy implications in
> releasing this dataset.
>
> -Toby
>
> On Mon, Jan 12, 2015 at 12:05 PM, Amir E. Aharoni
> <[email protected]> wrote:
>
>
> I am asking for a real-life friend who is doing some research. It's not
> for any particular project of mine, but I can easily imagine that it can be
> useful for a lot of editors and product managers as I wrote in the opening
> post.
>
> (And I cannot think of any privacy problems if the data is not tied to any
> particular people, but maybe I'm naive.)
>
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
> 2015-01-12 22:00 GMT+02:00 Toby Negrin <[email protected]>:
>
>
> Hi Amir --
>
> Would you like to see these datasets released publicly or was there a
> specific project you were interested in using them for?
>
> thanks,
>
> -Toby
>
> On Mon, Jan 12, 2015 at 5:44 AM, Amir E. Aharoni
> <[email protected]> wrote:
>
>
> Hi,
>
> Are there metrics about which links in each article are the most
> clicked?
>
> I think there's a lot to be learned from it:
> * Data-driven suggestions for the manual of style about linking (too many
> and too few links are a perennial topic of argument)
> * How people traverse between topics.
> * Which terms in the article may need a short explanation in parentheses
> rather than just a link.
> * How far down into the article people bother to read.
>
> Anyway, I think that access to such data could help optimize both
> readership and editing.
>
> And maybe this can be just taken right from the logs, without any
> additional EventLogging.
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> - Andrew Gray
>  [email protected]



-- 
- Andrew Gray
  [email protected]
