Thanks very much, Toby and everyone.

Ironholds, I appreciate your doing traffic research on a volunteer basis
for the benefit of the Signpost and the community. I'm concerned that the
system as a whole may need a closer look, and I'm glad that Toby will be
doing this with input from Legal.

Toby: I hope we can continue to get some Ironholds-sponsored filtering for
the Traffic Report, although we may need to get it with some additional
conditions attached.

Thanks and regards,

Pine

On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin <[email protected]> wrote:

> Folks --
>
> While I'm pleased that this validation was being done by a team member
> with full knowledge of our privacy and data retention policies, I think
> some good points have been raised that we're going to need to discuss as a
> team. I've reached out to legal for their assistance in figuring out the
> path forward.
>
> -Toby
>
> On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> I see - Oliver's Batman.  Nothing to see here, moving on.
>>
>> On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> I should also point out that "Toby not knowing who the staffer doing
>>> this one, highly specific, very minor piece of data-dogging is" does not
>>> equate to analytics not knowing who it is. I don't know what you do for a
>>> living but do you tend to give your boss's boss a constant play-by-play,
>>> or? ;p. It's documented in Trello just like everything else.
>>>
>>> On 17 October 2014 16:55, Oliver Keyes <[email protected]> wrote:
>>>
>>>> It's me. Hi! I'm sort of confused by this.
>>>>
>>>> In terms of shady back-alley data dealing, let me set out exactly what
>>>> happens.
>>>>
>>>> Every week, the Signpost emails me a list of articles that have
>>>> unexpectedly high pageview counts and would be in the top 25, but nobody
>>>> can quite work out why they're so popular. I go through the logs for the
>>>> last week (I'd be unable to do this for any queries more than a month ago
>>>> anyway, since we only keep the unsampled data for that long, but a week is
>>>> what's relevant here), and pull out a tuple of {ip,referer,user
>>>> agent,article, requests} for the articles on that list.
>>>>
>>>> These tuples, which exist exclusively on our analytics machines (not
>>>> even my personal, encrypted work laptop: they're only stored server-side,
>>>> at all steps in this) are then hand-parsed by me. Can we pin all of the
>>>> requests for [article], or at least most of them, on a single IP address,
>>>> or a single {IP,user_agent} pair? Then it's probably a spammer or a spider
>>>> or an [expletive]. No? Okay, if we sum by referer, do we see a common
>>>> referer? If so, is that an actual referer or a fly-by-night live mirror?
>>>> Questions like that.
>>>>
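The checks described above can be sketched in Python. This is a minimal
illustration, not the actual analytics tooling: the LogRow fields mirror the
{ip, referer, user agent, article, requests} tuple, while the names triage
and top_share, the labels, and the 0.5 cutoff standing in for "most of them"
are assumptions. Only a label is returned, so no IP or user agent leaves the
function.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class LogRow(NamedTuple):
    """One aggregated log tuple; the field names are illustrative."""
    ip: str
    referer: str
    user_agent: str
    article: str
    requests: int


def top_share(counter: Counter, total: int) -> float:
    """Fraction of all requests that fall in the counter's largest bucket."""
    if total == 0 or not counter:
        return 0.0
    return counter.most_common(1)[0][1] / total


def triage(rows: Iterable[LogRow], article: str, majority: float = 0.5) -> str:
    """Label one article's traffic; `majority` is an assumed cutoff."""
    by_ip, by_pair, by_referer = Counter(), Counter(), Counter()
    total = 0
    for row in rows:
        if row.article != article:
            continue
        total += row.requests
        by_ip[row.ip] += row.requests
        by_pair[(row.ip, row.user_agent)] += row.requests
        if row.referer:  # requests without a referer cannot share one
            by_referer[row.referer] += row.requests

    # Most specific pattern first: one client, then one address, then one referer.
    if top_share(by_pair, total) > majority:
        return "single IP/user-agent pair"
    if top_share(by_ip, total) > majority:
        return "single IP"
    if top_share(by_referer, total) > majority:
        return "common referer"
    return "no obvious pattern"


# Made-up rows using documentation-reserved addresses and hostnames.
sample = [
    LogRow("192.0.2.1", "", "bot/1.0", "A", 900),
    LogRow("192.0.2.2", "", "Mozilla", "A", 100),
    LogRow("198.51.100.1", "http://mirror.example/", "Mozilla", "B", 30),
    LogRow("198.51.100.2", "http://mirror.example/", "Mozilla", "B", 40),
    LogRow("198.51.100.3", "", "Mozilla", "B", 30),
    LogRow("203.0.113.1", "", "Mozilla", "C", 10),
    LogRow("203.0.113.2", "", "Mozilla", "C", 10),
    LogRow("203.0.113.3", "", "Mozilla", "C", 10),
]
for name in ("A", "B", "C"):
    print(name, triage(sample, name))
```

A "common referer" label still needs the manual follow-up of checking whether
the referer is a real site or a live mirror; the sketch only flags candidates.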
>>>> When I'm done with all of the articles, I email the Signpost with "for
>>>> article1, that looks legit. Article2 is a web crawler I'm going to email
>>>> and shout at. Article3 is a live mirror. Article4 looks legit.
>>>> Article5...". These requests are logged on our Trello board, just like any
>>>> other data request from any other party, community or staff. Milowent and
>>>> the other signposters get zero IPs, zero user agents, and nothing anywhere
>>>> near that range of information: that stuff doesn't even leave the server.
>>>> And when I'm done with it, I nuke it so it's not even *there*.
>>>>
>>>> I hope that clarifies what's happening here. If you have specific
>>>> questions about what we keep, that's obviously more of a question for
>>>> management.
>>>>
>>>> On 17 October 2014 12:27, Jonathan Morgan <[email protected]>
>>>> wrote:
>>>>
>>>>> Pine, have you considered asking Milowent who they work with on the IP
>>>>> data? I really, really doubt that there is some sort of shady back-alley
>>>>> data dealing going down here. - Jonathan
>>>>>
>>>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <[email protected]> wrote:
>>>>>
>>>>>> Thanks Toby.
>>>>>>
>>>>>> I understand that IPs are not an especially accurate way to look at
>>>>>> unique visitors, but for the purposes of the Signpost's Traffic Report
>>>>>> and the Top 25 I feel that they are a reasonable way to filter out what
>>>>>> appear to be automated requests.
>>>>>>
>>>>>> I am ok with holding those logs for 30 days, although I am a little
>>>>>> surprised to hear that this is happening. However, what worries me a bit
>>>>>> more is the idea that a staff member can be accessing those logs without
>>>>>> that access being recorded. This might be something that you wish to
>>>>>> investigate further.
>>>>>>
>>>>>> I am not interested in getting this staff person into trouble. The
>>>>>> information that they are providing is useful to the Signpost and
>>>>>> certainly seems to be sanitized to a reasonable degree. However, it does
>>>>>> concern me that they can access these logs without someone knowing about
>>>>>> it. It seems to me that this sort of activity should be proactively
>>>>>> disclosed to people in WMF who conduct legal and security reviews, and I
>>>>>> hope you will consider what sort of security features are appropriate to
>>>>>> make sure that occasions when anyone accesses the raw logs are recorded
>>>>>> in a robust manner. I worry that if this one staffer can access logs
>>>>>> without the higher-ups knowing about it, it is possible that someone who
>>>>>> intends to do unethical activities with WMF's data could also access the
>>>>>> logs without being noticed.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pine --
>>>>>>>
>>>>>>> Thanks for this -- it's a challenging topic but one that the
>>>>>>> Analytics team takes very seriously.
>>>>>>>
>>>>>>> I'm not familiar with the IP address review that's referenced in the
>>>>>>> link. I don't know who the staffer might be. We don't currently
>>>>>>> calculate unique visitors to anything in Analytics and IP address is
>>>>>>> not a particularly accurate way to assess unique visitors regardless
>>>>>>> (due to proxies/NATs/etc).
>>>>>>>
>>>>>>> We do store IPs as part of page requests in our raw logs which are
>>>>>>> deleted every 30 days. This data is kept on a system where access is
>>>>>>> limited and controlled by the operations team. We're in line with the
>>>>>>> privacy policy on this.
>>>>>>>
>>>>>>> To be clear, we are currently considering mechanisms to count unique
>>>>>>> "requests" -- we rely on Comscore for this data and for several reasons,
>>>>>>> primarily related to mobile usage, it's not sufficient to understand our
>>>>>>> usage patterns. We are putting together some proposals to do this in as
>>>>>>> limited a way as possible, and in a way that's respectful to our users.
>>>>>>> We'll share this with the community when we feel we understand the use
>>>>>>> cases and trade-offs well enough to discuss them in an informed manner.
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We do store the IP address associated with varnish requests as part
>>>>>>> of the log. This data is
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi again Analytics,
>>>>>>>>
>>>>>>>> I was under the impression that no records are kept of which IPs
>>>>>>>> access which articles on Wikipedia when no edits are made, but it
>>>>>>>> appears that such records are in fact kept [1].
>>>>>>>>
>>>>>>>> Is this proper? This practice appears to be permissible under the
>>>>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>>>>> analytics; to better personalize content, notices, and settings for
>>>>>>>> you; to fight spam, identity theft, malware, and other kinds of abuse;
>>>>>>>> and to provide better mobile and other applications."
>>>>>>>>
>>>>>>>> It is possible that this information is relevant for determining
>>>>>>>> the number of unique visitors that Wikipedia gets and that this
>>>>>>>> information is always properly filtered before it gets to the Signpost.
>>>>>>>> However, given recent discussions which I thought said that Wikipedia
>>>>>>>> was not instrumented to track unique visitors, I am surprised to learn
>>>>>>>> that this already seems to be happening and that the situation has been
>>>>>>>> this way for some time, so I would appreciate clarification.
>>>>>>>>
>>>>>>>> I want to emphasize that this question is about clarifying the
>>>>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>>>>> Signpost's access to properly filtered information if there has been a
>>>>>>>> determination that WMF's retention of the raw data is appropriate.
>>>>>>>> There might be appropriate secondary questions about making sure that
>>>>>>>> access to the raw IP access data is carefully contained and secured.
>>>>>>>>
>>>>>>>> Thank you very much,
>>>>>>>>
>>>>>>>> Pine
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&diff=629934257&oldid=629932288
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Analytics mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jonathan T. Morgan
>>>>> Learning Strategist
>>>>> Wikimedia Foundation
>>>>> User:Jmorgan (WMF)
>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>> [email protected]
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>
>