+1

On Mon, Oct 20, 2014 at 7:15 AM, Oliver Keyes <[email protected]> wrote:

> Sorry, but no; what "additional conditions attached"? We're *not giving
> them any information* except for a boolean "this looks like illegitimate
> traffic, this one is legitimate or we can't tell" and a wild stab at what
> kind of illegitimate traffic it might be.
>
> Please bear in mind that what you're essentially saying - or, how it's
> coming off - is that there is some shady, undocumented,
> privacy-policy-thorny thing going on here. That's a pretty big statement to
> make about the activities of a researcher. If you think you can
> substantiate it: tell me what conditions you might attach to the
> aforementioned information? Better yet, what information do you think is
> being transmitted? If you don't think you can substantiate it, don't say it.
>
> Again, I'm sorry to be blunt. But to me this is kind of a big deal. If
> I've screwed up in some way I'd like you to stop talking in subtext and
> tell me how you think I have. Because at the moment I'm not entirely sure
> what I'm meant to be clarifying. But if I haven't, this sort of discussion
> can have a big impact on someone's reputation, and I'd like to clear it up.
>
> On 19 October 2014 03:24, Pine W <[email protected]> wrote:
>
>> Thanks very much, Toby and everyone.
>>
>> Ironholds, I appreciate your doing traffic research on a volunteer basis
>> for the benefit of the Signpost and the community. I'm concerned that the
>> system as a whole may need a closer look, and I'm glad that Toby will be
>> doing this with input from Legal.
>>
>> Toby: I hope we can continue to get some Ironholds-sponsored filtering
>> for the Traffic Report, although we may need to get it with some additional
>> conditions attached.
>>
>> Thanks and regards,
>>
>> Pine
>>
>> On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin <[email protected]>
>> wrote:
>>
>>> Folks --
>>>
>>> While I'm pleased that this validation was being done by a team member
>>> with full knowledge of our privacy and data retention policies, I think
>>> some good points have been raised that we're going to need to discuss as a
>>> team. I've reached out to legal for their assistance in figuring out the
>>> path forward.
>>>
>>> -Toby
>>>
>>> On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu <[email protected]> wrote:
>>>
>>>> I see - Oliver's Batman.  Nothing to see here, moving on.
>>>>
>>>> On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <[email protected]>
>>>> wrote:
>>>>
>>>>> I should also point out that "Toby not knowing who the staffer doing
>>>>> this one, highly specific, very minor piece of data-dogging is" does not
>>>>> equate to analytics not knowing who it is. I don't know what you do for a
>>>>> living but do you tend to give your boss's boss a constant play-by-play,
>>>>> or? ;p. It's documented in Trello just like everything else.
>>>>>
>>>>> On 17 October 2014 16:55, Oliver Keyes <[email protected]> wrote:
>>>>>
>>>>>> It's me. Hi! I'm sort of confused by this.
>>>>>>
>>>>>> In terms of shady back-alley data dealing, let me set out exactly
>>>>>> what happens.
>>>>>>
>>>>>> Every week, the Signpost emails me a list of articles that have
>>>>>> unexpectedly high pageview counts and would be in the top 25, but nobody
>>>>>> can quite work out why they're so popular. I go through the logs for the
>>>>>> last week (I'd be unable to do this for any queries more than a month ago
>>>>>> anyway, since we only keep the unsampled data for that long, but a week is
>>>>>> what's relevant here), and pull out a tuple of {ip, referer, user agent,
>>>>>> article, requests} for the articles on that list.
>>>>>>
>>>>>> These tuples, which exist exclusively on our analytics machines (not
>>>>>> even my personal, encrypted work laptop: they're only stored server-side,
>>>>>> at all steps in this) are then hand-parsed by me. Can we pin all of the
>>>>>> requests for [article], or at least most of them, on a single IP address,
>>>>>> or a single {IP,user_agent} pair? Then it's probably a spammer or a spider
>>>>>> or an [expletive]. No? Okay, if we sum by referer, do we see a common
>>>>>> referer? If so, is that an actual referer or a fly-by-night live mirror?
>>>>>> Questions like that.
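
The hand-parsing described above can be sketched roughly as follows. This is
only an illustration, not the script actually used: the Row type, the triage
function, the label strings and the 0.5 threshold for "most of them" are all
assumptions, since the thread gives no code and no numeric cut-off. Only the
per-article label is returned, so no IPs or user agents leave the function.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class Row(NamedTuple):
    """One {ip, referer, user agent, article, requests} tuple (hypothetical type)."""
    ip: str
    referer: str
    user_agent: str
    article: str
    requests: int


def top_share(counts: Counter, total: int) -> float:
    """Share of `total` taken by the largest entry in `counts` (0.0 if empty)."""
    if not counts or total == 0:
        return 0.0
    return counts.most_common(1)[0][1] / total


def triage(rows: Iterable[Row], threshold: float = 0.5) -> Dict[str, str]:
    """Label each article by whether a single source dominates its requests.

    `threshold` is an assumed stand-in for "at least most of them".
    """
    by_article = defaultdict(list)
    for row in rows:
        by_article[row.article].append(row)

    labels = {}
    for article, group in by_article.items():
        total = sum(r.requests for r in group)
        by_ip, by_pair, by_referer = Counter(), Counter(), Counter()
        for r in group:
            by_ip[r.ip] += r.requests
            by_pair[(r.ip, r.user_agent)] += r.requests
            if r.referer:  # an empty referer is not a "common referer"
                by_referer[r.referer] += r.requests

        if top_share(by_pair, total) > threshold:
            labels[article] = "single {IP,user_agent} pair: probably automated"
        elif top_share(by_ip, total) > threshold:
            labels[article] = "single IP: probably automated"
        elif top_share(by_referer, total) > threshold:
            labels[article] = "common referer: check for a live mirror"
        else:
            labels[article] = "looks legitimate, or cannot tell"
    return labels
```

A human still has to decide whether a dominant referer is a real site or a
live mirror; the sketch only flags which articles deserve that second look.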
>>>>>>
>>>>>> When I'm done with all of the articles, I email the Signpost with
>>>>>> "for article1, that looks legit. Article2 is a web crawler I'm going to
>>>>>> email and shout at. Article3 is a live mirror. Article4 looks legit.
>>>>>> Article5...". These requests are logged on our Trello board, just like any
>>>>>> other data request from any other party, community or staff. Milowent and
>>>>>> the other Signposters get zero IPs, zero user agents, and nothing anywhere
>>>>>> near that range of information: that stuff doesn't even leave the server.
>>>>>> And when I'm done with it, I nuke it so it's not even *there*.
>>>>>>
>>>>>> I hope that clarifies what's happening here. If you have specific
>>>>>> questions about what we keep that's obviously more of a question for
>>>>>> management.
>>>>>>
>>>>>> On 17 October 2014 12:27, Jonathan Morgan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Pine, have you considered asking Milowent who they work with on the
>>>>>>> IP data? I really, really doubt that there is some sort of shady
>>>>>>> back-alley data dealing going down here. - Jonathan
>>>>>>>
>>>>>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Toby.
>>>>>>>>
>>>>>>>> I understand that IPs are not an especially accurate way to look at
>>>>>>>> unique visitors, but for the purposes of the Signpost's traffic report
>>>>>>>> and the Top 25 I feel that they are reasonable approximations of ways to
>>>>>>>> filter out what appear to be automated requests.
>>>>>>>>
>>>>>>>> I am ok with holding those logs for 30 days, although I am a little
>>>>>>>> surprised to hear that this is happening. However, what worries me a bit
>>>>>>>> more is the idea that a staff member can be accessing those logs without
>>>>>>>> that access being recorded. This might be something that you wish to
>>>>>>>> investigate further.
>>>>>>>>
>>>>>>>> I am not interested in getting this staff person into trouble. The
>>>>>>>> information that they are providing is useful to the Signpost and
>>>>>>>> certainly seems to be sanitized to a reasonable degree. However, it does
>>>>>>>> concern me that they can access these logs without someone knowing about
>>>>>>>> it. It seems to me that this sort of activity should be proactively
>>>>>>>> disclosed to people in WMF who conduct legal and security reviews, and I
>>>>>>>> hope you will consider what sort of security features are appropriate to
>>>>>>>> make sure that occasions when anyone accesses the raw logs are recorded in
>>>>>>>> a robust manner. I worry that if this one staffer can access logs without
>>>>>>>> the higher-ups knowing about it, it is possible that someone who intends
>>>>>>>> to do unethical activities with WMF's data could also access the logs
>>>>>>>> without being noticed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Pine
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Pine --
>>>>>>>>>
>>>>>>>>> Thanks for this -- it's a challenging topic but one that the
>>>>>>>>> Analytics team takes very seriously.
>>>>>>>>>
>>>>>>>>> I'm not familiar with the IP address review that's referenced in
>>>>>>>>> the link. I don't know who the staffer might be. We don't currently
>>>>>>>>> calculate unique visitors to anything in Analytics and IP address is not
>>>>>>>>> a particularly accurate way to assess unique visitors regardless (due to
>>>>>>>>> proxies/NATs/etc).
>>>>>>>>>
>>>>>>>>> We do store IPs as part of page requests in our raw logs which are
>>>>>>>>> deleted every 30 days. This data is kept on a system where access is
>>>>>>>>> limited and controlled by the operations team. We're in line with the
>>>>>>>>> privacy policy on this.
>>>>>>>>>
>>>>>>>>> To be clear, we are currently considering mechanisms to count
>>>>>>>>> unique "requests" -- we rely on Comscore for this data and for several
>>>>>>>>> reasons, primarily related to mobile usage, it's not sufficient to
>>>>>>>>> understand our usage patterns. We are putting together some proposals to
>>>>>>>>> do this in as limited a way as possible and that's respectful to our
>>>>>>>>> users. We'll share this with the community when we feel we understand the
>>>>>>>>> use cases and trade-offs well enough to discuss in an informed manner.
>>>>>>>>>
>>>>>>>>> -Toby
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We do store the IP address associated with varnish requests as
>>>>>>>>> part of the log. This data is
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi again Analytics,
>>>>>>>>>>
>>>>>>>>>> I was under the impression that no records are kept of which IPs
>>>>>>>>>> access which articles on Wikipedia when no edits are made, but it
>>>>>>>>>> appears that such records are in fact kept [1].
>>>>>>>>>>
>>>>>>>>>> Is this proper? This practice appears to be permissible under the
>>>>>>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>>>>>>> analytics; to better personalize content, notices, and settings for you;
>>>>>>>>>> to fight spam, identity theft, malware, and other kinds of abuse; and to
>>>>>>>>>> provide better mobile and other applications."
>>>>>>>>>>
>>>>>>>>>> It is possible that this information is relevant for determining
>>>>>>>>>> the number of unique visitors that Wikipedia gets and that this
>>>>>>>>>> information is always properly filtered before it gets to the Signpost.
>>>>>>>>>> However, given recent discussions which I thought said that Wikipedia was
>>>>>>>>>> not instrumented to track unique visitors, I am surprised to learn that
>>>>>>>>>> this already seems to be happening and that the situation has been this
>>>>>>>>>> way for some time, so I would appreciate clarification.
>>>>>>>>>>
>>>>>>>>>> I want to emphasize that this question is about clarifying the
>>>>>>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>>>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>>>>>>> Signpost's access to properly filtered information if there has been a
>>>>>>>>>> determination that WMF's retention of the raw data is appropriate. There
>>>>>>>>>> might be appropriate secondary questions about making sure that access to
>>>>>>>>>> the raw IP access data is carefully contained and secured.
>>>>>>>>>>
>>>>>>>>>> Thank you very much,
>>>>>>>>>>
>>>>>>>>>> Pine
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&diff=629934257&oldid=629932288
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Analytics mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jonathan T. Morgan
>>>>>>> Learning Strategist
>>>>>>> Wikimedia Foundation
>>>>>>> User:Jmorgan (WMF)
>>>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Oliver Keyes
>>>>>> Research Analyst
>>>>>> Wikimedia Foundation
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
>
>


-- 
Jonathan T. Morgan
Learning Strategist
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
[email protected]
