+1

On Mon, Oct 20, 2014 at 7:15 AM, Oliver Keyes <[email protected]> wrote:
> Sorry, but no; what "additional conditions attached"? We're *not giving them any information* except for a boolean "this looks like illegitimate traffic, this one is legitimate or we can't tell" and a wild stab at what kind of illegitimate traffic it might be.
>
> Please bear in mind that what you're essentially saying - or, how it's coming off - is that there is some shady, undocumented, privacy-policy-thorny thing going on here. That's a pretty big statement to make about the activities of a researcher. If you think you can substantiate it: tell me what conditions you might attach to the aforementioned information? Better yet, what information do you think is being transmitted? If you don't think you can substantiate it, don't say it.
>
> Again, I'm sorry to be blunt. But to me this is kind of a big deal. If I've screwed up in some way I'd like you to stop talking in subtext and tell me how you think I have. Because at the moment I'm not entirely sure what I'm meant to be clarifying. But if I haven't, this sort of discussion can have a big impact on someone's reputation, and I'd like to clear it up.
>
> On 19 October 2014 03:24, Pine W <[email protected]> wrote:
>
>> Thanks very much, Toby and everyone.
>>
>> Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned that the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.
>>
>> Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.
>>
>> Thanks and regards,
>>
>> Pine
>>
>> On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin <[email protected]> wrote:
>>
>>> Folks --
>>>
>>> While I'm pleased that this validation was being done by a team member with full knowledge of our privacy and data retention policies, I think some good points have been raised that we're going to need to discuss as a team. I've reached out to legal for their assistance in figuring out the path forward.
>>>
>>> -Toby
>>>
>>> On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu <[email protected]> wrote:
>>>
>>>> I see - Oliver's batman. Nothing to see here, moving on.
>>>>
>>>> On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <[email protected]> wrote:
>>>>
>>>>> I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.
>>>>>
>>>>> On 17 October 2014 16:55, Oliver Keyes <[email protected]> wrote:
>>>>>
>>>>>> It's me. Hi! I'm sort of confused by this.
>>>>>>
>>>>>> In terms of shady back-alley data dealing, let me set out exactly what happens.
>>>>>>
>>>>>> Every week, the signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article, requests} for the articles on that list.
>>>>>>
>>>>>> These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are then hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.
>>>>>>
>>>>>> When I'm done with all of the articles, I email the signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our trello board, just like any other data request from any other party, community or staff. Milowent and the other signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.
>>>>>>
>>>>>> I hope that clarifies what's happening here. If you have specific questions about what we keep that's obviously more of a question for management.
>>>>>>
>>>>>> On 17 October 2014 12:27, Jonathan Morgan <[email protected]> wrote:
>>>>>>
>>>>>>> Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here. - Jonathan
>>>>>>>
>>>>>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Toby.
>>>>>>>>
>>>>>>>> I understand that IPs are not an especially accurate way to look at unique visitors, but for the purposes of the Signpost's traffic report and the Top 25 I feel that they are reasonable approximations of ways to filter out what appear to be automated requests.
>>>>>>>>
>>>>>>>> I am ok with holding those logs for 30 days, although I am a little surprised to hear that this is happening. However, what worries me a bit more is the idea that a staff member can be accessing those logs without that access being recorded. This might be something that you wish to investigate further.
>>>>>>>>
>>>>>>>> I am not interested in getting this staff person into trouble. The information that they are providing is useful to the Signpost and certainly seems to be sanitized to a reasonable degree. However, it does concern me that they can access these logs without someone knowing about it; it seems to me that this sort of activity should be proactively disclosed to people in WMF who conduct legal and security reviews, and I hope you will consider what sort of security features are appropriate to make sure that occasions when anyone accesses the raw logs are recorded in a robust manner. I worry that if this one staffer can access logs without the higher-ups knowing about it, it is possible that someone who intends to do unethical activities with WMF's data could also access the logs without being noticed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Pine
>>>>>>>>
>>>>>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Pine --
>>>>>>>>>
>>>>>>>>> Thanks for this -- it's a challenging topic but one that the Analytics team takes very seriously.
>>>>>>>>>
>>>>>>>>> I'm not familiar with the IP address review that's referenced in the link. I don't know who the staffer might be. We don't currently calculate unique visitors to anything in Analytics and IP address is not a particularly accurate way to assess unique visitors regardless (due to proxies/NATs/etc).
>>>>>>>>>
>>>>>>>>> We do store IPs as part of page requests in our raw logs which are deleted every 30 days. This data is kept on a system where access is limited and controlled by the operations team. We're in line with the privacy policy on this.
>>>>>>>>>
>>>>>>>>> To be clear, we are currently considering mechanisms to count unique "requests" -- we rely on Comscore for this data and for several reasons, primarily related to mobile usage, it's not sufficient to understand our usage patterns. We are putting together some proposals to do this in as limited a way as possible and that's respectful to our users. We'll share this with the community when we feel we understand the use cases and trade-offs well enough to discuss in an informed manner.
>>>>>>>>>
>>>>>>>>> -Toby
>>>>>>>>>
>>>>>>>>> We do store the IP address associated with varnish requests as part of the log. This data is
>>>>>>>>>
>>>>>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi again Analytics,
>>>>>>>>>>
>>>>>>>>>> I was under the impression that no records are kept of which IPs access which articles on Wikipedia when no edits are made, but it appears that such records are in fact kept [1].
>>>>>>>>>>
>>>>>>>>>> Is this proper? This practice appears to be permissible under the Privacy Policy which states that "We use IP addresses for research and analytics; to better personalize content, notices, and settings for you; to fight spam, identity theft, malware, and other kinds of abuse; and to provide better mobile and other applications."
>>>>>>>>>>
>>>>>>>>>> It is possible that this information is relevant for determining the number of unique visitors that Wikipedia gets and that this information is always properly filtered before it gets to the Signpost. However, given recent discussions which I thought said that Wikipedia was not instrumented to track unique visitors, I am surprised to learn that this already seems to be happening and that the situation has been this way for some time, so I would appreciate clarification.
>>>>>>>>>>
>>>>>>>>>> I want to emphasize that this question is about clarifying the practice of tracking likely unique visitors by IP. This question is not intended to start flame wars, get people into trouble, or limit the Signpost's access to properly filtered information if there has been a determination that WMF's retention of the raw data is appropriate.
>>>>>>>>>> There might be appropriate secondary questions about making sure that access to the raw IP access data is carefully contained and secured.
>>>>>>>>>>
>>>>>>>>>> Thank you very much,
>>>>>>>>>>
>>>>>>>>>> Pine
>>>>>>>>>>
>>>>>>>>>> [1] https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&diff=629934257&oldid=629932288
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Analytics mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>> --
>>>>>>> Jonathan T. Morgan
>>>>>>> Learning Strategist
>>>>>>> Wikimedia Foundation
>>>>>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>>>> [email protected]
>>>>>>
>>>>>> --
>>>>>> Oliver Keyes
>>>>>> Research Analyst
>>>>>> Wikimedia Foundation
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation

--
Jonathan T. Morgan
Learning Strategist
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
[email protected]
