Thanks very much, Toby and everyone. Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned that the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.
Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.

Thanks and regards,

Pine

On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin <[email protected]> wrote:

> Folks --
>
> While I'm pleased that this validation was being done by a team member
> with full knowledge of our privacy and data retention policies, I think
> some good points have been raised that we're going to need to discuss as a
> team. I've reached out to legal for their assistance in figuring out the
> path forward.
>
> -Toby
>
> On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> I see - Oliver's batman. Nothing to see here, moving on.
>>
>> On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> I should also point out that "Toby not knowing who the staffer doing
>>> this one, highly specific, very minor piece of data-dogging is" does not
>>> equate to analytics not knowing who it is. I don't know what you do for a
>>> living but do you tend to give your boss's boss a constant play-by-play,
>>> or? ;p. It's documented in Trello just like everything else.
>>>
>>> On 17 October 2014 16:55, Oliver Keyes <[email protected]> wrote:
>>>
>>>> It's me. Hi! I'm sort of confused by this.
>>>>
>>>> In terms of shady back-alley data dealing, let me set out exactly what
>>>> happens.
>>>>
>>>> Every week, the signpost emails me a list of articles that have
>>>> unexpectedly high pageview counts and would be in the top 25, but nobody
>>>> can quite work out why they're so popular. I go through the logs for the
>>>> last week (I'd be unable to do this for any queries more than a month ago
>>>> anyway, since we only keep the unsampled data for that long, but a week is
>>>> what's relevant here), and pull out a tuple of {ip,referer,user
>>>> agent,article, requests} for the articles on that list.
>>>>
>>>> These tuples, which exist exclusively on our analytics machines (not
>>>> even my personal, encrypted work laptop: they're only stored server-side,
>>>> at all steps in this) are then hand-parsed by me. Can we pin all of the
>>>> requests for [article], or at least most of them, on a single IP address,
>>>> or a single {IP,user_agent} pair? Then it's probably a spammer or a spider
>>>> or an [expletive]. No? Okay, if we sum by referer, do we see a common
>>>> referer? If so, is that an actual referer or a fly-by-night live mirror?
>>>> Questions like that.
>>>>
>>>> When I'm done with all of the articles, I email the signpost with "for
>>>> article1, that looks legit. Article2 is a web crawler I'm going to email
>>>> and shout at. Article3 is a live mirror. Article4 looks legit.
>>>> Article5...". These requests are logged on our trello board, just like any
>>>> other data request from any other party, community or staff. Milowent and
>>>> the other signposters get zero IPs, zero user agents, and nothing anywhere
>>>> near that range of information: that stuff doesn't even leave the server.
>>>> And when I'm done with it, I nuke it so it's not even *there*.
>>>>
>>>> I hope that clarifies what's happening here. If you have specific
>>>> questions about what we keep that's obviously more of a question for
>>>> management.
>>>>
>>>> On 17 October 2014 12:27, Jonathan Morgan <[email protected]>
>>>> wrote:
>>>>
>>>>> Pine, have you considered asking Milowent who they work with on the IP
>>>>> data? I really, really doubt that there is some sort of shady back-alley
>>>>> data dealing going down here. - Jonathan
>>>>>
>>>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <[email protected]> wrote:
>>>>>
>>>>>> Thanks Toby.
>>>>>>
>>>>>> I understand that IPs are not an especially accurate way to look at
>>>>>> unique visitors, but for the purposes of the Signpost's traffic report
>>>>>> and the Top 25 I feel that they are reasonable approximations of ways
>>>>>> to filter out what appear to be automated requests.
>>>>>>
>>>>>> I am ok with holding those logs for 30 days, although I am a little
>>>>>> surprised to hear that this is happening. However, what worries me a bit
>>>>>> more is the idea that a staff member can be accessing those logs without
>>>>>> that access being recorded. This might be something that you wish to
>>>>>> investigate further.
>>>>>>
>>>>>> I am not interested in getting this staff person into trouble. The
>>>>>> information that they are providing is useful to the Signpost and
>>>>>> certainly seems to be sanitized to a reasonable degree. However, it does
>>>>>> concern me that they can access these logs without someone knowing about
>>>>>> it; it seems to me that this sort of activity should be proactively
>>>>>> disclosed to people in WMF who conduct legal and security reviews, and I
>>>>>> hope you will consider what sort of security features are appropriate to
>>>>>> make sure that occasions when anyone accesses the raw logs are recorded
>>>>>> in a robust manner. I worry that if this one staffer can access logs
>>>>>> without the higher-ups knowing about it, it is possible that someone who
>>>>>> intends to do unethical activities with WMF's data could also access the
>>>>>> logs without being noticed.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pine --
>>>>>>>
>>>>>>> Thanks for this -- it's a challenging topic but one that the
>>>>>>> Analytics team takes very seriously.
>>>>>>>
>>>>>>> I'm not familiar with the IP address review that's referenced in the
>>>>>>> link.
>>>>>>> I don't know who the staffer might be. We don't currently calculate
>>>>>>> unique visitors to anything in Analytics and IP address is not a
>>>>>>> particularly accurate way to assess unique visitors regardless (due to
>>>>>>> proxies/NATs/etc).
>>>>>>>
>>>>>>> We do store IPs as part of page requests in our raw logs which are
>>>>>>> deleted every 30 days. This data is kept on a system where access is
>>>>>>> limited and controlled by the operations team. We're in line with the
>>>>>>> privacy policy on this.
>>>>>>>
>>>>>>> To be clear, we are currently considering mechanisms to count unique
>>>>>>> "requests" -- we rely on Comscore for this data and for several reasons,
>>>>>>> primarily related to mobile usage, it's not sufficient to understand our
>>>>>>> usage patterns. We are putting together some proposals to do this in as
>>>>>>> limited a way as possible and that's respectful to our users. We'll share
>>>>>>> this with the community when we feel we understand the use cases and
>>>>>>> trade-offs well enough to discuss in an informed manner.
>>>>>>>
>>>>>>> -Toby
>>>>>>>
>>>>>>> We do store the IP address associated with varnish requests as part
>>>>>>> of the log. This data is
>>>>>>>
>>>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi again Analytics,
>>>>>>>>
>>>>>>>> I was under the impression that no records are kept of which IPs
>>>>>>>> access which articles on Wikipedia when no edits are made, but it
>>>>>>>> appears that such records are in fact kept [1].
>>>>>>>>
>>>>>>>> Is this proper? This practice appears to be permissible under the
>>>>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>>>>> analytics; to better personalize content, notices, and settings for
>>>>>>>> you; to fight spam, identity theft, malware, and other kinds of abuse;
>>>>>>>> and to provide better mobile and other applications."
>>>>>>>>
>>>>>>>> It is possible that this information is relevant for determining
>>>>>>>> the number of unique visitors that Wikipedia gets and that this
>>>>>>>> information is always properly filtered before it gets to the
>>>>>>>> Signpost. However, given recent discussions which I thought said that
>>>>>>>> Wikipedia was not instrumented to track unique visitors, I am
>>>>>>>> surprised to learn that this already seems to be happening and that
>>>>>>>> the situation has been this way for some time, so I would appreciate
>>>>>>>> clarification.
>>>>>>>>
>>>>>>>> I want to emphasize that this question is about clarifying the
>>>>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>>>>> Signpost's access to properly filtered information if there has been a
>>>>>>>> determination that WMF's retention of the raw data is appropriate.
>>>>>>>> There might be appropriate secondary questions about making sure that
>>>>>>>> access to the raw IP access data is carefully contained and secured.
>>>>>>>>
>>>>>>>> Thank you very much,
>>>>>>>>
>>>>>>>> Pine
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&diff=629934257&oldid=629932288
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Analytics mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>> --
>>>>> Jonathan T. Morgan
>>>>> Learning Strategist
>>>>> Wikimedia Foundation
>>>>> User:Jmorgan (WMF)
>>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>>> [email protected]
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
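For readers following the thread, the hand-parsing Oliver describes in his quoted message (can most requests for an article be pinned on a single {IP,user_agent} pair or a single IP, and if not, does summing by referer show a common referer?) might be sketched roughly as below. This is only an illustration, not the actual Analytics tooling: the function name `triage`, the row layout, the result labels, and the 50% cutoff standing in for "at least most of them" are all assumptions, and the sample addresses are reserved documentation ranges rather than real data.

```python
from collections import Counter


def triage(rows, dominance=0.5):
    """Classify one article's traffic from (ip, referer, user_agent, requests) rows.

    Hypothetical helper: `dominance` is an assumed cutoff for "most of the
    requests"; the thread does not state a number.
    """
    by_pair, by_ip, by_referer = Counter(), Counter(), Counter()
    total = 0
    for ip, referer, user_agent, requests in rows:
        total += requests
        by_pair[(ip, user_agent)] += requests
        by_ip[ip] += requests
        if referer:  # requests with no referer cannot share one
            by_referer[referer] += requests
    if total == 0:
        return "no requests"

    def top_share(counter):
        # Fraction of all requests owned by the single largest key.
        return max(counter.values(), default=0) / total

    if top_share(by_pair) > dominance:
        return "single client"    # one {IP,user_agent} pair dominates
    if top_share(by_ip) > dominance:
        return "single IP"        # one address, several user agents
    if top_share(by_referer) > dominance:
        return "common referer"   # real referer or live mirror: check by hand
    return "looks legit"


sample = [
    ("192.0.2.1", "", "crawler/1.0", 900),
    ("198.51.100.2", "https://example.org/", "Mozilla/5.0", 100),
]
print(triage(sample))
```

Only the resulting label would need to leave the server, which matches the point made in the thread that no IPs or user agents are passed on to the Signpost.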
