I see - Oliver's batman. Nothing to see here, moving on.

On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <[email protected]> wrote:

> I should also point out that "Toby not knowing who the staffer doing this
> one, highly specific, very minor piece of data-dogging is" does not equate
> to analytics not knowing who it is. I don't know what you do for a living
> but do you tend to give your boss's boss a constant play-by-play, or? ;p.
> It's documented in Trello just like everything else.
>
> On 17 October 2014 16:55, Oliver Keyes <[email protected]> wrote:
>
>> It's me. Hi! I'm sort of confused by this.
>>
>> In terms of shady back-alley data dealing, let me set out exactly what
>> happens.
>>
>> Every week, the signpost emails me a list of articles that have
>> unexpectedly high pageview counts and would be in the top 25, but nobody
>> can quite work out why they're so popular. I go through the logs for the
>> last week (I'd be unable to do this for any queries more than a month ago
>> anyway, since we only keep the unsampled data for that long, but a week is
>> what's relevant here), and pull out a tuple of
>> {ip, referer, user agent, article, requests} for the articles on that list.
>>
>> These tuples, which exist exclusively on our analytics machines (not even
>> my personal, encrypted work laptop: they're only stored server-side, at all
>> steps in this) are then hand-parsed by me. Can we pin all of the requests
>> for [article], or at least most of them, on a single IP address, or a
>> single {IP,user_agent} pair? Then it's probably a spammer or a spider or an
>> [expletive]. No? Okay, if we sum by referer, do we see a common referer? If
>> so, is that an actual referer or a fly-by-night live mirror? Questions like
>> that.
>>
>> When I'm done with all of the articles, I email the signpost with "for
>> article1, that looks legit. Article2 is a web crawler I'm going to email
>> and shout at. Article3 is a live mirror. Article4 looks legit.
>> Article5...". These requests are logged on our trello board, just like any
>> other data request from any other party, community or staff. Milowent and
>> the other signposters get zero IPs, zero user agents, and nothing anywhere
>> near that range of information: that stuff doesn't even leave the server.
>> And when I'm done with it, I nuke it so it's not even *there*.
>>
>> I hope that clarifies what's happening here. If you have specific
>> questions about what we keep that's obviously more of a question for
>> management.
>>
>> On 17 October 2014 12:27, Jonathan Morgan <[email protected]> wrote:
>>
>>> Pine, have you considered asking Milowent who they work with on the IP
>>> data? I really, really doubt that there is some sort of shady back-alley
>>> data dealing going down here. - Jonathan
>>>
>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <[email protected]> wrote:
>>>
>>>> Thanks Toby.
>>>>
>>>> I understand that IPs are not an especially accurate way to look at
>>>> unique visitors, but for the purposes of the Signpost's traffic report and
>>>> the Top 25 I feel that they are reasonable approximations of ways to filter
>>>> out what appear to be automated requests.
>>>>
>>>> I am ok with holding those logs for 30 days, although I am a little
>>>> surprised to hear that this is happening. However, what worries me a bit
>>>> more is the idea that a staff member can be accessing those logs without
>>>> that access being recorded. This might be something that you wish to
>>>> investigate further.
>>>>
>>>> I am not interested in getting this staff person into trouble. The
>>>> information that they are providing is useful to the Signpost and certainly
>>>> seems to be sanitized to a reasonable degree. However, it does concern me
>>>> that they can access these logs without someone knowing about it. It seems
>>>> to me that this sort of activity should be proactively disclosed to people
>>>> in WMF who conduct legal and security reviews, and I hope you will consider
>>>> what sort of security features are appropriate to make sure that occasions
>>>> when anyone accesses the raw logs are recorded in a robust manner. I worry
>>>> that if this one staffer can access logs without the higher-ups knowing
>>>> about it, it is possible that someone who intends to do unethical
>>>> activities with WMF's data could also access the logs without being
>>>> noticed.
>>>>
>>>> Thanks,
>>>>
>>>> Pine
>>>>
>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Pine --
>>>>>
>>>>> Thanks for this -- it's a challenging topic but one that the Analytics
>>>>> team takes very seriously.
>>>>>
>>>>> I'm not familiar with the IP address review that's referenced in the
>>>>> link. I don't know who the staffer might be. We don't currently calculate
>>>>> unique visitors to anything in Analytics and IP address is not a
>>>>> particularly accurate way to assess unique visitors regardless (due to
>>>>> proxies/NATs/etc).
>>>>>
>>>>> We do store IPs as part of page requests in our raw logs which are
>>>>> deleted every 30 days. This data is kept on a system where access is
>>>>> limited and controlled by the operations team. We're in line with the
>>>>> privacy policy on this.
>>>>>
>>>>> To be clear, we are currently considering mechanisms to count unique
>>>>> "requests" -- we rely on Comscore for this data and for several reasons,
>>>>> primarily related to mobile usage, it's not sufficient to understand our
>>>>> usage patterns. We are putting together some proposals to do this in as
>>>>> limited a way as possible and that's respectful to our users. We'll share
>>>>> this with the community when we feel we understand the use cases and
>>>>> trade-offs well enough to discuss in an informed manner.
>>>>>
>>>>> -Toby
>>>>>
>>>>> We do store the IP address associated with varnish requests as part of
>>>>> the log. This data is
>>>>>
>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <[email protected]> wrote:
>>>>>
>>>>>> Hi again Analytics,
>>>>>>
>>>>>> I was under the impression that no records are kept of which IPs
>>>>>> access which articles on Wikipedia when no edits are made, but it appears
>>>>>> that such records are in fact kept [1].
>>>>>>
>>>>>> Is this proper? This practice appears to be permissible under the
>>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>>> analytics; to better personalize content, notices, and settings for you;
>>>>>> to fight spam, identity theft, malware, and other kinds of abuse; and to
>>>>>> provide better mobile and other applications."
>>>>>>
>>>>>> It is possible that this information is relevant for determining the
>>>>>> number of unique visitors that Wikipedia gets and that this information
>>>>>> is always properly filtered before it gets to the Signpost. However, given
>>>>>> recent discussions which I thought said that Wikipedia was not
>>>>>> instrumented to track unique visitors, I am surprised to learn that this
>>>>>> already seems to be happening and that the situation has been this way
>>>>>> for some time, so I would appreciate clarification.
>>>>>>
>>>>>> I want to emphasize that this question is about clarifying the
>>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>>> Signpost's access to properly filtered information if there has been a
>>>>>> determination that WMF's retention of the raw data is appropriate. There
>>>>>> might be appropriate secondary questions about making sure that access to
>>>>>> the raw IP access data is carefully contained and secured.
>>>>>>
>>>>>> Thank you very much,
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>> [1]
>>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&diff=629934257&oldid=629932288
>>>
>>> --
>>> Jonathan T. Morgan
>>> Learning Strategist
>>> Wikimedia Foundation
>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>> [email protected]
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
