> In my understanding these fields do not include any "personal information" as per the WMF privacy policy. Please correct me if I'm wrong here.

This is correct for data requested here: https://phabricator.wikimedia.org/T128132

On Wed, Mar 23, 2016 at 1:23 AM, Daniel Berger <[email protected]> wrote:

> Hi everyone,
>
> as the one who requested data for performance research/testing, I'm happy
> to participate in the discussion.
>
> The second request, by Michal, might not be about performance. I believe
> Michal hasn't provided any details as yet. I thought I could help Michal
> by pointing out similarities to my request, but I now see that the two
> requests might be quite different.
>
> It is my goal to compile a dataset which does not include any private
> data. My request essentially asks for a higher-resolution version of the
> publicly available pagecounts data, and an update to a dataset which was
> made public in 2007 [1].
>
> Specifically, the data set would hold the same fields as the pagecounts
> data, at a higher sampling rate: 1:10 instead of hourly.
> In addition to the pagecounts fields, the public 2007 dataset has one
> additional field "save_flag", which indicates whether the request changed a
> web page. In order to compile this save_flag, three other webrequest fields
> need to be accessed, as pointed out in Tim Starling's email [2]. Tim was
> the one who helped compile the 2007 dataset.
>
> In my understanding these fields do not include any "personal information"
> as per the WMF privacy policy. Please correct me if I'm wrong here.
>
> I also would like to point out that I'm asking to make this dataset public
> (as opposed to giving it to only my research group). If helpful, I'd be
> willing to host this dataset on my institution's web server, or in a public
> AWS S3 bucket to facilitate access by the community.
>
> I made a few updates to clarify these points in the phabricator item, where
> you can find further information:
> https://phabricator.wikimedia.org/T128132
> The comments on that page discuss how we can restrict the scope to only
> the English Wikipedia and to individual WMF caching servers to scale down
> the dataset size.
>
> Let me know what you think.
>
> Best,
> Daniel
>
> [1] http://www.wikibench.eu/?page_id=60
> [2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
>
> On 03/22/2016 08:55 PM, Pine W wrote:
>
> Hi Dan,
>
> Agreed, I think it makes sense to consider a subject-specific request for
> pages that are within the scope of epidemiology, such as influenza, where
> we have reason to think that there could be public health benefits in
> analyzing the data and there are reasonable safeguards to protect user
> anonymity.
>
> A request for 1 month of the private data requested here, which appears to
> be for all pages on all projects, is far too broadly scoped. Also, in
> general, my instinct would be to deny external requests for WMF private
> data for purposes of performance testing. It seems to me that the risks far
> outweigh the benefits to Wikimedia, and that processing requests like these
> would be a suboptimal use of WMF staff time.
>
> Pine
>
> On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Pine, there are actually two separate requests and they shouldn't be
>> mixed. The performance-related one is research as far as I understand, and
>> for the other one we have no details yet. I welcome a public discussion of
>> either, and of course would respect any opinions held by the analytics
>> community at large. We have every intention to be good stewards of this
>> data and, for what it's worth, I'm very skeptical of allowing access to
>> private data, unless for obviously beneficial purposes like flu
>> forecasting, etc.
>>
>> On Tue, Mar 22, 2016 at 1:37 PM, Pine W <[email protected]> wrote:
>>
>>> I'd appreciate a clarification about the purpose of this request if
>>> Wikimedia private data is involved. If I am understanding correctly, the
>>> purpose of this request is for access to Wikimedia private data for
>>> assistance with 3rd party performance testing. If that is the case, I
>>> believe that the access request for private data should simply be denied.
>>>
>>> Pine
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
