Hi Daniel, I think your request is probably one that can be worked out in such a way that private information is sufficiently protected. The request from Michal, at least as I understand its current form, is of a much different scope. Thanks for following up.
Pine

On Wed, Mar 23, 2016 at 7:29 AM, Nuria Ruiz <[email protected]> wrote:

> >In my understanding these fields do not include any "personal
> >information" as per the WMF privacy policy. Please correct me if I'm
> >wrong here.
>
> This is correct for data requested here:
> https://phabricator.wikimedia.org/T128132
>
> On Wed, Mar 23, 2016 at 1:23 AM, Daniel Berger <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> as the one who requested data for performance research/testing, I'm
>> happy to participate in the discussion.
>>
>> The second request, by Michal, might not be about performance. I believe
>> Michal hasn't provided any details, as yet. I thought I could help Michal
>> by pointing out similarities to my request, but I now see that the two
>> requests might be quite different.
>>
>> It is my goal to compile a dataset which does not include any private
>> data. My request essentially asks for a higher-resolution version of the
>> publicly available pagecounts data, and an update to a dataset which was
>> made public in 2007 [1].
>>
>> Specifically, the data set would hold the same fields as the pagecounts
>> data, at a higher sampling rate: 1:10 instead of hourly.
>> In addition to the pagecounts fields, the public 2007 dataset has one
>> additional field "save_flag", which indicates whether the request changed a
>> web page. In order to compile this save_flag, three other webrequest fields
>> need to be accessed, as pointed out in Tim Starling's email [2]. Tim was
>> the one who helped compile the 2007 dataset.
>>
>> In my understanding these fields do not include any "personal
>> information" as per the WMF privacy policy. Please correct me if I'm wrong
>> here.
>>
>> I also would like to point out that I'm asking to make this dataset
>> public (as opposed to giving it to only my research group). If helpful, I'd
>> be willing to host this dataset on my institution's web server, or in a
>> public AWS S3 bucket to facilitate access by the community.
>>
>> I made a few updates to clarify these points in the phabricator item,
>> where you can find further information:
>> https://phabricator.wikimedia.org/T128132
>> The comments on that page discuss how we can restrict the scope to only
>> the English Wikipedia and to individual WMF caching servers to scale down
>> the dataset size.
>>
>> Let me know what you think.
>>
>> Best,
>> Daniel
>>
>> [1] http://www.wikibench.eu/?page_id=60
>> [2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
>>
>> On 03/22/2016 08:55 PM, Pine W wrote:
>>
>> Hi Dan,
>>
>> Agreed, I think it makes sense to consider a subject-specific request for
>> pages that are within the scope of epidemiology, such as influenza, where
>> we have reason to think that there could be public health benefits in
>> analyzing the data and there are reasonable safeguards to protect user
>> anonymity.
>>
>> A request for 1 month of the private data requested here, which appears
>> to be for all pages on all projects, is far too broadly scoped. Also, in
>> general, my instinct would be to deny external requests for WMF private
>> data for purposes of performance testing. It seems to me that the risks far
>> outweigh the benefits to Wikimedia, and that processing requests like these
>> would be a suboptimal use of WMF staff time.
>>
>> Pine
>>
>> On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu <[email protected]>
>> wrote:
>>
>>> Pine, there are actually two separate requests and they shouldn't be
>>> mixed. The performance-related one is research as far as I understand, and
>>> the other one we have no details yet. I welcome a public discussion of
>>> either, and of course would respect any opinions held by the analytics
>>> community at large. We have every intention to be good stewards of this
>>> data and for what it's worth, I'm very skeptical of allowing access to
>>> private data, unless for obviously beneficial purposes like flu
>>> forecasting, etc.
>>>
>>> On Tue, Mar 22, 2016 at 1:37 PM, Pine W <[email protected]> wrote:
>>>
>>>> I'd appreciate a clarification about the purpose of this request if
>>>> Wikimedia private data is involved. If I am understanding correctly, the
>>>> purpose of this request is for access to Wikimedia private data for
>>>> assistance with 3rd party performance testing. If that is the case, I
>>>> believe that the access request for private data should simply be denied.
>>>>
>>>> Pine
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
