> In my understanding these fields do not include any "personal information"
> as per the WMF privacy policy. Please correct me if I'm wrong here.

This is correct for the data requested here:
https://phabricator.wikimedia.org/T128132

On Wed, Mar 23, 2016 at 1:23 AM, Daniel Berger <[email protected]> wrote:

> Hi everyone,
>
> As the one who requested data for performance research/testing, I'm happy
> to participate in the discussion.
>
> The second request, by Michal, might not be about performance. I believe
> Michal hasn't provided any details yet. I thought I could help Michal
> by pointing out similarities to my request, but I now see that the two
> requests might be quite different.
>
> It is my goal to compile a dataset that does not include any private
> data. My request essentially asks for a higher-resolution version of the
> publicly available pagecounts data, and for an update to a dataset that was
> made public in 2007 [1].
>
> Specifically, the data set would hold the same fields as the pagecounts
> data, at a higher sampling rate: 1:10 instead of hourly.
> In addition to the pagecounts fields, the public 2007 dataset has one
> additional field "save_flag", which indicates whether the request changed a
> web page. In order to compile this save_flag, three other webrequest fields
> need to be accessed, as pointed out in Tim Starling's email [2]. Tim was
> the one who helped compile the 2007 dataset.
>
> In my understanding these fields do not include any "personal information"
> as per the WMF privacy policy. Please correct me if I'm wrong here.
>
>
> I also would like to point out that I'm asking to make this dataset public
> (as opposed to giving it to only my research group). If helpful, I'd be
> willing to host this dataset on my institution's web server, or in a public
> AWS S3 bucket to facilitate access by the community.
>
> I made a few updates to clarify these points in the Phabricator item, where
> you can find further information:
> https://phabricator.wikimedia.org/T128132
> The comments on that page discuss how we can restrict the scope to only
> the English Wikipedia and to individual WMF caching servers to scale down
> the dataset size.
>
>
> Let me know what you think.
>
> Best,
> Daniel
>
> [1] http://www.wikibench.eu/?page_id=60
> [2] http://thread.gmane.org/gmane.org.wikimedia.analytics/3405/focus=3408
>
>
>
> On 03/22/2016 08:55 PM, Pine W wrote:
>
> Hi Dan,
>
> Agreed, I think it makes sense to consider a subject-specific request for
> pages that are within the scope of epidemiology, such as influenza, where
> we have reason to think that there could be public health benefits in
> analyzing the data and there are reasonable safeguards to protect user
> anonymity.
>
> A request for 1 month of the private data requested here, which appears to
> be for all pages on all projects, is far too broadly scoped. Also, in
> general, my instinct would be to deny external requests for WMF private
> data for purposes of performance testing. It seems to me that the risks far
> outweigh the benefits to Wikimedia, and that processing requests like these
> would be a suboptimal use of WMF staff time.
>
> Pine
>
> On Tue, Mar 22, 2016 at 12:44 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Pine, there are actually two separate requests and they shouldn't be
>> mixed.  The performance-related one is research as far as I understand, and
>> for the other one we have no details yet.  I welcome a public discussion of
>> either, and of course would respect any opinions held by the analytics
>> community at large.  We have every intention to be good stewards of this
>> data and for what it's worth, I'm very skeptical of allowing access to
>> private data, unless for obviously beneficial purposes like flu
>> forecasting, etc.
>>
>> On Tue, Mar 22, 2016 at 1:37 PM, Pine W <[email protected]> wrote:
>>
>>> I'd appreciate a clarification about the purpose of this request if
>>> Wikimedia private data is involved. If I am understanding correctly, the
>>> purpose of this request is for access to Wikimedia private data for
>>> assistance with 3rd party performance testing. If that is the case, I
>>> believe that the access request for private data should simply be denied.
>>>
>>> Pine
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>