> then yes you do need to go through an NDA process because you are asking to see raw user agent strings, and that's among the data that we guard very carefully.

To clarify, raw user agents are only available for the last 60 days. Data is aggregated after that period.
On Tue, Mar 22, 2016 at 5:42 AM, Dan Andreescu <[email protected]> wrote:

> Michal, if what Daniel is saying is not sufficient, then yes you do need
> to go through an NDA process because you are asking to see raw user agent
> strings, and that's among the data that we guard very carefully.
>
> *From: *Daniel Berger
> *Sent: *Monday, March 21, 2016 06:21
> *To: *[email protected]
> *Reply To: *A mailing list for the Analytics Team at WMF and everybody
> who has an interest in Wikipedia and analytics.
> *Subject: *Re: [Analytics] [wmf.webrequest data] one-time access
>
> Hi Michal,
>
> it seems that what you want is a data set which would be very similar to
> what I recently issued a request for: see this phabricator item
> https://phabricator.wikimedia.org/T128132
>
> There has been a public data set for the year 2007, part of which is
> publicly available [1]. See [2] for a study using the 2007 data set.
>
> My focus has been on simulating the performance of WMF's caching servers,
> for which the 2007 data set is insufficient. However, a different research
> domain might require a slightly different focus of capturing the data set.
>
> The 2007 data set was captured with a sampling rate of 1:10. For my
> project, such a high sampling rate would be perfect (1:100 might also
> work). However, I learned that the current request rate is much higher so
> we'd have to narrow the scope of the data set (e.g., by focussing on
> specific WMF projects, like the English Wikipedia). You can find a
> discussion on the phabricator page linked above.
>
> What would be the lowest sampling rate allowable for your project? I
> assume the publicly available hourly access data [3], [4] would be
> insufficient?
>
> Feel free to comment on the phabricator item, maybe we can compile a
> single data set that works for both of our research domains and also helps
> other people?
>
> Best,
> Daniel
>
> [1] http://www.wikibench.eu/?page_id=60
> [2] http://www.distributed-systems.net/papers/2009.comnet-wiki.pdf
> [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
> [4] http://dumps.wikimedia.org/other/pagecounts-all-sites/
>
> On 03/21/2016 10:11 AM, Michal Bystricky wrote:
>
> We would like to have URI addresses of requests for some time of usage,
> let's say 1 month.
>
> According to the data format
> <https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest>, the
> attributes of Webrequests we need are the following:
>
> http_method,
> uri_host,
> uri_path,
> uri_query,
> ts,
> access_method,
> agent_type,
> pageview_info,
> page_id
>
> Do we need to go through the NDA process, or is it possible to get the data
> right away from the public dataset?
>
> Thank you,
>
> M.
>
> Can you be more specific about what you need, Michal? If you truly need
> access to the private data that we keep in wmf.webrequest for a limited
> time, then you'd have to go through a process to sign an NDA. But if you
> tell us what you need, there may be a public dataset that you can use.
>
> On Thu, Mar 3, 2016 at 2:48 PM, Michal Bystricky <[email protected]> wrote:
>
>> Hello Analytics Team,
>>
>> We would like to have one-time access to wmf.webrequest data. What is the
>> correct way of accessing the data?
>>
>> In our research group, we want to simulate the requests for a specific
>> version of WikiMedia.
>>
>> Thanks,
>> Michal Bystricky
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
