Following up on this thread to keep the archives happy. The dataset has been compiled and is available here: https://datasets.wikimedia.org/public-datasets/analytics/caching/
If you are interested in the caching data, please read the ticket, as there
are data nuances you should know about: https://phabricator.wikimedia.org/T128132

On Thu, Feb 25, 2016 at 1:43 PM, Daniel Berger <[email protected]> wrote:
> Alright, the corresponding task can be found here:
> https://phabricator.wikimedia.org/T128132
>
> Thanks a lot for your help Nuria and Tim!
> Daniel
>
> On 02/25/2016 09:58 PM, Nuria Ruiz wrote:
> > > How do we proceed from here?
> >
> > You can open a phabricator item, explain your request, and tag it with
> > the "analytics" tag; it will go into our backlog.
> >
> > Phabricator: https://phabricator.wikimedia.org
> >
> > Our backlog: https://phabricator.wikimedia.org/tag/analytics/
> >
> > What we are currently working on:
> > https://phabricator.wikimedia.org/tag/analytics-kanban/
> >
> > Our team focuses on infrastructure for analytics rather than on
> > compiling "ad-hoc" datasets. Since most requests are about edit or
> > pageview data, those are normally covered either by existing datasets,
> > by collaborations with the research team, or by analysts working for
> > other teams in the organization. Now, we understand this data request
> > does not fit either of those, so that is why I am suggesting we put it
> > on our backlog, where our team will look at it.
> >
> > Thanks,
> >
> > Nuria
> >
> > On Thu, Feb 25, 2016 at 12:42 PM, Daniel Berger <[email protected]> wrote:
> > > Thank you, Nuria, for pointing me to the right doc. This looks great!
> > >
> > > Do I correctly understand that we can compile a trace with all
> > > requests (or with a high sampling rate like 1:10) from the 'refined'
> > > webrequest data?
> > >
> > > We can go without request size. The following fields would be
> > > important:
> > > - ts              timestamp in ms (to save bytes)
> > > - uri_host
> > > - uri_path
> > > - uri_query       needed for save flag
> > > - cache_status    needed for save flag
> > > - http_method     needed for save flag
> > > - response_size
> > >
> > > Additionally, it would be interesting to have:
> > > - hostname        to study cache load balancing
> > > - sequence        to uniquely order requests below ms resolution
> > > - content_type    to study hit rates per content type
> > > - access_method   to study hit rates per access type
> > > - time_firstbyte  for performance/latency comparison
> > > - x_cache         more cache statistics (cache hierarchy)
> > >
> > > How do we proceed from here?
> > >
> > > I guess it would make sense to first look at a tiny data set to
> > > verify we have what we need. I'm thinking about a few tens of
> > > requests?
> > >
> > > Thanks a lot for your time!
> > > Daniel
> > >
> > > On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
> > > > Daniel,
> > > >
> > > > Took a second look at our dataset (FYI, we have not used sampled
> > > > logs for a while now for this type of data) and hey, cache_status,
> > > > cache_host and response size are right there. So, my mistake when I
> > > > thought those were not included.
> > > >
> > > > See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
> > > >
> > > > So the only thing not available is request_size. No awk is needed,
> > > > as this data is available on Hive for the last month. Take a look
> > > > at the docs and let us know.
> > > >
> > > > Thanks,
> > > >
> > > > Nuria
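For reference, a trace with the fields Daniel lists above could presumably
be pulled straight from the refined data Nuria points to. A minimal sketch
of such a query, assuming the wmf.webrequest table, partition layout, and
field names documented on the wikitech Webrequest page (all of which should
be checked against the live schema before running anything):

    SELECT ts, uri_host, uri_path, uri_query,
           cache_status, http_method, response_size,
           hostname, sequence, content_type,
           access_method, time_firstbyte, x_cache
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25 AND hour = 13
      -- sequence is a per-cache-host counter, so this thins every
      -- host's request stream at roughly 1:10
      AND sequence % 10 = 0;

Adding a LIMIT clause to a first run would give the "few tens of requests"
sanity check Daniel suggests.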
> > > > On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <[email protected]> wrote:
> > > > > Tim, thanks a lot. Your scripts show that we can get everything
> > > > > from the cache log format.
> > > > >
> > > > > What is the current sampling rate for the cache logs in
> > > > > /a/log/webrequest/archive?
> > > > > I understand that the rates documented on wikitech,
> > > > > - 1:1000 for the general request stream [1], and
> > > > > - 1:100 for the mobile request stream [2],
> > > > > might be outdated?
> > > > >
> > > > > The 2007 trace had a 1:10 sampling rate, which means much more
> > > > > data. Would 1:10 still be feasible today?
> > > > >
> > > > > A high sampling rate would be important to reproduce the cache
> > > > > hit ratio as seen by the Varnish caches. However, this depends
> > > > > on how the caches are load balanced.
> > > > > If requests get distributed round robin (and there are many
> > > > > caches), then a 1:100 sampling rate would probably be enough to
> > > > > reproduce their hit rate.
> > > > > If requests get distributed by hashing over URLs (or similar),
> > > > > then we might need a higher sampling rate (like 1:10) to capture
> > > > > the request stream's temporal locality.
> > > > >
> > > > > Starting from the fields of the 2007 trace, it would be
> > > > > important to include
> > > > > - the request size          $7
> > > > > and it would be helpful to include
> > > > > - the cache hostname        $1
> > > > > - the cache request status  $6
> > > > >
> > > > > Building on your awk script, this would be something along the
> > > > > lines of
> > > > >
> > > > > function savemark(url, code) {
> > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > >         return "save"
> > > > >     return "-"
> > > > > }
> > > > >
> > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > >     print $1, $3, $4, $9, $7, savemark($9, $6), $6
> > > > > }
> > > > >
> > > > > Would this be an acceptable format?
> > > > >
> > > > > Let me know your thoughts.
> > > > >
> > > > > Thanks a lot,
> > > > > Daniel
> > > > >
> > > > > [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> > > > > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
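A side note on the load-balancing concern above: uniform 1:100 sampling
thins every URL's request stream and thus dilutes temporal locality,
whereas sampling by a hash of the URL keeps the complete stream for a
fixed slice of the URL space. A sketch of the latter, under the same
wmf.webrequest schema assumptions as the query above:

    -- keep every request for roughly one tenth of the URL space
    SELECT ts, uri_host, uri_path, cache_status, response_size
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25
      AND pmod(hash(uri_host, uri_path), 10) = 0;

Which variant better reproduces the production hit ratio depends, as
Daniel says, on whether the caches themselves are balanced round robin
or by URL hash.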
> > > > > On 02/25/2016 12:04 PM, Tim Starling wrote:
> > > > > > On 25/02/16 21:14, Daniel Berger wrote:
> > > > > > > Nuria, thank you for pointing out that exporting a save flag
> > > > > > > for each request will be complicated. I wasn't aware of that.
> > > > > > >
> > > > > > > It would be very interesting to learn how the previous data
> > > > > > > set's save flag was exported back in 2007.
> > > > > >
> > > > > > As I suspected in my offlist post, the save flag was set using
> > > > > > the HTTP response code. Here are the files as they were when
> > > > > > they were first committed to version control in 2012. I think
> > > > > > they were the same in 2007 except for the IP address filter:
> > > > > >
> > > > > > vu.awk:
> > > > > >
> > > > > > function savemark(url, code) {
> > > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > > >         return "save"
> > > > > >     return "-"
> > > > > > }
> > > > > >
> > > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > > >     print $3, $9, savemark($9, $6)
> > > > > > }
> > > > > >
> > > > > > urjc.awk:
> > > > > >
> > > > > > function savemark(url, code) {
> > > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > > >         return "save"
> > > > > >     return "-"
> > > > > > }
> > > > > >
> > > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > > >     print $3, $9, savemark($9, $6), $4, $8
> > > > > > }
> > > > > >
> > > > > > -- Tim Starling
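For completeness: on the refined data, the save-flag logic from Tim's awk
scripts can be expressed directly in the query. A sketch only; the mapping
of the 2007 TCP_MISS/302 code onto the refined cache_status and http_status
fields is an assumption and should be verified against the Webrequest docs:

    SELECT ts, uri_host, uri_path,
           -- the 2007 scripts marked an edit as saved when an
           -- action=submit URL was answered with TCP_MISS/302
           IF(uri_query LIKE '%action=submit'
              AND cache_status = 'miss'
              AND http_status = '302',
              'save', '-') AS save_flag
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25 AND hour = 13;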
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
