Daniel,

Took a second look at our dataset (FYI, we have not used sampled logs for this type of data for a while now) and hey, cache_status, cache_host and response size are right there. So, my mistake when I thought those were not included.

See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

The only field not available is request_size. No awk is needed, as this data is available in Hive for the last month. Take a look at the docs and let us know.

Thanks,

Nuria
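(As an illustration of what the Hive table makes possible, here is a minimal, untested HiveQL sketch of pulling the trace fields discussed below from one hourly partition. The table name wmf.webrequest, the year/month/day/hour partition columns, and column names such as cache_host, dt, uri_host and uri_path are assumptions based on the fields mentioned above and the linked Webrequest page; confirm them against the current schema.)

    -- Hypothetical sketch: extract trace fields from one hourly partition of the
    -- (unsampled) webrequest table. Column and partition names are assumptions;
    -- check them against the linked Webrequest documentation.
    SELECT
        cache_host,                        -- cache server that handled the request
        dt,                                -- request timestamp
        cache_status,                      -- varnish cache result (hit/miss/...)
        http_status,
        response_size,                     -- response size; request_size is the one field not available
        CONCAT(uri_host, uri_path) AS url
    FROM wmf.webrequest
    WHERE year = 2016 AND month = 2 AND day = 25 AND hour = 0
    LIMIT 10;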
On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <[email protected]> wrote:

> Tim, thanks a lot. Your scripts show that we can get everything from the
> cache log format.
>
> What is the current sampling rate for the cache logs in
> /a/log/webrequest/archive?
> I understand that wikitech's wiki information
> - 1:1000 for the general request stream [1], and
> - 1:100 for the mobile request stream [2]
> might be outdated?
>
> The 2007 trace had a 1:10 sampling rate, which means much more data.
> Would 1:10 still be feasible today?
>
> A high sampling rate would be important to reproduce the cache hit ratio
> as seen by the varnish caches. However, this depends on how the caches
> are load balanced.
> If requests get distributed round robin (and there are many caches),
> then a 1:100 sampling rate would probably be enough to reproduce their
> hit rate.
> If requests get distributed by hashing over URLs (or similar), then we
> might need a higher sampling rate (like 1:10) to capture the request
> stream's temporal locality.
>
> Starting from the fields of the 2007 trace, it would be important to
> include
> - the request size $7
> and it would be helpful to include
> - the cache hostname $1
> - the cache request status $6
>
> Building on your awk script, this would be something along the lines of
>
> function savemark(url, code) {
>     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>         return "save"
>     return "-"
> }
>
> $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     print $1, $3, $4, $9, $7, savemark($9, $6), $6
> }
>
> Would this be an acceptable format?
>
> Let me know your thoughts.
>
> Thanks a lot,
> Daniel
>
> [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
>
> On 02/25/2016 12:04 PM, Tim Starling wrote:
> > On 25/02/16 21:14, Daniel Berger wrote:
> >> Nuria, thank you for pointing out that exporting a save flag for each
> >> request will be complicated. I wasn't aware of that.
> >>
> >> It would be very interesting to learn how the previous data set's save
> >> flag was exported back in 2007.
> >
> > As I suspected in my offlist post, the save flag was set using the
> > HTTP response code. Here are the files as they were when they were
> > first committed to version control in 2012. I think they were the same
> > in 2007 except for the IP address filter:
> >
> > vu.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6)
> > }
> >
> > urjc.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6), $4, $8
> > }
> >
> > -- Tim Starling
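(On Daniel's question above about reproducing the hit ratio the varnish caches actually see: since the Hive table Nuria references is unsampled, the per-host hit ratio can in principle be measured directly rather than reconstructed from a sampled trace. Below is a rough, untested HiveQL sketch; the cache_host column name and the cache_status values matched by 'hit%' are assumptions that need to be checked against the Webrequest documentation.)

    -- Hypothetical sketch: per-cache-host hit ratio over one hourly partition.
    -- Column names and cache_status values are assumptions; verify against the docs.
    SELECT
        cache_host,
        SUM(CASE WHEN cache_status LIKE 'hit%' THEN 1 ELSE 0 END) AS hits,
        COUNT(*) AS requests,
        SUM(CASE WHEN cache_status LIKE 'hit%' THEN 1 ELSE 0 END) / COUNT(*) AS hit_ratio
    FROM wmf.webrequest
    WHERE year = 2016 AND month = 2 AND day = 25 AND hour = 0
    GROUP BY cache_host;

Such a measured baseline would also show whether a 1:100 or 1:10 sample reproduces the per-host hit rate under the current load balancing.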
