Daniel,

Took a second look at our dataset (FYI, we have not used sampled logs for
this type of data for a while now) and, as it turns out, cache_status,
cache_host and response_size are right there. So, my mistake when I thought
those were not included.

See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

So the only thing not available is request_size. No awk is needed, as this
data is available in Hive for the last month. Take a look at the docs and
let us know.
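
For reference, pulling those fields out of Hive could look something like
the sketch below. This is just an illustration: the table and field names
follow the webrequest docs linked above, and the webrequest_source and
partition values are placeholders you would adjust to the slice and time
window you actually need.

```sql
-- Sketch: fetch cache host, cache status and response size from the
-- (unsampled) webrequest table; partition predicates are placeholders.
SELECT hostname, dt, cache_status, http_status, response_size,
       uri_host, uri_path
FROM wmf.webrequest
WHERE webrequest_source = 'text'          -- e.g. the text cache cluster
  AND year = 2016 AND month = 2 AND day = 25
LIMIT 10;
```

Restricting on the year/month/day partition columns keeps the query from
scanning the whole table.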

Thanks,

Nuria



On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <[email protected]> wrote:

> Tim, thanks a lot. Your scripts show that we can get everything from the
> cache log format.
>
>
> What is the current sampling rate for the cache logs in
> /a/log/webrequest/archive?
> I understand that the sampling rates listed on wikitech,
>  - 1:1000 for the general request stream [1], and
>  - 1:100 for the mobile request stream [2],
> might be outdated?
>
> The 2007 trace had a 1:10 sampling rate, which means much more data.
> Would 1:10 still be feasible today?
>
> A high sampling rate would be important to reproduce the cache hit ratio
> as seen by the varnish caches. However, this depends on how the caches
> are load balanced.
> If requests get distributed round robin (and there are many caches),
> then a 1:100 sampling rate would probably be enough to reproduce their
> hit rate.
> If requests get distributed by hashing over URLs (or similar), then we
> might need a higher sampling rate (like 1:10) to capture the request
> stream's temporal locality.
>
>
> Starting from the fields of the 2007 trace, it would be important to
> include
>  - the request size $7
> and it would be helpful to include
>  - the cache hostname $1
>  - the cache request status $6
>
> Building on your awk script, this would be something along the lines of
>
>  function savemark(url, code) {
>     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>         return "save"
>     return "-"
>  }
>
>  $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     print $1, $3, $4, $9, $7, savemark($9, $6), $6
>  }
>
>
> Would this be an acceptable format?
>
> Let me know your thoughts.
>
>
> Thanks a lot,
> Daniel
>
>
> [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
>
> [2]
> https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
>
>
>
>
>
>
> On 02/25/2016 12:04 PM, Tim Starling wrote:
> > On 25/02/16 21:14, Daniel Berger wrote:
> >> Nuria, thank you for pointing out that exporting a save flag for each
> >> request will be complicated. I wasn't aware of that.
> >>
> >> It would be very interesting to learn how the previous data set's save
> >> flag was exported back in 2007.
> >
> > As I suspected in my offlist post, the save flag was set using the
> > HTTP response code. Here are the files as they were when they were
> > first committed to version control in 2012. I think they were the same
> > in 2007 except for the IP address filter:
> >
> > vu.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6)
> > }
> >
> >
> > urjc.awk:
> >
> > function savemark(url, code) {
> >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> >         return "save"
> >     return "-"
> > }
> >
> > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> >     print $3, $9, savemark($9, $6), $4, $8
> > }
> >
> >
> > -- Tim Starling
> >
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>