Thank you, Nuria, for pointing me to the right doc. This looks great!

Do I correctly understand that we can compile a trace with all requests
(or with a high sampling rate like 1:10) from the 'refined' webrequest data?

We can go without request size. The following fields would be important
- ts           timestamp in ms (to save bytes)
- uri_host
- uri_path
- uri_query     needed for save flag
- cache_status  needed for save flag
- http_method   needed for save flag (see sketch below)
- response_size
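
For the save flag, I imagine we would reconstruct something like the
2007 savemark from those three fields. Here is a minimal awk sketch of
what I have in mind, assuming a TSV export with the columns in the
order listed above (uri_query = $4, cache_status = $5, http_method = $6);
the POST check and the miss/pass matching for cache_status are my
assumptions, to be verified against the actual data:

    BEGIN { FS = OFS = "\t" }

    # hypothetical analogue of the 2007 rule
    # (url ~ /action=submit$/ && code == "TCP_MISS/302")
    function savemark(query, method, status) {
        if (query ~ /action=submit/ && method == "POST" && status ~ /miss|pass/)
            return "save"
        return "-"
    }

    # append the save flag as an extra column
    { print $0, savemark($4, $6, $5) }

Running that over a small sample should make it easy to check the flag
against known edits.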

Additionally, it would be interesting to have
- hostname      to study cache load balancing
- sequence      to uniquely order requests within the same ms
- content_type  to study hit rates per content type (sketch below)
- access_method   to study hit rates per access type
- time_firstbyte  for performance/latency comparison
- x_cache       more cache statistics (cache hierarchy)
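
To illustrate the content_type point: with cache_status and content_type
in the trace, the per-type hit rates would be a one-pass aggregation
along these lines (the column numbers assume the extra fields follow
response_size in the order above, i.e. content_type = $10; the 'hit'
substring match is a guess at the refined cache_status values):

    BEGIN { FS = "\t" }
    {
        total[$10]++                 # requests per content type
        if ($5 ~ /hit/) hits[$10]++  # count requests served from cache
    }
    END {
        for (t in total)
            printf "%-40s %6.2f%%\n", t, 100 * hits[t] / total[t]
    }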


How do we proceed from here?

I guess it would make sense to first look at a tiny data set to verify
we have what we need. Would a few tens of requests be enough?


Thanks a lot for your time!
Daniel




On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
> Daniel, 
> 
> Took a second look at our dataset (FYI, we have not used sampled logs
> for a while now for this type of data) and hey, cache_status, cache_host
> and response_size are right there. So it was my mistake to think those
> were not included.
> 
> See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
> 
> So the only thing not available is request_size.  No awk is needed as
> this data is available on hive for the last month. Take a look at docs
> and let us know.
> 
> Thanks, 
> 
> Nuria
> 
> 
> 
> On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <[email protected]> wrote:
> 
>     Tim, thanks a lot. Your scripts show that we can get everything from the
>     cache log format.
> 
> 
>     What is the current sampling rate for the cache logs in
>     /a/log/webrequest/archive?
>     I understand that the rates documented on wikitech, namely
>      - 1:1000 for the general request stream [1], and
>      - 1:100 for the mobile request stream [2],
>     might be outdated?
> 
>     The 2007 trace had a 1:10 sampling rate, which means much more data.
>     Would 1:10 still be feasible today?
> 
>     A high sampling rate would be important to reproduce the cache hit ratio
>     as seen by the varnish caches. However, this depends on how the caches
>     are load balanced.
>     If requests get distributed round-robin (and there are many caches),
>     then a 1:100 sampling rate would probably be enough to reproduce their
>     hit rate.
>     If requests get distributed by hashing over URLs (or similar), then we
>     might need a higher sampling rate (like 1:10) to capture the request
>     stream's temporal locality.
> 
> 
>     Starting from the fields of the 2007 trace, it would be important to
>     include
>      - the request size $7
>     and it would be helpful to include
>      - the cache hostname $1
>      - the cache request status $6
> 
>     Building on your awk script, this would be something along these lines:
> 
>      function savemark(url, code) {
>         if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>             return "save"
>         return "-"
>      }
> 
>      $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>         print $1, $3, $4, $9, $7, savemark($9, $6), $6
>      }
> 
> 
>     Would this be an acceptable format?
> 
>     Let me know your thoughts.
> 
> 
>     Thanks a lot,
>     Daniel
> 
> 
>     [1]
>     https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> 
>     [2]
>     https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
> 
> 
> 
> 
> 
> 
>     On 02/25/2016 12:04 PM, Tim Starling wrote:
>     > On 25/02/16 21:14, Daniel Berger wrote:
>     >> Nuria, thank you for pointing out that exporting a save flag for each
>     >> request will be complicated. I wasn't aware of that.
>     >>
>     >> It would be very interesting to learn how the previous data set's
>     >> save flag was exported back in 2007.
>     >
>     > As I suspected in my offlist post, the save flag was set using the
>     > HTTP response code. Here are the files as they were when they were
>     > first committed to version control in 2012. I think they were the same
>     > in 2007 except for the IP address filter:
>     >
>     > vu.awk:
>     >
>     > function savemark(url, code) {
>     >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>     >         return "save"
>     >     return "-"
>     > }
>     >
>     > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     >     print $3, $9, savemark($9, $6)
>     > }
>     >
>     >
>     > urjc.awk:
>     >
>     > function savemark(url, code) {
>     >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
>     >         return "save"
>     >     return "-"
>     > }
>     >
>     > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
>     >     print $3, $9, savemark($9, $6), $4, $8
>     > }
>     >
>     >
>     > -- Tim Starling
>     >
> 

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
