Following up on this thread to keep the archives happy. The dataset has been compiled and is available here: https://datasets.wikimedia.org/public-datasets/analytics/caching/
If you are interested in the caching data, please read the ticket, as there
are data nuances you should know about: https://phabricator.wikimedia.org/T128132

On Thu, Feb 25, 2016 at 1:43 PM, Daniel Berger <[email protected]> wrote:
> Alright, the corresponding task can be found here:
> https://phabricator.wikimedia.org/T128132
>
> Thanks a lot for your help Nuria and Tim!
> Daniel
>
> On 02/25/2016 09:58 PM, Nuria Ruiz wrote:
> > > How do we proceed from here?
> >
> > You can open a phabricator item, explain your request, and tag it with
> > the "analytics" tag; it will go into our backlog.
> >
> > Phabricator: https://phabricator.wikimedia.org
> >
> > Our backlog: https://phabricator.wikimedia.org/tag/analytics/
> >
> > What we are currently working on:
> > https://phabricator.wikimedia.org/tag/analytics-kanban/
> >
> > Our team focuses on infrastructure for analytics rather than on
> > compiling "ad-hoc" datasets. Since most requests are about edit or
> > pageview data, those are normally covered either by existing datasets,
> > by collaborations with the research team, or by analysts working for
> > other teams in the organization. Now, we understand this data request
> > does not fit either of those, so that is why I am suggesting we put it
> > on our backlog, where our team will look at it.
> >
> > Thanks,
> >
> > Nuria
> >
> > On Thu, Feb 25, 2016 at 12:42 PM, Daniel Berger <[email protected]> wrote:
> > > Thank you, Nuria, for pointing me to the right doc. This looks great!
> > >
> > > Do I correctly understand that we can compile a trace with all
> > > requests (or with a high sampling rate like 1:10) from the 'refined'
> > > webrequest data?
> > >
> > > We can go without request size. The following fields would be
> > > important:
> > > - ts              timestamp in ms (to save bytes)
> > > - uri_host
> > > - uri_path
> > > - uri_query       needed for save flag
> > > - cache_status    needed for save flag
> > > - http_method     needed for save flag
> > > - response_size
> > >
> > > Additionally, it would be interesting to have:
> > > - hostname        to study cache load balancing
> > > - sequence        to uniquely order requests below ms resolution
> > > - content_type    to study hit rates per content type
> > > - access_method   to study hit rates per access type
> > > - time_firstbyte  for performance/latency comparison
> > > - x_cache         more cache statistics (cache hierarchy)
> > >
> > > How do we proceed from here?
> > >
> > > I guess it would make sense to first look at a tiny data set to
> > > verify we have what we need. I'm thinking about a few tens of
> > > requests?
> > >
> > > Thanks a lot for your time!
> > > Daniel
> > >
> > > On 02/25/2016 05:55 PM, Nuria Ruiz wrote:
> > > > Daniel,
> > > >
> > > > Took a second look at our dataset (FYI, we have not used sampled
> > > > logs for a while now for this type of data) and hey, cache_status,
> > > > cache_host and response size are right there. So, my mistake when I
> > > > thought those were not included.
> > > >
> > > > See: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
> > > >
> > > > So the only thing not available is request_size. No awk is needed,
> > > > as this data is available on Hive for the last month. Take a look
> > > > at the docs and let us know.
> > > >
> > > > Thanks,
> > > >
> > > > Nuria
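For reference, a trace with the fields Daniel lists above could presumably
be pulled straight from the refined data Nuria points to. A minimal sketch
of such a query, assuming the wmf.webrequest table, partition layout, and
field names documented on the wikitech Webrequest page (all of which should
be checked against the live schema before running anything):

    SELECT ts, uri_host, uri_path, uri_query,
           cache_status, http_method, response_size,
           hostname, sequence, content_type,
           access_method, time_firstbyte, x_cache
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25 AND hour = 13
      -- sequence is a per-cache-host counter, so this thins every
      -- host's request stream at roughly 1:10
      AND sequence % 10 = 0;

Adding a LIMIT clause to a first run would give the "few tens of requests"
sanity check Daniel suggests.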
> > > > On Thu, Feb 25, 2016 at 7:50 AM, Daniel Berger <[email protected]> wrote:
> > > > > Tim, thanks a lot. Your scripts show that we can get everything
> > > > > from the cache log format.
> > > > >
> > > > > What is the current sampling rate for the cache logs in
> > > > > /a/log/webrequest/archive?
> > > > > I understand that the rates documented on wikitech,
> > > > > - 1:1000 for the general request stream [1], and
> > > > > - 1:100 for the mobile request stream [2],
> > > > > might be outdated?
> > > > >
> > > > > The 2007 trace had a 1:10 sampling rate, which means much more
> > > > > data. Would 1:10 still be feasible today?
> > > > >
> > > > > A high sampling rate would be important to reproduce the cache
> > > > > hit ratio as seen by the Varnish caches. However, this depends
> > > > > on how the caches are load balanced.
> > > > > If requests get distributed round robin (and there are many
> > > > > caches), then a 1:100 sampling rate would probably be enough to
> > > > > reproduce their hit rate.
> > > > > If requests get distributed by hashing over URLs (or similar),
> > > > > then we might need a higher sampling rate (like 1:10) to capture
> > > > > the request stream's temporal locality.
> > > > >
> > > > > Starting from the fields of the 2007 trace, it would be
> > > > > important to include
> > > > > - the request size          $7
> > > > > and it would be helpful to include
> > > > > - the cache hostname        $1
> > > > > - the cache request status  $6
> > > > >
> > > > > Building on your awk script, this would be something along the
> > > > > lines of
> > > > >
> > > > > function savemark(url, code) {
> > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > >         return "save"
> > > > >     return "-"
> > > > > }
> > > > >
> > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > >     print $1, $3, $4, $9, $7, savemark($9, $6), $6
> > > > > }
> > > > >
> > > > > Would this be an acceptable format?
> > > > >
> > > > > Let me know your thoughts.
> > > > >
> > > > > Thanks a lot,
> > > > > Daniel
> > > > >
> > > > > [1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
> > > > > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
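A side note on the load-balancing concern above: uniform 1:100 sampling
thins every URL's request stream and thus dilutes temporal locality,
whereas sampling by a hash of the URL keeps the complete stream for a
fixed slice of the URL space. A sketch of the latter, under the same
wmf.webrequest schema assumptions as the query above:

    -- keep every request for roughly one tenth of the URL space
    SELECT ts, uri_host, uri_path, cache_status, response_size
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25
      AND pmod(hash(uri_host, uri_path), 10) = 0;

Which variant better reproduces the production hit ratio depends, as
Daniel says, on whether the caches themselves are balanced round robin
or by URL hash.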
> > > > > On 02/25/2016 12:04 PM, Tim Starling wrote:
> > > > > > On 25/02/16 21:14, Daniel Berger wrote:
> > > > > > > Nuria, thank you for pointing out that exporting a save flag
> > > > > > > for each request will be complicated. I wasn't aware of that.
> > > > > > >
> > > > > > > It would be very interesting to learn how the previous data
> > > > > > > set's save flag was exported back in 2007.
> > > > > >
> > > > > > As I suspected in my offlist post, the save flag was set using
> > > > > > the HTTP response code. Here are the files as they were when
> > > > > > they were first committed to version control in 2012. I think
> > > > > > they were the same in 2007 except for the IP address filter:
> > > > > >
> > > > > > vu.awk:
> > > > > >
> > > > > > function savemark(url, code) {
> > > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > > >         return "save"
> > > > > >     return "-"
> > > > > > }
> > > > > >
> > > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > > >     print $3, $9, savemark($9, $6)
> > > > > > }
> > > > > >
> > > > > > urjc.awk:
> > > > > >
> > > > > > function savemark(url, code) {
> > > > > >     if (url ~ /action=submit$/ && code == "TCP_MISS/302")
> > > > > >         return "save"
> > > > > >     return "-"
> > > > > > }
> > > > > >
> > > > > > $5 !~ /^(145\.97\.39\.|66\.230\.200\.|211\.115\.107\.)/ {
> > > > > >     print $3, $9, savemark($9, $6), $4, $8
> > > > > > }
> > > > > >
> > > > > > -- Tim Starling
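For completeness: on the refined data, the save-flag logic from Tim's awk
scripts can be expressed directly in the query. A sketch only; the mapping
of the 2007 TCP_MISS/302 code onto the refined cache_status and http_status
fields is an assumption and should be verified against the Webrequest docs:

    SELECT ts, uri_host, uri_path,
           -- the 2007 scripts marked an edit as saved when an
           -- action=submit URL was answered with TCP_MISS/302
           IF(uri_query LIKE '%action=submit'
              AND cache_status = 'miss'
              AND http_status = '302',
              'save', '-') AS save_flag
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 2 AND day = 25 AND hour = 13;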
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
