Nuria, thank you for pointing out that exporting a save flag for each
request will be complicated. I wasn't aware of that.

It would be very interesting to learn how the previous data set's save
flag was exported back in 2007.


Maybe it would be possible to derive a save flag from data already
available to the analytics infrastructure (the request streams on stat1002).
Here are two naive ideas.

1) In Wikimedia's cache log format [1], I can see that the request
method (%m) is logged. Wouldn't the request method allow us to detect
POST requests and thus set the save flag?
Maybe the log even includes the PURGE requests triggered by a save
operation?
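To make idea 1 concrete, here is a minimal sketch of how the flag could be derived from a whitespace-separated log line. The field index is a placeholder assumption and would have to be checked against the documented cache log format [1]; whether PURGE requests appear in the stream at all is, as said, an open question.

```python
# Sketch for idea 1: derive a save flag from the request method (%m).
# METHOD_FIELD is an assumed placeholder index -- the real position must
# be taken from the documented cache log format. Treating PURGE as a
# save indicator is speculative.

SAVE_METHODS = {"POST", "PURGE"}
METHOD_FIELD = 7  # assumption, not verified against the format spec

def save_flag(log_line: str) -> str:
    """Return 'save' if the request method suggests a save, '-' otherwise."""
    fields = log_line.split()
    if len(fields) <= METHOD_FIELD:
        return "-"  # malformed or truncated line
    return "save" if fields[METHOD_FIELD] in SAVE_METHODS else "-"
```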

2) We can try detecting object updates by changes in their size.
Specifically, we would need to know the response size and whether the
response was gzipped. Without knowing whether a response was gzipped we
might be detecting many spurious object updates.
Unfortunately, it seems that the cache log format [1] does not include
the Content-Encoding header, which we would need to reliably detect
gzipped responses.
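For idea 2, a minimal sketch of the size-change heuristic, tracking the last observed response size per URL. As noted above, without the Content-Encoding header a size difference may only reflect a gzipped versus an uncompressed response of the same object, so this heuristic over-reports updates.

```python
# Sketch for idea 2: flag a request as a potential object update when the
# response size for its URL differs from the last size observed. Without
# Content-Encoding information this over-detects updates (gzipped and
# plain responses of the same object differ in size).

def detect_updates(requests):
    """requests: iterable of (url, size) pairs, in log order.
    Yields (url, size, update_suspected)."""
    last_size = {}
    for url, size in requests:
        suspected = url in last_size and last_size[url] != size
        last_size[url] = size
        yield url, size, suspected
```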

Best,
Daniel

[1] https://wikitech.wikimedia.org/wiki/Cache_log_format


On 02/24/2016 09:59 PM, Nuria Ruiz wrote:
> (cc-ing Tim Starling who is credited on your dataset page and might know
> more about this)
>>I would like to ask for your comments about compiling a similar
> (updated) data set and making it public.
> 
> 
> As far as I can see the prior dataset contained the following:
> 
> Counter, timestamp, url, save flag
> 
> 929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png -
> 929840891 1190146243.303 http://en.wikipedia.org/images/wiki-en.png  save 
> 
> I can see how we could get a dataset with timestamp and url, and adding
> a counter is something that can be done (though on our current system
> the ordering of requests is not guaranteed in logs). Now, I really do
> not know whether it is possible to add a flag of whether the request
> was a save or not. As far as I know that is not information we have on
> our current system, and it seems that it will require tapping into the
> cache lookups to get that info. Meaning that you would need to get that
> info from varnish lookups as requests are happening, which is before
> analytics systems get any of the data.
> 
> Anyways, I hope other folks can chime in on how/whether this can be
> done somewhat easily; it certainly requires access to other parts of
> the stack besides the analytics infrastructure.
> 
> 
> Thanks, 
> 
> Nuria
> 
> 
> 
> 
> 
> 
> 
> 
> On Wed, Feb 24, 2016 at 3:05 AM, Daniel Berger <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Hi everyone,
> 
>     I'm a PhD student studying mathematical models to improve the hit
>     ratio of web caches. In my research community, we lack realistic
>     data sets and frequently rely on outdated modelling assumptions.
> 
>     Previously (~2007), a trace containing 10% of user requests issued
>     to Wikipedia was publicly released [1]. This data set has been
>     used widely for performance evaluations of new caching algorithms,
>     e.g., for the new Caffeine caching framework for Java [2].
> 
>     I would like to ask for your comments about compiling a similar
>     (updated) data set and making it public.
> 
> 
>     In my understanding, the necessary logs are readily available, e.g.,
>     in the Analytics/Data/Mobile requests stream [3] on stat1002, with a
>     sampling rate of 1:100. As this request stream contains sensitive
>     data (e.g., client IPs), it would need anonymization before making
>     it public. I would be glad to help with that.
> 
>     The previously released data set [1] contains no client information.
>     It contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an
>     update flag. I would additionally suggest including 5) the cache's
>     hostname, 6) the cache_status, and 7) the response size (from the
>     Wikimedia cache log format).
>     I believe this format would preserve anonymity, and would be
>     interesting for many researchers.
> 
>     Let me know your thoughts.
> 
>     Thanks,
>     Daniel Berger
>     http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
> 
>     [1] http://www.wikibench.eu/?page_id=60
>     [2] https://github.com/ben-manes/caffeine/wiki/Efficiency
>     [3]
>     https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
> 
>     _______________________________________________
>     Analytics mailing list
>     [email protected] <mailto:[email protected]>
>     https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> 
> 
> 
