On Thu, Jan 8, 2015 at 3:02 AM, Gergo Tisza <[email protected]> wrote:
> On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <[email protected]> wrote:
>
>> We get 120,000 requests a second. We're not storing them all for six
>> months. But we do have sampled logs going back that far.
>
> That would be great! Are those in Hadoop?
>

They're on stat1002 in /a/squid/archive/sampled/

And the webrequest format is:
https://wikitech.wikimedia.org/wiki/Cache_log_format

Note that the namespaces only show up in the title of the pages in the raw
URL, so it's still going to be a bit painful to parse them out. But folks
around here have done stuff like that, maybe someone can chime in with some
handy scripts?
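
Something along these lines usually does the trick (untested sketch; the URL
field index and the namespace list below are guesses on my part, so check
them against the cache log format page and the wikis you care about):

#!/usr/bin/env python3
# Sketch: pull namespace prefixes out of sampled webrequest log lines.
# Assumptions to verify against https://wikitech.wikimedia.org/wiki/Cache_log_format:
#   - fields are whitespace-separated and the request URL is field 9 (index 8)
#   - article URLs look like /wiki/<Namespace>:<Title>, main namespace has no prefix
#   - NAMESPACES below is illustrative, not the full list for every wiki
import sys
from urllib.parse import urlparse, unquote

URL_FIELD = 8  # assumed position of the URL in the split line
NAMESPACES = {"Talk", "User", "User_talk", "Wikipedia", "Wikipedia_talk",
              "File", "Template", "Category", "Help", "Portal", "Special"}

for line in sys.stdin:
    fields = line.split()
    if len(fields) <= URL_FIELD:
        continue
    path = urlparse(fields[URL_FIELD]).path
    if not path.startswith("/wiki/"):
        continue
    title = unquote(path[len("/wiki/"):])
    prefix = title.split(":", 1)[0]
    namespace = prefix if prefix in NAMESPACES else "Main"
    print(namespace + "\t" + title)

Then zcat one of the gzipped sampled files on stat1002 and pipe it through,
e.g. (parse_namespaces.py being whatever you save the above as):

  zcat /a/squid/archive/sampled/<sampled log file> | python3 parse_namespaces.py | cut -f1 | sort | uniq -c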
