On Thu, Jan 8, 2015 at 3:02 AM, Gergo Tisza <[email protected]> wrote:

> On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <[email protected]> wrote:
>
>> We get 120,000 requests a second. We're not storing them all for six
>> months. But we do have sampled logs going back that far.
>
>
> That would be great! Are those in Hadoop?
>

They're on stat1002 in /a/squid/archive/sampled/

And the webrequest log format is documented at:
https://wikitech.wikimedia.org/wiki/Cache_log_format
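
If it helps, here's the sort of thing I'd start with for streaming URLs out
of one of those files. It's a rough, untested sketch in Python; the gzip
compression, the tab delimiter, and the URL column index are all assumptions
on my part, so check them against the format page above:

    import gzip

    URL_FIELD = 8  # assumed position of the request URL; verify against the format page

    def urls(path):
        """Yield the request URL from each line of one sampled log file."""
        with gzip.open(path, "rt", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) > URL_FIELD:
                    yield fields[URL_FIELD]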

Note that the namespaces only show up as part of the page title in the raw
URL, so it's still going to be a bit painful to parse them out.  But folks
around here have done that kind of thing before, so maybe someone can chime
in with some handy scripts?
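
In the meantime, here's a rough sketch of the title/namespace parsing
(untested; the namespace set below is a partial, English-only stub, and real
namespace lists are per-wiki and localized, so treat it as illustration only):

    from urllib.parse import urlparse, unquote

    # Partial, English-only stub; the real namespace mapping is per-wiki.
    NAMESPACES = {"Talk", "User", "User_talk", "Wikipedia", "Wikipedia_talk",
                  "File", "Template", "Category", "Help", "Portal"}

    def namespace_of(url):
        """Guess the namespace of a /wiki/<title> request URL."""
        path = unquote(urlparse(url).path)
        if not path.startswith("/wiki/"):
            return None  # skip index.php, API, and other non-article paths
        title = path[len("/wiki/"):]
        prefix, sep, _ = title.partition(":")
        if sep and prefix in NAMESPACES:
            return prefix
        return "Main"  # no recognized prefix, so assume the article namespace

    # namespace_of("https://en.wikipedia.org/wiki/Talk:Coffee") -> "Talk"
    # namespace_of("https://en.wikipedia.org/wiki/Coffee")      -> "Main"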
