On 8 January 2015 at 03:02, Gergo Tisza <[email protected]> wrote:
> On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <[email protected]> wrote:
>>
>> We get 120,000 requests a second. We're not storing them all for six
>> months. But we do have sampled logs going back that far.
>
>
> That would be great! Are those in Hadoop?
>
> On Wed, Jan 7, 2015 at 11:36 PM, Oliver Keyes <[email protected]> wrote:
>>
>> Not particularly, I don't think - except to remember that namespace
>> names are localised, so you're going to have a whale of a time
>> matching them (unless you just look for file endings, I guess).
>
>
> In the case of NavigationTiming the nsid is recorded, so that wasn't a
> problem; but it has only been added around May, so for the period before
> that there is no namespace information at all.
>
> Localized file namespace doesn't sound so bad - I can look up all
> translations in Translatewiki, and construct a regexp or a similar
> condition. There could be fun exceptions like namespace translations which
> have changed recently, but I would be fine with assuming the error caused by
> that is not significant.

Well, yes; a 750-option regex run over 6 million rows for a day of
data. A whale of a time ;p. You can also just use the API's
namespaceNames and namespaceAliases code.
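
For what it's worth, here is a minimal sketch of that approach in Python, not a tested pipeline. It assumes you have already fetched a siteinfo response per wiki (action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json) and that it has the legacy JSON shape shown in the sample. The helper names and the sample data are mine, purely for illustration. Namespace id 6 is File.

```python
import re

FILE_NS_ID = 6  # id of the File namespace in MediaWiki


def file_namespace_names(siteinfo):
    """Collect local name, canonical name and aliases of the File namespace.

    `siteinfo` is assumed to be the parsed JSON of a siteinfo query that
    requested both `namespaces` and `namespacealiases`.
    """
    query = siteinfo["query"]
    names = set()
    ns = query["namespaces"][str(FILE_NS_ID)]
    names.add(ns["*"])
    if "canonical" in ns:
        names.add(ns["canonical"])
    for alias in query.get("namespacealiases", []):
        if alias["id"] == FILE_NS_ID:
            names.add(alias["*"])
    return names


def build_file_title_regex(names):
    """Build one case-insensitive regex matching any 'Name:' title prefix."""
    alternatives = "|".join(
        re.escape(name.replace(" ", "_"))
        for name in sorted(names, key=len, reverse=True)
    )
    return re.compile(r"^(?:%s):" % alternatives, re.IGNORECASE)


def is_file_page(title, pattern):
    """True if the title starts with one of the File namespace prefixes."""
    # Spaces and underscores are interchangeable in titles, so normalise first.
    return pattern.match(title.replace(" ", "_")) is not None


# Illustrative sample only, shaped like a siteinfo response for one wiki.
sample_siteinfo = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "*": ""},
            "6": {"id": 6, "*": "Datei", "canonical": "File"},
        },
        "namespacealiases": [
            {"id": 6, "*": "Bild"},
            {"id": 4, "*": "WP"},
        ],
    }
}

pattern = build_file_title_regex(file_namespace_names(sample_siteinfo))
```

Building the pattern per wiki rather than one giant alternation across all languages keeps each regex to a handful of options, which should matter when it runs over millions of rows.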

>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
