On 8 January 2015 at 03:02, Gergo Tisza <[email protected]> wrote:
> On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <[email protected]> wrote:
>>
>> We get 120,000 requests a second. We're not storing them all for six
>> months. But we do have sampled logs going back that far.
>
>
> That would be great! Are those in Hadoop?
>
> On Wed, Jan 7, 2015 at 11:36 PM, Oliver Keyes <[email protected]> wrote:
>>
>> Not particularly, I don't think - except to remember that namespace
>> names are localised, so you're going to have a whale of a time
>> matching them (unless you just look for file endings, I guess).
>
>
> In the case of NavigationTiming the nsid is recorded, so that wasn't a
> problem; but it has only been added around May, so for the period before
> that there is no namespace information at all.
>
> Localized file namespace doesn't sound so bad - I can look up all
> translations in Translatewiki, and construct a regexp or a similar
> condition. There could be fun exceptions like namespace translations which
> have changed recently, but I would be fine with assuming the error caused by
> that is not significant.

Well, yes; a 750-option regex run over 6 million rows for a day of data.
A whale of a time ;p. You can also just use the API's namespaceNames and
namespaceAliases code.

> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
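
For what it's worth, here is a minimal sketch in Python of the regexp approach discussed above. It assumes the namespace data has the shape returned by the MediaWiki API's meta=siteinfo query with siprop=namespaces|namespacealiases. The key names it reads ("*", "name", "alias", "canonical") and the helper names file_namespace_names and build_prefix_regex are assumptions made for illustration, not anything taken from this thread. The sample dict is hand-written, not logged data.

```python
import re


def file_namespace_names(siteinfo, ns_id=6):
    """Collect local name, canonical name and aliases for one namespace.

    `siteinfo` is assumed to look like a parsed siteinfo API response;
    both the legacy "*" key and the newer "name"/"alias" keys are tried.
    """
    query = siteinfo["query"]
    names = set()
    ns = query.get("namespaces", {}).get(str(ns_id), {})
    for key in ("*", "name", "canonical"):
        if ns.get(key):
            names.add(ns[key])
    for alias in query.get("namespacealiases", []):
        if alias.get("id") == ns_id:
            value = alias.get("*") or alias.get("alias")
            if value:
                names.add(value)
    return names


def build_prefix_regex(names):
    """Compile one case-insensitive pattern matching any 'Name:' prefix.

    Spaces and underscores are treated as interchangeable, since titles
    may appear in either form in request logs.
    """
    alternatives = []
    # Longest names first, so a longer name is tried before any shorter
    # name that happens to be its prefix.
    for name in sorted(names, key=len, reverse=True):
        parts = [re.escape(p) for p in re.split(r"[ _]+", name) if p]
        alternatives.append("[ _]".join(parts))
    return re.compile(r"^(?:%s):" % "|".join(alternatives), re.IGNORECASE)


# Hand-written sample in the assumed siteinfo shape (illustrative only).
SAMPLE_SITEINFO = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "*": ""},
            "6": {"id": 6, "*": "Datei", "canonical": "File"},
        },
        "namespacealiases": [
            {"id": 6, "*": "Bild"},
            {"id": 7, "*": "Bild Diskussion"},
        ],
    }
}

FILE_PREFIX = build_prefix_regex(file_namespace_names(SAMPLE_SITEINFO))


def is_file_page(title):
    """Return True if the title starts with a known file-namespace prefix."""
    return FILE_PREFIX.match(title) is not None
```

Building one pattern per wiki and compiling it once, rather than running a single alternation of every translation against every row, should keep the per-row cost low. A plain set lookup on the text before the first colon would be the "similar condition" alternative.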
