Yes and no. So, we use a slightly more expanded version of the
ua-parser bot filtering (for example, detecting automata - wget and
Twisted Pagegetter are not bots, but they should absolutely be
filtered) and a slightly more expanded spider detection approach
(there are Wikimedia-specific spiders). To me the greater risk is
undeclared automata; I've had quite a lot of success detecting them
using various concentration and density indexes, such as the
Herfindahl, orienting around {ip,xff} tuples or user agents, but it
requires >=1,000 pageviews to a particular URL to be useful.

So, there is more we can do - but it becomes complex and
computationally intensive, and requires constant hand-coding to
maintain. I have much sympathy for whoever it is in R&D who has to
absorb my work, because a lot of it is maintaining things like this,
and pageviews are of limited utility for most purposes without this
kind of filtering.

On 26 February 2015 at 02:31, Federico Leva (Nemo) <[email protected]> wrote:
> Erik Zachte, 25/02/2015 23:34:
>>
>> Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and
>>
>> http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm
>
>
> Ironholds' looks more vulnerable to bots, it's easier to see in small wikis
> (though, kudos! many more small wikis are included than in wikistats). For
> instance, 20 more percentage points for USA on Breton and Bavarian
> Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese
> bots they look similar, though in some cases I'm not sure what's going on:
> for instance als.wiki also sees CH and RO emerge.
>
> Will the new pageviews definition use the same bot filtering method?
>
> Nemo
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
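
For anyone curious what the concentration-index approach mentioned above might look like, here is a minimal sketch in Python. It is not the production code: the function names, the input shape (an iterable of (url, ip, xff) tuples) and the 0.5 threshold are all assumptions made for illustration. Only the Herfindahl index itself (the sum of squared shares) and the >=1,000 pageview floor come from the message.

```python
from collections import Counter


def herfindahl(counts):
    """Herfindahl index: sum of squared shares, from 1/n (evenly spread) to 1.0 (one source)."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)


def flag_concentrated_urls(requests, min_views=1000, threshold=0.5):
    """Return {url: index} for URLs whose traffic is dominated by few {ip,xff} tuples.

    requests  -- iterable of (url, ip, xff) tuples (hypothetical input shape)
    min_views -- URLs with fewer pageviews are skipped; the index is too noisy there
    threshold -- hypothetical cut-off; a real value would need tuning against known automata
    """
    per_url = {}
    for url, ip, xff in requests:
        per_url.setdefault(url, Counter())[(ip, xff)] += 1

    flagged = {}
    for url, sources in per_url.items():
        if sum(sources.values()) < min_views:
            continue
        index = herfindahl(sources.values())
        if index >= threshold:
            flagged[url] = index
    return flagged
```

The same function could group by user agent instead of {ip,xff} by swapping the Counter key. A URL where one source accounts for 90% of 1,000 requests scores roughly 0.81, while 1,000 requests from 1,000 distinct sources score 0.001, which is why the index separates undeclared automata from organic traffic only once there is enough volume.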
