Yes and no. We use a slightly expanded version of the ua-parser bot
filtering (for example, detecting automata: wget and Twisted
PageGetter are not bots, but they should absolutely be filtered) and
a slightly expanded spider detection approach (there are
Wikimedia-specific spiders). To me the greater risk is undeclared
automata; I've had quite a lot of success detecting them using
various concentration and density indexes, such as the Herfindahl,
computed over {ip,xff} tuples or user agents, but it requires
>=1,000 pageviews to a particular URL to be useful.
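
For anyone unfamiliar with the index, here is a minimal sketch of the
idea in Python, not the production code. The function names and the
0.5 cutoff are hypothetical and chosen only for illustration; the
1,000-pageview floor is the one mentioned above. The Herfindahl index
is the sum of squared shares, so it approaches 1 when a single
{ip,xff} tuple or user agent accounts for nearly all requests to a URL.

```python
from collections import Counter

# Floor mentioned above: below this the index is too noisy to be useful.
MIN_PAGEVIEWS = 1000


def herfindahl(keys):
    """Sum of squared shares of each distinct key (1/n .. 1.0)."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one pageview")
    return sum((n / total) ** 2 for n in counts.values())


def looks_automated(keys, cutoff=0.5):
    """Flag one URL's traffic as suspiciously concentrated.

    `keys` holds one entry per pageview, e.g. an (ip, xff) tuple or a
    user agent string. `cutoff` is an illustrative assumption, not a
    tuned value. Returns None when there is too little data to judge.
    """
    keys = list(keys)
    if len(keys) < MIN_PAGEVIEWS:
        return None
    return herfindahl(keys) >= cutoff
```

With 900 of 1,000 pageviews coming from one tuple the index is about
0.81 and the URL is flagged; with 1,000 distinct tuples it is 0.001
and the URL passes.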

So, there is more we can do, but it becomes complex and
computationally intensive, and requires constant hand-coding to
maintain. I have much sympathy for whoever in R&D has to absorb my
work, because a lot of it is maintaining things like this, and
pageviews are of limited utility for most purposes without this kind
of filtering.

On 26 February 2015 at 02:31, Federico Leva (Nemo) <[email protected]> wrote:
> Erik Zachte, 25/02/2015 23:34:
>>
>> Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/  and
>>
>> http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm
>
>
> Ironholds' looks more vulnerable to bots, it's easier to see in small wikis
> (though, kudos! many more small wikis are included than in wikistats). For
> instance, 20 more percentage points for USA on Breton and Bavarian
> Wikipedias, 30 on Welsh, 40 on Alemannic, almost 50 on Kurdish. For Chinese
> bots they look similar, though in some cases I'm not sure what's going on:
> for instance als.wiki also sees CH and RO emerge.
>
> Will the new pageviews definition use the same bot filtering method?
>
> Nemo
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
