Neil:

Some of the rules used to identify automated traffic have been used by the
community for a couple of years now. See, for example, [1] and [2]. For more
information, you can always ping us.

Thanks,

Nuria

[1] https://tools.wmflabs.org/topviews/faq/#false_positive
[2] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions



On Wed, May 13, 2020 at 7:44 AM Neil Shah-Quinn <[email protected]>
wrote:

> Nuria,
>
> Thank you for this update! I'm very excited about this new system.
>
> I did notice that there's not much explanation of the particular rules or
> strategies that are used to identify automated traffic, or a link to the
> implementing code. I can imagine this might be intentional, to make it
> harder for the spammers and vandals to evade the system. If so, it would be
> helpful to update the page to say that explicitly and explain how people
> can request more details if they have a legitimate need for them.
>
> On Tue, 5 May 2020 at 02:40, Nuria Ruiz <[email protected]> wrote:
>
>> Hello:
>>
>> We have added the 'automated' marker to Wikimedia's pageview data. Until
>> now, pageview agents were classified as 'spider' (self-reported bots like
>> 'google bot' or 'bing bot') and 'user'.
>>
>> We have known for a while that some requests classified as 'user' were,
>> in fact, coming from automated agents not disclosed as such. Our community
>> has known this well: for a couple of years now it has been applying
>> filtering rules to every "Top X" list compiled [1]. We have
>> incorporated some of these filters (and others) into our automated traffic
>> detection and, as of this week, traffic that meets the filtering
>> criteria is automatically excluded from the "top"
>> lists reported by the pageview API.
>>
>> The effect of removing pageviews marked as 'automated' from the overall
>> user traffic is about a 5.6% reduction in pageviews labeled as "user" [2]
>> over the course of a month. Not all projects are affected equally when it
>> comes to the reduction of "user" pageviews. The biggest effect is on English
>> Wikipedia (8-10%), whereas projects like the Japanese Wikipedia are only
>> mildly affected (< 1%).
>>
>> If you are curious as to what problems this type of traffic causes in the
>> data, this ticket for Hungarian Wikipedia is a good example of the issues
>> inflicted by what we call "bot vandalism/bot spam":
>> https://phabricator.wikimedia.org/T237282
>>
>> Given the delicate nature of this data, we have worked for many months now
>> on vetting the algorithms we are using. We would appreciate reports via phab
>> ticket of any issues you might find.
>>
>> Thanks,
>>
>> Nuria
>>
>> [1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
>> [2]
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Global_Impact_-_All_wikimedia_projects
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
