Neil:

Some of the rules used to identify automated traffic have been used by the community for a couple of years now. See, for example, [1] and [2]. For more information you can always ping us.

Thanks,
Nuria

[1] https://tools.wmflabs.org/topviews/faq/#false_positive
[2] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions

On Wed, May 13, 2020 at 7:44 AM Neil Shah-Quinn <[email protected]> wrote:

> Nuria,
>
> Thank you for this update! I'm very excited about this new system.
>
> I did notice that there's not much explanation of the particular rules or
> strategies that are used to identify automated traffic, or a link to the
> implementing code. I can imagine this might be intentional, to make it
> harder for the spammers and vandals to evade the system. If so, it would be
> helpful to update the page to say that explicitly and explain how people
> can request more details if they have a legitimate need for them.
>
> On Tue, 5 May 2020 at 02:40, Nuria Ruiz <[email protected]> wrote:
>
>> Hello:
>>
>> We have added the 'automated' marker to Wikimedia's pageview data. Up to
>> now pageview agents were classified as 'spider' (self reported bots like
>> 'google bot' or 'bing bot') and 'user'.
>>
>> We have known for a while that some requests classified as 'user' were,
>> in fact, coming from automated agents not disclosed as such. This was a
>> well known fact for our community as for a couple years now they have been
>> applying filtering rules for any "Top X" list compiled [1]. We have
>> incorporated some of these filters (and others) to our automated traffic
>> detection and, as of this week, traffic that meets the filtering
>> criteria is now automatically excluded from being counted towards "top"
>> lists reported by the pageview API.
>>
>> The effect of removing pageviews marked as 'automated' from the overall
>> user traffic is about a 5.6% reduction of pageviews labeled as "user" [2]
>> in the course of a month. Not all projects are affected equally when it
>> comes to reduction of "user pageviews". The biggest effect is on English
>> Wikipedia (8-10%). However, projects like the Japanese Wikipedia are mildly
>> affected (< 1%).
>>
>> If you are curious as to what problems this type of traffic causes in the
>> data, this ticket for Hungarian Wikipedia is a good example of issues
>> inflicted by what we call "bot vandalism/bot spam":
>> https://phabricator.wikimedia.org/T237282
>>
>> Given the delicate nature of this data we have worked for many months now
>> on vetting the algorithms we are using. We will appreciate reports via phab
>> ticket for any issues you might find.
>>
>> Thanks,
>>
>> Nuria
>>
>> [1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
>> [2]
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Global_Impact_-_All_wikimedia_projects
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
