Hi again analytics list,

Thank you all for your comments and feedback! We consider this thread closed and will now proceed to:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy to encourage the optional inclusion of the word "bot" (case-insensitive) in the User-Agent string, so that bots that generate pageviews not consumed onsite by humans can be easily identified by the Analytics cluster, thus increasing the accuracy of the human-vs-bot traffic split.

2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.

3. The Analytics team should implement regular expressions that match the current User-Agent policy, i.e. User-Agent strings containing: emails, user pages, other mediawiki urls, github urls, and tools.wmflabs.org urls (a rough sketch of such a check is included further down in this mail). This will take some time, and will probably raise technical issues, but it seems we can benefit from it. https://phabricator.wikimedia.org/T125731

Cheers!

On Wed, Feb 3, 2016 at 11:43 PM, Marcel Ruiz Forns <[email protected]> wrote:

> John, thank you a lot for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):
>
>> I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
>
> &
>
>> If you do not want Googlebot to be grouped together with api-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
>
> &
>
>> The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said 'I don't think we need to differentiate between "spiders" and "bots".' If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
>
> Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split the human vs bot traffic with a higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact, as we'll not be able to differentiate "spiders" and "bots", we can only observe the variations of the human vs bot traffic rates over time and try to associate those with recent changes in User-Agent strings or regular expressions.
>
>> The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
>
> Agree, will add that to the proposal.
>
>> One of the strange areas to consider is jquery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
>
> I think the key here is: the program should be tagged as a bot by analytics if it generates pageviews not consumed onsite by a human. I will mention that in the docs, too. Thanks.
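
Interjecting here, since this is where the bot definition comes up: to make point 3 of the plan above a bit more concrete, here is a minimal, untested sketch in Python. The pattern names, example User-Agent strings with made-up tool names, and the regular expressions themselves are purely illustrative assumptions on my part; this is not the actual analytics-refinery-source implementation, and the real patterns will be worked out as part of https://phabricator.wikimedia.org/T125731.

    import re

    # Illustrative sketch only: these are NOT production patterns; the real
    # ones will be defined as part of T125731.
    BOT_UA_PATTERNS = [
        r'bot',                                              # the (optional) "bot" convention
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',   # contact email address
        r'github\.com',                                      # github-hosted clients
        r'tools\.wmflabs\.org',                              # Tool Labs tools
        r'wiki/User:',                                       # user pages / other mediawiki urls
    ]
    BOT_UA_RE = re.compile('|'.join('(?:%s)' % p for p in BOT_UA_PATTERNS),
                           re.IGNORECASE)

    def looks_like_bot(user_agent):
        """True if the User-Agent self-identifies per the conventions above."""
        return bool(user_agent) and BOT_UA_RE.search(user_agent) is not None

    # A well-behaved client following the User-Agent policy is caught
    # (hypothetical tool name and contact address):
    looks_like_bot('MyTool/1.0 (https://tools.wmflabs.org/mytool; [email protected])')  # True
    # Googlebot also matches the plain 'bot' pattern, which is exactly the
    # spider-vs-bot ambiguity John describes:
    looks_like_bot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')  # True
    # Yahoo! Slurp carries a URL but matches none of the patterns above,
    # so it is not caught, as John points out below:
    looks_like_bot('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)')  # False

As the examples show, something along these lines would improve the human-vs-bot split, but it does not by itself separate "spiders" from "bots", which is consistent with the scope described above.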
>> If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
>
> &
>
>> Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
>
> &
>
>> If you are only updating the policy to "encourage" the use of 'bot' in the user-agent, there will not be any concerns, as this is quite common anyway, and it is optional. ;-) The dispute will occur when the addition of 'bot' becomes mandatory.
>
> I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.
>
>> If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks:
>> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>
> &
>
>> By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
>
> &
>
>> [[w:User_agent]] says:
>> "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
>> So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
>
> That is a very good insight. Thanks. Currently, the User-Agent policy is not implemented in our regular expressions, meaning: it does not match emails, nor user pages or other mediawiki urls. It could also, as you suggest, implement matching github accounts, or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for that and add it to the proposal.
>
> Thanks again; in short, I'll send the proposal with the changes.
>
> On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg <[email protected]> wrote:
>
>> On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns <[email protected]> wrote:
>> > Hi all,
>> >
>> > It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
>> >
>> >> 3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
>> >
>> > I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we raise from them.
>> > If we don't get a satisfactory reaction from bot/framework maintainers, we then can go for 3). John, would you be OK with that?
>>
>> I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
>>
>> The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
>>
>> One of the strange areas to consider is jquery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
>>
>> If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
>>
>> If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks:
>> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>>
>> By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
>>
>> Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
>>
>> Poorly written/single-purpose/once-off clients are less of a problem, as forcing change on them has lower impact.
>>
>> [[w:User_agent]] says:
>>
>> "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
>>
>> So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
>>
>> The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said:
>>
>> 'I don't think we need to differentiate between "spiders" and "bots".'
>>
>> If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
>>
>> If you do not want Googlebot to be grouped together with api-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
>>
>> > If no-one else raises concerns about this, the Analytics team will:
>> >
>> > Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
>>
>> If you are only updating the policy to "encourage" the use of 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
>>
>> The dispute will occur when the addition of 'bot' becomes mandatory.
>>
>> --
>> John Vandenberg
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation

--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
