John, thank you very much for taking the time to answer my questions. My
responses are inline (I rearranged some of your paragraphs to respond to them
together):

> I think you need to clearly define what you want to capture and
> classify, and re-evaluate what change to the user-agent policy will
> have any noticeable impact on your detection accuracy in the next five
> years.

&

> If you do not want Googlebot to be grouped together with api based
> bots , either the user-agent need to use something more distinctive,
> such as 'MediaWikiBot', or you will need another regex of all the
> 'bot' matches which you dont want to be a bot.

&

> The `analytics-refinery-source` code currently differentiates between
> spider and bot, but earlier in this thread you said
>   'I don't think we need to differentiate between "spiders" and "bots".'
> If you require 'bot' in the user-agent for bots, this will also
> capture Googlebot and YandexBot, and many other tools which use 'bot'
> .  Do you want Googlebot to be a bot?
> But Yahoo! Slurp's useragent doesnt include bot will not.
> So you will still need a long regex for user-agents of tools which you
> can't impose this change onto.

Differentiating between "spiders" and "bots" can be very tricky, as you
explain. There was some work on that in the past, but what we really want at
the moment is to split human vs. bot traffic with higher accuracy. I will add
that to the docs, thanks. Regarding measuring the impact: as we won't be able
to differentiate "spiders" from "bots", we can only observe how the human vs.
bot traffic rates vary over time and try to associate those variations with
recent changes in User-Agent strings or in our regular expressions.
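
To illustrate with a quick sketch (this is not our actual refinery code,
just a toy check against a few well-known User-Agent strings): a plain
case-insensitive "bot" match would group Googlebot, YandexBot and
pywikibot together, but would miss Yahoo! Slurp, so a separate crawler
list would still be needed alongside it:

    import re

    # Toy check, not the analytics-refinery-source regex: flag anything
    # whose User-Agent contains 'bot', case-insensitively.
    BOT_PATTERN = re.compile(r'bot', re.IGNORECASE)

    EXAMPLES = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
        "pywikibot/2.0 (User:ExampleUser)",  # illustrative pywikibot-style UA
    ]

    for ua in EXAMPLES:
        print("bot" if BOT_PATTERN.search(ua) else "not bot", "-", ua)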

> The eventual definition of 'bot' will be very central to this issue.
> Which tools need to start adding 'bot'?  What is 'human' use?  This
> terminology problem has caused much debate on the wikis, reaching
> arbcom several times.  So, precision in the definition will be quite
> helpful.

Agreed, I will add that to the proposal.

> One of the strange area's to consider is jquery-based tools that are
> effectively bots, performing large numbers of operations on pages in
> batches with only high-level commands being given by a human.  e.g.
> the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
> scripts are also not a 'bot'.

I think the key here is: a program should be tagged as a bot by analytics if
it generates pageviews that are not consumed onsite by a human. I will mention
that in the docs, too. Thanks.


> If gadgets and user-scripts may need to follow the new 'bot' rule of
> the user-agent policy, the number of developers that need to be
> engaged is much larger.

&

> Please understand the gravity of what you are imposing.  Changing a
> user-agent of a client is a breaking change, and any decent MediaWiki
> client is also used by non-Wikimedia wikis, administrated by
> non-Wikimedia ops teams, who may have their own tools doing analysis
> of user-agents hitting their servers, possibly including access
> control rules.  And their rules and scripts may break when a client
> framework changes its user-agent in order to make the Wikimedia
> Analytics scripts easier.  Strictly speaking your user-agent policy
> proposal requires a new _major_ release for every client framework
> that you do not grandfather into your proposed user-agent policy.

&

> If you are only updating the policy to "encourage" the use of the
> 'bot' in the user-agent, there will not be any concerns as this is
> quite common anyway, and it is optional. ;-)
> The dispute will occur when the addition of 'bot' becomes mandatory.

I see your point. The addition of "bot" will be optional (as is the rest of
the policy); we will make that clear in the docs.

> If the proposal is to require only 'bot' in the user-agent,
> pywikipediabot and pywikibot both need no change to add it (yay!, but
> do we need to add 'human' to the user-agent for some scripts??), but
> many client frameworks will still need to change their user-agent,
> including for example both of the Go frameworks.
> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>
> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21

&

> By doing some analysis of the existing user-agents hitting your
> servers, maybe you can find an easy way to grandfather in most client
> frameworks.   e.g. if you also add 'github' as a bot pattern, both Go
> frameworks are automatically now also supported.

&

> [[w:User_agent]] says:
> "Bots, such as Web crawlers, often also include a URL and/or e-mail
> address so that the Webmaster can contact the operator of the bot."
> So including URL/email as part of your detection should capture most
> well written bots.
> Also including any requests from tools.wmflabs.org and friends as
> 'bot' might also be a useful improvement.

That is a very good insight, thanks. Currently, the User-Agent policy is not
reflected in our regular expressions, meaning they do not match emails, user
pages, or other MediaWiki URLs. They could also, as you suggest, match GitHub
accounts or tools.wmflabs.org. We in Analytics should tackle that; I will
create a task for it and add it to the proposal.
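
As a rough sketch of what that task could cover (again, not the current
analytics-refinery-source implementation; the exact patterns below are
just placeholders):

    import re

    # Sketch of additional "bot" signals beyond the plain 'bot' substring;
    # not the current analytics-refinery-source implementation.
    BOT_SIGNALS = re.compile(
        r'bot'                    # 'bot' anywhere in the string, per the policy
        r'|https?://'             # a contact/documentation URL
        r'|[\w.+-]+@[\w.-]+'      # a contact email address
        r'|github\.com'           # client frameworks hosted on GitHub
        r'|tools\.wmflabs\.org',  # Toolforge-hosted tools
        re.IGNORECASE,
    )

    def looks_like_bot(user_agent):
        # Well-behaved bots usually include a URL or email ([[w:User_agent]]),
        # and the amended policy would ask for 'bot' in the string.
        return bool(BOT_SIGNALS.search(user_agent))

The exact patterns would of course need tuning against real traffic, but
something along those lines could be a starting point for the task.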

Thanks again; in short, I'll send the updated proposal with these changes.

On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg <[email protected]>
wrote:

> On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns <[email protected]>
> wrote:
> > Hi all,
> >
> > It seems comments are decreasing at this point. I'd like to slowly drive
> > this thread to a conclusion.
> >
> >
> >> 3. Create a plan to block clients that dont implement the (amended)
> >> User-Agent policy.
> >
> >
> > I think we can decide on this later. Steps 1) and 2) can be done first -
> > they should be done anyway before 3) - and then we can see how much
> benefit
> > we raise from them. If we don't get a satisfactory reaction from
> > bot/framework maintainers, we then can go for 3). John, would you be OK
> with
> > that?
>
> I think you need to clearly define what you want to capture and
> classify, and re-evaluate what change to the user-agent policy will
> have any noticeable impact on your detection accuracy in the next five
> years.
>
> The eventual definition of 'bot' will be very central to this issue.
> Which tools need to start adding 'bot'?  What is 'human' use?  This
> terminology problem has caused much debate on the wikis, reaching
> arbcom several times.  So, precision in the definition will be quite
> helpful.
>
> One of the strange area's to consider is jquery-based tools that are
> effectively bots, performing large numbers of operations on pages in
> batches with only high-level commands being given by a human.  e.g.
> the gadget Cat-a-Lot.  If those are not a 'bot', then many pywikibot
> scripts are also not a 'bot'.
>
> If gadgets and user-scripts may need to follow the new 'bot' rule of
> the user-agent policy, the number of developers that need to be
> engaged is much larger.
>
> If the proposal is to require only 'bot' in the user-agent,
> pywikipediabot and pywikibot both need no change to add it (yay!, but
> do we need to add 'human' to the user-agent for some scripts??), but
> many client frameworks will still need to change their user-agent,
> including for example both of the Go frameworks.
> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>
> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>
> By doing some analysis of the existing user-agents hitting your
> servers, maybe you can find an easy way to grandfather in most client
> frameworks.   e.g. if you also add 'github' as a bot pattern, both Go
> frameworks are automatically now also supported.
>
> Please understand the gravity of what you are imposing.  Changing a
> user-agent of a client is a breaking change, and any decent MediaWiki
> client is also used by non-Wikimedia wikis, administrated by
> non-Wikimedia ops teams, who may have their own tools doing analysis
> of user-agents hitting their servers, possibly including access
> control rules.  And their rules and scripts may break when a client
> framework changes its user-agent in order to make the Wikimedia
> Analytics scripts easier.  Strictly speaking your user-agent policy
> proposal requires a new _major_ release for every client framework
> that you do not grandfather into your proposed user-agent policy.
>
> Poorly written/single-purpose/once-off clients are less of a problem,
> as forcing change on them has lower impact.
>
> [[w:User_agent]] says:
>
> "Bots, such as Web crawlers, often also include a URL and/or e-mail
> address so that the Webmaster can contact the operator of the bot."
>
> So including URL/email as part of your detection should capture most
> well written bots.
> Also including any requests from tools.wmflabs.org and friends as
> 'bot' might also be a useful improvement.
>
> The `analytics-refinery-source` code currently differentiates between
> spider and bot, but earlier in this thread you said
>
>   'I don't think we need to differentiate between "spiders" and "bots".'
>
> If you require 'bot' in the user-agent for bots, this will also
> capture Googlebot and YandexBot, and many other tools which use 'bot'
> .  Do you want Googlebot to be a bot?
>
> But Yahoo! Slurp's useragent doesnt include bot will not.
>
> So you will still need a long regex for user-agents of tools which you
> can't impose this change onto.
>
> If you do not want Googlebot to be grouped together with api based
> bots , either the user-agent need to use something more distinctive,
> such as 'MediaWikiBot', or you will need another regex of all the
> 'bot' matches which you dont want to be a bot.
>
> > If no-one else raises concerns about this, the Analytics team will:
> >
> > Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to
> > encourage including the word "bot" (case-insensitive) in the User-Agent
> > string, so that bots can be easily identified.
>
> If you are only updating the policy to "encourage" the use of the
> 'bot' in the user-agent, there will not be any concerns as this is
> quite common anyway, and it is optional. ;-)
>
> The dispute will occur when the addition of 'bot' becomes mandatory.
>
> --
> John Vandenberg
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
