Hi again analytics list,

Thank you all for your comments and feedback! We consider this thread closed and will now proceed to:
1. Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy to encourage the optional inclusion of the word "bot" (case-insensitive) in the User-Agent string, so that bots that generate pageviews not consumed onsite by humans can be easily identified by the Analytics cluster, thus increasing the accuracy of the human-vs-bot traffic split.

2. Advertise the convention and reach out to bot/framework maintainers to increase the share of bots that implement the User-Agent policy.

3. The Analytics team should implement regular expressions that match the current User-Agent policy, i.e. User-Agent strings containing: emails, user pages, other mediawiki urls, github urls, and tools.wmflabs.org urls (a rough sketch of such a check is included further down in this mail). This will take some time, and will probably raise technical issues, but it seems we can benefit from it. https://phabricator.wikimedia.org/T125731

Cheers!

On Wed, Feb 3, 2016 at 11:43 PM, Marcel Ruiz Forns <[email protected]> wrote:

> John, thank you a lot for taking the time to answer my question. My responses are inline (I rearranged some of your paragraphs to respond to them together):
>
>> I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
>
> &
>
>> If you do not want Googlebot to be grouped together with api-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
>
> &
>
>> The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said 'I don't think we need to differentiate between "spiders" and "bots".' If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
>
> Differentiating between "spiders" and "bots" can be very tricky, as you explain. There was some work on it in the past, but what we really want at the moment is to split the human vs bot traffic with a higher accuracy. I will add that to the docs, thanks. Regarding measuring the impact, as we'll not be able to differentiate "spiders" and "bots", we can only observe the variations of the human vs bot traffic rates over time and try to associate those with recent changes in User-Agent strings or regular expressions.
>
>> The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
>
> Agree, will add that to the proposal.
>
>> One of the strange areas to consider is jquery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
>
> I think the key here is: the program should be tagged as a bot by analytics if it generates pageviews not consumed onsite by a human. I will mention that in the docs, too. Thanks.
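
Interjecting here, since this is where the bot definition comes up: to make point 3 of the plan above a bit more concrete, here is a minimal, untested sketch in Python. The pattern names, example User-Agent strings with made-up tool names, and the regular expressions themselves are purely illustrative assumptions on my part; this is not the actual analytics-refinery-source implementation, and the real patterns will be worked out as part of https://phabricator.wikimedia.org/T125731.

    import re

    # Illustrative sketch only: these are NOT production patterns; the real
    # ones will be defined as part of T125731.
    BOT_UA_PATTERNS = [
        r'bot',                                              # the (optional) "bot" convention
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',   # contact email address
        r'github\.com',                                      # github-hosted clients
        r'tools\.wmflabs\.org',                              # Tool Labs tools
        r'wiki/User:',                                       # user pages / other mediawiki urls
    ]
    BOT_UA_RE = re.compile('|'.join('(?:%s)' % p for p in BOT_UA_PATTERNS),
                           re.IGNORECASE)

    def looks_like_bot(user_agent):
        """True if the User-Agent self-identifies per the conventions above."""
        return bool(user_agent) and BOT_UA_RE.search(user_agent) is not None

    # A well-behaved client following the User-Agent policy is caught
    # (hypothetical tool name and contact address):
    looks_like_bot('MyTool/1.0 (https://tools.wmflabs.org/mytool; [email protected])')  # True
    # Googlebot also matches the plain 'bot' pattern, which is exactly the
    # spider-vs-bot ambiguity John describes:
    looks_like_bot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')  # True
    # Yahoo! Slurp carries a URL but matches none of the patterns above,
    # so it is not caught, as John points out below:
    looks_like_bot('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)')  # False

As the examples show, something along these lines would improve the human-vs-bot split, but it does not by itself separate "spiders" from "bots", which is consistent with the scope described above.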
>> If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
>
> &
>
>> Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
>
> &
>
>> If you are only updating the policy to "encourage" the use of 'bot' in the user-agent, there will not be any concerns, as this is quite common anyway, and it is optional. ;-) The dispute will occur when the addition of 'bot' becomes mandatory.
>
> I see your point. The addition of "bot" will be optional (as is the rest of the policy); we will make that clear in the docs.
>
>> If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks:
>> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>
> &
>
>> By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
>
> &
>
>> [[w:User_agent]] says:
>> "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
>> So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
>
> That is a very good insight. Thanks. Currently, the User-Agent policy is not implemented in our regular expressions, meaning: it does not match emails, nor user pages or other mediawiki urls. It could also, as you suggest, implement matching github accounts, or tools.wmflabs.org. We in Analytics should tackle that. I will create a task for that and add it to the proposal.
>
> Thanks again; in short, I'll send the proposal with the changes.
>
> On Wed, Feb 3, 2016 at 1:00 AM, John Mark Vandenberg <[email protected]> wrote:
>
>> On Wed, Feb 3, 2016 at 6:40 AM, Marcel Ruiz Forns <[email protected]> wrote:
>> > Hi all,
>> >
>> > It seems comments are decreasing at this point. I'd like to slowly drive this thread to a conclusion.
>> >
>> >> 3. Create a plan to block clients that don't implement the (amended) User-Agent policy.
>> >
>> > I think we can decide on this later. Steps 1) and 2) can be done first - they should be done anyway before 3) - and then we can see how much benefit we raise from them.
>> > If we don't get a satisfactory reaction from bot/framework maintainers, we then can go for 3). John, would you be OK with that?
>>
>> I think you need to clearly define what you want to capture and classify, and re-evaluate what change to the user-agent policy will have any noticeable impact on your detection accuracy in the next five years.
>>
>> The eventual definition of 'bot' will be very central to this issue. Which tools need to start adding 'bot'? What is 'human' use? This terminology problem has caused much debate on the wikis, reaching arbcom several times. So, precision in the definition will be quite helpful.
>>
>> One of the strange areas to consider is jquery-based tools that are effectively bots, performing large numbers of operations on pages in batches with only high-level commands being given by a human, e.g. the gadget Cat-a-Lot. If those are not a 'bot', then many pywikibot scripts are also not a 'bot'.
>>
>> If gadgets and user-scripts may need to follow the new 'bot' rule of the user-agent policy, the number of developers that need to be engaged is much larger.
>>
>> If the proposal is to require only 'bot' in the user-agent, pywikipediabot and pywikibot both need no change to add it (yay!, but do we need to add 'human' to the user-agent for some scripts??), but many client frameworks will still need to change their user-agent, including for example both of the Go frameworks:
>> https://github.com/sadbox/mediawiki/blob/master/mediawiki.go#L163
>> https://github.com/cgt/go-mwclient/blob/d40301c3a6ca46f614bce5d283fe4fe762ad7205/core.go#L21
>>
>> By doing some analysis of the existing user-agents hitting your servers, maybe you can find an easy way to grandfather in most client frameworks. e.g. if you also add 'github' as a bot pattern, both Go frameworks are automatically now also supported.
>>
>> Please understand the gravity of what you are imposing. Changing a user-agent of a client is a breaking change, and any decent MediaWiki client is also used by non-Wikimedia wikis, administered by non-Wikimedia ops teams, who may have their own tools doing analysis of user-agents hitting their servers, possibly including access control rules. And their rules and scripts may break when a client framework changes its user-agent in order to make the Wikimedia Analytics scripts easier. Strictly speaking, your user-agent policy proposal requires a new _major_ release for every client framework that you do not grandfather into your proposed user-agent policy.
>>
>> Poorly written/single-purpose/once-off clients are less of a problem, as forcing change on them has lower impact.
>>
>> [[w:User_agent]] says:
>>
>> "Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot."
>>
>> So including URL/email as part of your detection should capture most well written bots. Also including any requests from tools.wmflabs.org and friends as 'bot' might also be a useful improvement.
>>
>> The `analytics-refinery-source` code currently differentiates between spider and bot, but earlier in this thread you said:
>>
>> 'I don't think we need to differentiate between "spiders" and "bots".'
>>
>> If you require 'bot' in the user-agent for bots, this will also capture Googlebot and YandexBot, and many other tools which use 'bot'. Do you want Googlebot to be a bot? But Yahoo! Slurp's user-agent doesn't include 'bot', so it will not be captured. So you will still need a long regex for user-agents of tools which you can't impose this change onto.
>>
>> If you do not want Googlebot to be grouped together with api-based bots, either the user-agent needs to use something more distinctive, such as 'MediaWikiBot', or you will need another regex of all the 'bot' matches which you don't want to be a bot.
>>
>> > If no-one else raises concerns about this, the Analytics team will:
>> >
>> > Add a mention to https://meta.wikimedia.org/wiki/User-Agent_policy, to encourage including the word "bot" (case-insensitive) in the User-Agent string, so that bots can be easily identified.
>>
>> If you are only updating the policy to "encourage" the use of 'bot' in the user-agent, there will not be any concerns as this is quite common anyway, and it is optional. ;-)
>>
>> The dispute will occur when the addition of 'bot' becomes mandatory.
>>
>> --
>> John Vandenberg
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation

--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
