Thanks, John, for your explanation. I can totally see your point.

In the past, the Analytics team also considered enforcing the convention by
blocking those bots that don't follow it. And that is still an option to
consider.

One question for everyone, though: are we considering bots that scrape Wikimedia
sites directly (not using the API)? And are we considering humans that
manually use the API? Those are very difficult to identify, and we could
easily end up with false positives, blocking legitimate requests, no?

Another option for this thread would be: dropping the convention and
continuing to work on regexps and other analyses (like the one done for
last-access devices).
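To make the trade-off concrete, here's a minimal Python sketch of the two approaches (illustrative only; the function names and the regexp pattern are my own, not the actual code running on the analytics cluster):

```python
import re

# Option A: the proposed convention -- any case-insensitive "bot"
# substring in the User-Agent marks the request as bot traffic.
def is_bot_by_convention(user_agent):
    return "bot" in user_agent.lower()

# Option B: keep maintaining a regexp list, similar in spirit to the
# existing spider detection (this pattern set is illustrative only).
BOT_PATTERN = re.compile(
    r"bot|crawler|spider|scraper|python-requests|curl|wget",
    re.IGNORECASE,
)

def is_bot_by_regexp(user_agent):
    return bool(BOT_PATTERN.search(user_agent))
```

Option A is trivially cheap but only catches clients that opt in; Option B also catches some non-compliant clients but needs ongoing regexp maintenance. Neither helps with the direct-scraping bots or manual API users mentioned above.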

More thoughts?

On Mon, Feb 1, 2016 at 3:39 PM, John Mark Vandenberg <[email protected]>
wrote:

> Hi Marcel,
>
> It will take time for frameworks to implement an amended User-Agent policy.
> For example, pywikipedia (pywikibot compat) is not actively
> maintained. We don't know how much traffic is generated by compat.
> There was a task filed against Analytics for this, but Dan Andreescu
> removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>
> There are a lot of clients that need to be upgraded or
> decommissioned for this 'add bot' strategy to be effective in the near
> future. See https://www.mediawiki.org/wiki/API:Client_code
>
> The all-important missing step is
>
> 3. Create a plan to block clients that don't implement the (amended)
> User-Agent policy.
>
> Without that plan, successfully implemented, you will not get quality
> data (e.g. using 'Netscape' in the U-A to guess 'human' would perform
> better).
>
> On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns <[email protected]>
> wrote:
> > So, trying to bring together everyone's points of view, what about this:
> >
> > Using the existing https://meta.wikimedia.org/wiki/User-Agent_policy and
> > modifying it to encourage adding the word "bot" (case-insensitive) to the
> > User-Agent string, so that it can easily be used to identify bots in the
> > analytics cluster (no regexps). And linking that page from whatever other
> > pages we think necessary.
> >
> > Do some advertising and outreach and get some bot maintainers and maybe
> > some frameworks to implement the User-Agent policy. This would make the
> > existing policy less useless.
> >
> > Thanks all for the feedback!
> >
> > On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <[email protected]>
> > wrote:
> >>>
> >>> Clearly Wikipedia et al. uses bot to refer to automated software that
> >>> edits the site, but it seems like you are using the term bot to refer
> >>> to all automated software, and it might be good to clarify.
> >>
> >>
> >> OK, in the documentation we can make that clear. And looking into that,
> >> I've seen that some bots, in the process of doing their "editing" work,
> >> can also generate pageviews. So we should also include them as a potential
> >> source of pageview traffic. Maybe we can reuse the existing User-Agent
> >> policy.
> >>
> >>
> >>> This makes a lot of sense. If I build a bot that crawls Wikipedia and
> >>> Facebook public pages, it really doesn't make sense that my bot has a
> >>> "wikimediaBot" user agent; just the word "Bot" should probably be
> >>> enough.
> >>
> >>
> >> Totally agree.
> >>
> >>
> >>> I guess a bigger question is why try to differentiate between "spiders"
> >>> and "bots" at all?
> >>
> >>
> >> I don't think we need to differentiate between "spiders" and "bots". The
> >> most important question we want to answer is: how much of the traffic we
> >> consider "human" today is actually "bot"? So, +1 for "bot"
> >> (case-insensitive).
> >>
> >>
> >> On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg <[email protected]>
> >> wrote:
> >>>
> >>> On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" <[email protected]>
> >>> wrote:
> >>> >>
> >>> >> Why not just "Bot", or "MediaWikiBot", which at least encompasses all
> >>> >> sites that the client
> >>> >> can communicate with.
> >>> >
> >>> > I personally agree with you; "MediaWikiBot" seems to have better
> >>> > semantics.
> >>>
> >>> For clients accessing the MediaWiki API, it is redundant.
> >>> All it does is identify bots that comply with this edict from
> >>> analytics.
> >>>
> >>> --
> >>> John Vandenberg
> >>>
> >>>
> >>> _______________________________________________
> >>> Analytics mailing list
> >>> [email protected]
> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>
> >>
> >>
> >>
> >> --
> >> Marcel Ruiz Forns
> >> Analytics Developer
> >> Wikimedia Foundation
> >
> >
> >
> >
> > --
> > Marcel Ruiz Forns
> > Analytics Developer
> > Wikimedia Foundation
> >
>
>
>
> --
> John Vandenberg
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation