>In the past, the Analytics team also considered enforcing the convention
by blocking those bots that don't follow it. And that is still an option to
consider.
I would like to point out that this is probably the prerogative of the
API team rather than Analytics.


>Another option to this thread would be: cancelling the convention and
continue working on regexps
I think that, regardless of our convention, we will always be doing regex
detection of self-identified bots. Maybe I am missing some nuance here?
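
For concreteness, a minimal sketch of what that regex detection could look
like (the pattern and the sample strings are illustrative, not our
production list):

    import re

    # Case-insensitive match for self-identified bots, per the proposed
    # convention of including the word "bot" in the User-Agent string.
    # Deliberately loose: it also matches e.g. "robot" or "Googlebot".
    BOT_UA_PATTERN = re.compile(r'bot', re.IGNORECASE)

    def is_self_identified_bot(user_agent):
        """Return True if the User-Agent string self-identifies as a bot."""
        return bool(BOT_UA_PATTERN.search(user_agent or ''))

    # Illustrative examples:
    # is_self_identified_bot('ExampleTool/1.0 (https://example.org) bot')  # True
    # is_self_identified_bot('Mozilla/5.0 (X11; Linux x86_64)')            # False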





On Mon, Feb 1, 2016 at 10:42 AM, Nuria Ruiz <[email protected]> wrote:

> >It will take time for frameworks to implement an amended User-Agent
> >policy.
> >For example, pywikipedia (pywikibot compat) is not actively
> >maintained.
> That doesn't imply we shouldn't have a policy that anyone can refer to;
> these bots will not follow it until they get some maintainers.
>
> >There was a task filed against Analytics for this, but Dan Andreescu
> >removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>
> Sorry that the tagging is confusing. I think the Analytics tag was removed
> because this is a request for data and our team doesn't do data retrieval.
> We normally tag Phabricator items with "Analytics" when they have
> actionables for our team.
> I am cc-ing Bryan, who has already done some analysis of bot requests to
> the API and can probably provide some data.
>
>
>
>
> On Mon, Feb 1, 2016 at 6:39 AM, John Mark Vandenberg <[email protected]>
> wrote:
>
>> Hi Marcel,
>>
>> It will take time for frameworks to implement an amended User-Agent
>> policy.
>> For example, pywikipedia (pywikibot compat) is not actively
>> maintained.  We don't know how much traffic is generated by compat.
>> There was a task filed against Analytics for this, but Dan Andreescu
>> removed Analytics (https://phabricator.wikimedia.org/T99373#1859170).
>>
>> There are a lot of clients that need to be upgraded or decommissioned
>> for this 'add bot' strategy to be effective in the near future. See
>> https://www.mediawiki.org/wiki/API:Client_code
>>
>> The all-important missing step is:
>>
>> 3. Create a plan to block clients that don't implement the (amended)
>> User-Agent policy.
>>
>> Without that plan successfully implemented, you will not get quality
>> data (i.e. using 'Netscape' in the U-A to guess 'human' would perform
>> better).
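>>
>> For illustration only, such a block could hinge on a simple policy test,
>> e.g. rejecting clients whose User-Agent carries no contact URL or email
>> address (a hypothetical heuristic, not an agreed rule):
>>
>>     import re
>>
>>     # Hypothetical compliance check: a policy-following User-Agent
>>     # should include a contact URL or email address.
>>     CONTACT_RE = re.compile(r'https?://\S+|[\w.+-]+@[\w.-]+\.\w+')
>>
>>     def should_block(user_agent):
>>         """Return True if the client fails the (amended) UA policy."""
>>         return not (user_agent and CONTACT_RE.search(user_agent))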
>>
>> On Tue, Feb 2, 2016 at 1:24 AM, Marcel Ruiz Forns <[email protected]>
>> wrote:
>> > So, trying to join everyone's points of view, what about the following?
>> >
>> > Use the existing https://meta.wikimedia.org/wiki/User-Agent_policy and
>> > modify it to encourage adding the word "bot" (case-insensitive) to the
>> > User-Agent string, so that it can be easily used to identify bots in the
>> > analytics cluster (no regexps). And link that page from whatever other
>> > pages we think necessary.
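>> >
>> > For illustration (hypothetical tool name and contact details), a
>> > User-Agent following that convention could look like:
>> >
>> >     ExampleTool/1.0 (https://example.org/exampletool; [email protected]) bot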
>> >
>> > Do some advertising and outreach and get some bot maintainers and
>> > maybe some frameworks to implement the User-Agent policy. This would
>> > make the existing policy less useless.
>> >
>> > Thanks all for the feedback!
>> >
>> > On Mon, Feb 1, 2016 at 3:16 PM, Marcel Ruiz Forns <[email protected]>
>> > wrote:
>> >>>
>> >>> Clearly Wikipedia et al. use "bot" to refer to automated software
>> >>> that edits the site, but it seems like you are using the term "bot"
>> >>> to refer to all automated software, and it might be good to clarify.
>> >>
>> >>
>> >> OK, in the documentation we can make that clear. And looking into
>> >> that, I've seen that some bots, in the process of doing their "editing"
>> >> work, can also generate pageviews. So we should also include them as a
>> >> potential source of pageview traffic. Maybe we can reuse the existing
>> >> User-Agent policy.
>> >>
>> >>
>> >>> This makes a lot of sense. If I build a bot that crawls Wikipedia
>> >>> and Facebook public pages, it really doesn't make sense that my bot
>> >>> has a "wikimediaBot" user agent; just the word "Bot" should probably
>> >>> be enough.
>> >>
>> >>
>> >> Totally agree.
>> >>
>> >>
>> >>> I guess a bigger question is why try to differentiate between
>> >>> "spiders" and "bots" at all?
>> >>
>> >>
>> >> I don't think we need to differentiate between "spiders" and "bots".
>> >> The most important question we want to answer is: how much of the
>> >> traffic we consider "human" today is actually "bot"? So, +1 for "bot"
>> >> (case-insensitive).
>> >>
>> >>
>> >> On Fri, Jan 29, 2016 at 9:16 PM, John Mark Vandenberg
>> >> <[email protected]> wrote:
>> >>>
>> >>> On 28 Jan 2016 11:28 pm, "Marcel Ruiz Forns" <[email protected]>
>> >>> wrote:
>> >>> >>
>> >>> >> Why not just "Bot", or "MediaWikiBot", which at least encompasses
>> >>> >> all sites that the client can communicate with?
>> >>> >
>> >>> > I personally agree with you, "MediaWikiBot" seems to have better
>> >>> > semantics.
>> >>>
>> >>> For clients accessing the MediaWiki API, it is redundant.
>> >>> All it does is identify bots that comply with this edict from
>> >>> Analytics.
>> >>>
>> >>> --
>> >>> John Vandenberg
>> >>>
>> >>
>> >> --
>> >> Marcel Ruiz Forns
>> >> Analytics Developer
>> >> Wikimedia Foundation
>> >
>> >
>> >
>> >
>> > --
>> > Marcel Ruiz Forns
>> > Analytics Developer
>> > Wikimedia Foundation
>>
>>
>>
>> --
>> John Vandenberg
>>
>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
