>What I would recommend is using the new data in wmf.webrequests, which
>gives you, as you say, about 2.5 months, and filtering the user agent;
>there are a couple of UDFs for user agent detection, including
>isSpider, which also looks for wikimedia-specific bots that ua-parser
>ignores.

Just so you know, adding UA/spider parsing to the refined tables is on our
backlog of immediate tasks. Until then, the data in the refined tables is
unparsed (UA-wise), but by using the UDFs that Oliver suggested you can
still benefit from the new definition.
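
To make the idea of filtering on the user agent concrete, here is a
minimal sketch in Python. It is not the isSpider UDF and does not
reproduce its logic: the pattern below is a hypothetical stand-in that
only looks for generic crawler tokens, and the (page, user_agent) row
shape and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: generic crawler tokens only. The isSpider UDF is
# described as also catching Wikimedia-specific bots; that list is not
# reproduced here.
SPIDER_RE = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def looks_like_spider(user_agent):
    """Return True for an empty/placeholder user agent or a crawler token."""
    if not user_agent or user_agent == "-":
        return True
    return bool(SPIDER_RE.search(user_agent))


def human_pageviews(rows):
    """Count pageviews per page, skipping rows flagged as spiders.

    rows is an iterable of (page, user_agent) pairs (an assumed shape).
    """
    counts = Counter()
    for page, user_agent in rows:
        if not looks_like_spider(user_agent):
            counts[page] += 1
    return counts


sample = [
    ("Main_Page", "Mozilla/5.0 (X11; Linux x86_64) Firefox/36.0"),
    ("Main_Page", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
    ("Foo", "ExampleCrawler/1.0"),
    ("Foo", "Mozilla/5.0 (Windows NT 6.1) Chrome/41.0"),
]
print(human_pageviews(sample))
```

A token match like this will miss the "bad-faith" automated traffic
Oliver describes, which is why he suggests combining user agent
filtering with burst detection and concentration measures.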



On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes <[email protected]> wrote:

> It's not roughly uniform - it varies widely. One of the things I
> identified in my experimentation with methods for detecting automata
> is that a lot of "bad-faith" automated traffic - the stuff that is
> hard to detect even with user agent identification - hits specific
> pages lots and lots of times, not every page once (although there are
> some bots that do that). With the WSC data, which is both non-granular
> and contains no filtering...you're going to have problems.
>
> What I would recommend is using the new data in wmf.webrequests, which
> gives you, as you say, about 2.5 months, and filtering the user agent;
> there are a couple of UDFs for user agent detection, including
> isSpider, which also looks for wikimedia-specific bots that ua-parser
> ignores. There are additional measures and heuristics for identifying
> traffic that is the result of unbalanced automata, which I'm happy to
> talk through with you (a mix of burst detection, heuristics around the
> proportion of traffic to each site version, and concentration
> measures). The burst detection element, at least, should also be
> applicable to the WSC data, so if you find a need for a longer
> timeframe you could always use WSC data but investigate applying that
> - there are some good frameworks out there for doing so.
>
> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
> > Hi,
> >
> >    I'm trying to figure out which of the two pageview definitions we
> > currently have I can use for a question Bob and I are trying to address.
> > It would be great if you share your thoughts. If you choose to do so,
> > please do it by Tuesday, eod, PST.
> >
> > More details:
> >
> > What are we doing?
> > We are building an edit recommendation system that identifies the missing
> > articles in Wikipedia that have a corresponding page in at least one of
> > the top 50 Wikipedia languages, ranks them, and recommends the ranked
> > articles to editors who the algorithm assesses as those who may like to
> > edit the article.
> >
> > Where does pageview definition come into play?
> > When we want to rank missing articles. To do the ranking, we want to
> > consider the pageviews to the article in the languages the article exists
> > in, and using this information estimate what the traffic is expected to
> > be in the language the article is missing in.
> >
> > Why does it matter which pageview definition we use?
> > We would like to use webstatscollector pageview definition since the
> > hourly data we have based on this definition goes back to roughly
> > September 2014. If we go with the new pageview definition, we will have
> > data for the past 2.5 months. The longer period of time we have data for,
> > the better.
> >
> > Why don't you then just use webstatscollector data?
> > We're inclined to do that but we need to make sure that data works for
> > the kind of analysis we want to do. Per discussions with Oliver,
> > webstatscollector data has a lot of pageviews from bots and spiders. The
> > question is: is the effect of bot/spider traffic, i.e., the number of
> > pageviews they add to each page, roughly uniform across all pages? If
> > that is the case, webstatscollector definition will be our choice.
> >
> > I appreciate your thoughts on this.
> >
> > Best,
> > Leila
> >
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
