Thanks for the clarification, Nuria :)
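
As a rough illustration of the burst detection idea mentioned in the quoted
message below: a minimal sketch, assuming you already have a list of hourly
pageview counts for a single page (from wmf.webrequests or the WSC hourly
files). The function name, the threshold, and the choice of a median/MAD
score are assumptions made for this example, not the method actually used in
the analysis discussed here.

```python
from statistics import median


def find_bursts(hourly_counts, threshold=5.0):
    """Return indices of hours whose count spikes far above the page's baseline.

    Hypothetical helper: scores each hour with a robust z-score built from the
    median and the median absolute deviation (MAD), so that a few huge spikes
    do not inflate the baseline they are compared against.
    """
    counts = list(hourly_counts)
    if not counts:
        return []
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data;
    # the floor of 1.0 (one pageview) avoids dividing by zero on flat series.
    scale = max(1.4826 * mad, 1.0)
    return [i for i, c in enumerate(counts) if (c - med) / scale > threshold]


# Example: one hour with a sudden spike in requests to a single page.
bursts = find_bursts([12, 9, 11, 10, 480, 13, 10, 11])
```

Hours flagged this way could be dropped or capped before aggregating
pageviews; the threshold would need tuning against real traffic.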

On 15 March 2015 at 21:59, Nuria Ruiz <[email protected]> wrote:
>>What I would recommend is using the new data in wmf.webrequests, which
>>gives you, as you say, about 2.5 months, and filtering the user agent;
>>there are a couple of UDFs for user agent detection, including
>>isSpider, which also looks for wikimedia-specific bots that ua-parser
>>ignores.
>
> Just so you know, adding UA/spider parsing to the refined tables is on our
> backlog of immediate tasks. Until then, the data in the refined tables is
> unparsed (UA-wise), but by using the UDFs that Oliver suggested you can
> still benefit from the new definition.
>
>
>
> On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes <[email protected]> wrote:
>>
>> It's not roughly uniform - it varies widely. One of the things I
>> identified in my experimentation with methods for detecting automata
>> is that a lot of "bad-faith" automated traffic - the stuff that is
>> hard to detect even with user agent identification - hits specific
>> pages lots and lots of times, not every page once (although there are
>> some bots that do that). With the WSC data, which is both non-granular
>> and contains no filtering...you're going to have problems.
>>
>> What I would recommend is using the new data in wmf.webrequests, which
>> gives you, as you say, about 2.5 months, and filtering the user agent;
>> there are a couple of UDFs for user agent detection, including
>> isSpider, which also looks for wikimedia-specific bots that ua-parser
>> ignores. There are additional measures and heuristics for identifying
>> traffic that is the result of unbalanced automata, which I'm happy to
>> talk through with you (a mix of burst detection, heuristics around the
>> proportion of traffic to each site version, and concentration
>> measures). The burst detection element, at least, should also be
>> applicable to the WSC data, so if you find a need for a longer
>> timeframe you could always use WSC data but investigate applying that
>> - there are some good frameworks out there for doing so.
>>
>> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
>> > Hi,
>> >
>> >    I'm trying to figure out which of the two pageview definitions we
>> > currently have I can use for a question Bob and I are trying to address.
>> > It would be great if you share your thoughts. If you choose to do so,
>> > please do it by Tuesday, eod, PST.
>> >
>> > More details:
>> >
>> > What are we doing?
>> > We are building an edit recommendation system that identifies the
>> > missing articles in Wikipedia that have a corresponding page in at least
>> > one of the top 50 Wikipedia languages, ranks them, and recommends the
>> > ranked articles to editors who the algorithm assesses as those who may
>> > like to edit the article.
>> >
>> > Where does pageview definition come into play?
>> > When we want to rank missing articles. To do the ranking, we want to
>> > consider the pageviews to the article in the languages the article
>> > exists in, and using this information estimate what the traffic is
>> > expected to be in the language the article is missing in.
>> >
>> > Why does it matter which pageview definition we use?
>> > We would like to use webstatscollector pageview definition since the
>> > hourly data we have based on this definition goes back to roughly
>> > September 2014. If we go with the new pageview definition, we will have
>> > data for the past 2.5 months. The longer period of time we have data
>> > for, the better.
>> >
>> > Why don't you then just use webstatscollector data?
>> > We're inclined to do that but we need to make sure that data works for
>> > the kind of analysis we want to do. Per discussions with Oliver,
>> > webstatscollector data has a lot of pageviews from bots and spiders. The
>> > question is: is the effect of bot/spider traffic, i.e., the number of
>> > pageviews they add to each page, roughly uniform across all pages? If
>> > that is the case, webstatscollector definition will be our choice.
>> >
>> > I appreciate your thoughts on this.
>> >
>> > Best,
>> > Leila
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>
>
>
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

