> which gives you, as you say, about 2.5 months

Last week we started deleting older wmf.webrequest data.  We currently keep 62 
days.
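
For reference, the approach Oliver suggests below — filter on the user agent with a spider check, then apply burst detection to per-page request counts — can be sketched roughly like this. This is a minimal illustrative sketch, not the actual isSpider UDF; the regex and the threshold are assumptions for demonstration only:

```python
import re
from collections import Counter

# Hypothetical stand-in for the isSpider-style check discussed below:
# catches classic crawler UA strings plus Wikimedia-style bot names that
# generic ua-parser rules can miss. Patterns here are illustrative only.
SPIDER_RE = re.compile(r"(?i)(bot|crawler|spider|slurp|wikimedia)\b")

def looks_like_spider(user_agent: str) -> bool:
    """Rough user-agent spider check in the spirit of isSpider."""
    return bool(SPIDER_RE.search(user_agent or ""))

def bursty_pages(requests, threshold=100):
    """Crude burst heuristic: given (page, user_agent) pairs from one time
    window, return pages whose non-spider request count exceeds the
    threshold -- candidates for traffic from unfiltered automata."""
    counts = Counter(page for page, ua in requests
                     if not looks_like_spider(ua))
    return {page for page, n in counts.items() if n > threshold}
```

In practice this filtering would happen in Hive against wmf.webrequest rather than in Python, but the logic is the same: drop self-identified spiders first, then look for pages whose residual request volume is anomalously concentrated.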



> On Mar 15, 2015, at 16:05, Oliver Keyes <[email protected]> wrote:
> 
> It's not roughly uniform - it varies widely. One of the things I
> identified in my experimentation with methods for detecting automata
> is that a lot of "bad-faith" automated traffic - the stuff that is
> hard to detect even with user agent identification - hits specific
> pages lots and lots of times, not every page once (although there are
> some bots that do that). With the WSC data, which is both non-granular
> and unfiltered... you're going to have problems.
> 
> What I would recommend is using the new data in wmf.webrequest, which
> gives you, as you say, about 2.5 months, and filtering the user agent;
> there are a couple of UDFs for user agent detection, including
> isSpider, which also looks for wikimedia-specific bots that ua-parser
> ignores. There are additional measures and heuristics for identifying
> traffic that is the result of unbalanced automata, which I'm happy to
> talk through with you (a mix of burst detection, heuristics around the
> proportion of traffic to each site version, and concentration
> measures). The burst detection element, at least, should also be
> applicable to the WSC data, so if you find a need for a longer
> timeframe you could always use WSC data but investigate applying that
> - there are some good frameworks out there for doing so.
> 
> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
>> Hi,
>> 
>>   I'm trying to figure out which of the two pageview definitions we
>> currently have I can use for a question Bob and I are trying to address. It
>> would be great if you could share your thoughts; if you choose to do so,
>> please reply by Tuesday, EOD PST.
>> 
>> More details:
>> 
>> What are we doing?
>> We are building an edit recommendation system that identifies articles
>> missing from a Wikipedia that have a corresponding page in at least one of
>> the top 50 Wikipedia languages, ranks them, and recommends the ranked
>> articles to editors the algorithm assesses as likely to want to edit them.
>> 
>> Where does pageview definition come into play?
>> When we want to rank missing articles. To do the ranking, we want to
>> consider the pageviews to the article in the languages it exists in, and
>> use this information to estimate the traffic the article can expect in the
>> language it is missing from.
>> 
>> Why does it matter which pageview definition we use?
>> We would like to use the webstatscollector pageview definition, since the
>> hourly data we have based on this definition goes back to roughly September
>> 2014. If we go with the new pageview definition, we will have data for only
>> the past 2.5 months. The longer the period we have data for, the better.
>> 
>> Why don't you then just use webstatscollector data?
>> We're inclined to do that, but we need to make sure that data works for the
>> kind of analysis we want to do. Per discussions with Oliver, the
>> webstatscollector data includes a lot of pageviews from bots and spiders.
>> The question is: is the effect of bot/spider traffic, i.e., the number of
>> pageviews they add to each page, roughly uniform across all pages? If so,
>> the webstatscollector definition will be our choice.
>> 
>> I appreciate your thoughts on this.
>> 
>> Best,
>> Leila
>> 
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> 
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 

