> which gives you, as you say, about 2.5 months

Last week we started deleting older wmf.webrequest data. We currently keep 62 days.

> On Mar 15, 2015, at 16:05, Oliver Keyes <[email protected]> wrote:
>
> It's not roughly uniform - it varies widely. One of the things I
> identified in my experimentation with methods for detecting automata
> is that a lot of "bad-faith" automated traffic - the stuff that is
> hard to detect even with user agent identification - hits specific
> pages lots and lots of times, not every page once (although there are
> some bots that do that). With the WSC data, which is both non-granular
> and contains no filtering...you're going to have problems.
>
> What I would recommend is using the new data in wmf.webrequests, which
> gives you, as you say, about 2.5 months, and filtering the user agent;
> there are a couple of UDFs for user agent detection, including
> isSpider, which also looks for wikimedia-specific bots that ua-parser
> ignores. There are additional measures and heuristics for identifying
> traffic that is the result of unbalanced automata, which I'm happy to
> talk through with you (a mix of burst detection, heuristics around the
> proportion of traffic to each site version, and concentration
> measures). The burst detection element, at least, should also be
> applicable to the WSC data, so if you find a need for a longer
> timeframe you could always use WSC data but investigate applying that
> - there are some good frameworks out there for doing so.
>
> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
>> Hi,
>>
>> I'm trying to figure out which of the two pageview definitions we
>> currently have I can use for a question Bob and I are trying to address. It
>> would be great if you share your thoughts. If you choose to do so, please do
>> it by Tuesday, eod, PST.
>>
>> More details:
>>
>> What are we doing?
>> We are building an edit recommendation system that identifies the missing
>> articles in Wikipedia that have a corresponding page in at least one of the
>> top 50 Wikipedia languages, ranks them, and recommends the ranked articles
>> to editors who the algorithm assesses as those who may like to edit the
>> article.
>>
>> Where does pageview definition come into play?
>> When we want to rank missing articles. To do the ranking, we want to
>> consider the pageviews to the article in the languages the article exists
>> in, and using this information estimate what the traffic is expected to be
>> in the language the article is missing in.
>>
>> Why does it matter which pageview definition we use?
>> We would like to use webstatscollector pageview definition since the hourly
>> data we have based on this definition goes back to roughly September 2014.
>> If we go with the new pageview definition, we will have data for the past
>> 2.5 months. The longer period of time we have data for, the better.
>>
>> Why don't you then just use webstatscollector data?
>> We're inclined to do that but we need to make sure that data works for the
>> kind of analysis we want to do. Per discussions with Oliver,
>> webstatscollector data has a lot of pageviews from bots and spiders. The
>> question is: is the effect of bot/spider traffic, i.e., the number of
>> pageviews they add to each page, roughly uniform across all pages? If that
>> is the case, webstatscollector definition will be our choice.
>>
>> I appreciate your thoughts on this.
>>
>> Best,
>> Leila
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
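
The thread mentions "concentration measures" as one heuristic for spotting
automated traffic that hits a few pages many times. The sketch below is one
possible reading of that idea, not the method actually used at WMF: it computes
a Herfindahl-style index per user agent, which is 1.0 when every request from
that agent goes to a single page and approaches 0 as requests spread evenly
over many pages. The function name, the (user_agent, page) input shape, and the
choice of index are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def concentration_by_agent(requests):
    """Return {user_agent: concentration index} for (user_agent, page) pairs.

    Hypothetical helper: the index is the sum of squared per-page request
    shares, so it ranges from 1/n (evenly spread over n pages) up to 1.0
    (all requests on one page).
    """
    # Count requests per page, separately for each user agent.
    counts = defaultdict(Counter)
    for user_agent, page in requests:
        counts[user_agent][page] += 1

    result = {}
    for user_agent, pages in counts.items():
        total = sum(pages.values())
        # Sum of squared shares of this agent's traffic going to each page.
        result[user_agent] = sum((n / total) ** 2 for n in pages.values())
    return result
```

An agent with a high index and a large request count would be a candidate for
closer inspection; any threshold would have to be chosen from the data, and
the index alone says nothing about bots that request every page once.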
