> What I would recommend is using the new data in wmf.webrequests, which
> gives you, as you say, about 2.5 months, and filtering the user agent;
> there are a couple of UDFs for user agent detection, including
> isSpider, which also looks for wikimedia-specific bots that ua-parser
> ignores.

Just so you know, adding UA/spider parsing to the refined tables is on our
backlog of immediate tasks. Until then, the data in the refined tables is
unparsed (UA-wise), but by using the UDFs that Oliver suggested you can
still benefit from the new definition.

On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes <[email protected]> wrote:

> It's not roughly uniform - it varies widely. One of the things I
> identified in my experimentation with methods for detecting automata
> is that a lot of "bad-faith" automated traffic - the stuff that is
> hard to detect even with user agent identification - hits specific
> pages lots and lots of times, not every page once (although there are
> some bots that do that). With the WSC data, which is both non-granular
> and contains no filtering...you're going to have problems.
>
> What I would recommend is using the new data in wmf.webrequests, which
> gives you, as you say, about 2.5 months, and filtering the user agent;
> there are a couple of UDFs for user agent detection, including
> isSpider, which also looks for wikimedia-specific bots that ua-parser
> ignores. There are additional measures and heuristics for identifying
> traffic that is the result of unbalanced automata, which I'm happy to
> talk through with you (a mix of burst detection, heuristics around the
> proportion of traffic to each site version, and concentration
> measures). The burst detection element, at least, should also be
> applicable to the WSC data, so if you find a need for a longer
> timeframe you could always use WSC data but investigate applying that
> - there are some good frameworks out there for doing so.
>
> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
> > Hi,
> >
> > I'm trying to figure out which of the two pageview definitions we
> > currently have I can use for a question Bob and I are trying to address. It
> > would be great if you share your thoughts. If you choose to do so, please do
> > it by Tuesday, eod, PST.
> >
> > More details:
> >
> > What are we doing?
> > We are building an edit recommendation system that identifies the missing
> > articles in Wikipedia that have a corresponding page in at least one of the
> > top 50 Wikipedia languages, ranks them, and recommends the ranked articles
> > to editors who the algorithm assesses as those who may like to edit the
> > article.
> >
> > Where does pageview definition come into play?
> > When we want to rank missing articles. To do the ranking, we want to
> > consider the pageviews to the article in the languages the article exists
> > in, and using this information estimate what the traffic is expected to be
> > in the language the article is missing in.
> >
> > Why does it matter which pageview definition we use?
> > We would like to use webstatscollector pageview definition since the hourly
> > data we have based on this definition goes back to roughly September 2014.
> > If we go with the new pageview definition, we will have data for the past
> > 2.5 months. The longer period of time we have data for, the better.
> >
> > Why don't you then just use webstatscollector data?
> > We're inclined to do that but we need to make sure that data works for the
> > kind of analysis we want to do. Per discussions with Oliver,
> > webstatscollector data has a lot of pageviews from bots and spiders. The
> > question is: is the effect of bot/spider traffic, i.e., the number of
> > pageviews they add to each page, roughly uniform across all pages? If that
> > is the case, webstatscollector definition will be our choice.
> >
> > I appreciate your thoughts on this.
> >
> > Best,
> > Leila
> >
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> >
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
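
P.S. As a rough illustration of the "concentration measures" Oliver
mentions, here is a minimal Python sketch of one possible approach. It is
an assumption on my part, not necessarily the method Oliver uses: it
computes a Herfindahl-Hirschman index of views per client for each page
and flags pages where a few clients account for most of the traffic. The
(page, client_id) input shape, the function names, and both thresholds
are hypothetical, chosen only for the example.

```python
from collections import Counter, defaultdict


def concentration_by_page(requests):
    """Return {page: (views, hhi)} for an iterable of (page, client_id) pairs.

    hhi is the Herfindahl-Hirschman index of per-client view shares:
    1.0 means a single client produced every view of the page, while
    values near 0 mean the views are spread over many clients.
    client_id is a hypothetical identifier (for example a hash of IP
    and user agent); it is not a real field of any table.
    """
    per_page = defaultdict(Counter)
    for page, client_id in requests:
        per_page[page][client_id] += 1

    stats = {}
    for page, per_client in per_page.items():
        views = sum(per_client.values())
        hhi = sum((n / views) ** 2 for n in per_client.values())
        stats[page] = (views, hhi)
    return stats


def flag_concentrated(requests, threshold=0.5, min_views=10):
    """List pages whose traffic is dominated by few clients.

    threshold and min_views are illustrative defaults, not tuned values.
    Pages with fewer than min_views views are skipped, because a
    low-traffic page trivially has a high index.
    """
    stats = concentration_by_page(requests)
    return sorted(
        page
        for page, (views, hhi) in stats.items()
        if views >= min_views and hhi >= threshold
    )
```

If the flagged pages turn out to be a small, identifiable set, that would
be one more sign that bot traffic is not spread uniformly across pages.
Note that this needs per-request data, so it would apply to the new
webrequest data rather than to the hourly WSC aggregates.
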
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
