Thanks for the clarification, Nuria :)

On 15 March 2015 at 21:59, Nuria Ruiz <[email protected]> wrote:
>>What I would recommend is using the new data in wmf.webrequests, which
>>gives you, as you say, about 2.5 months, and filtering the user agent;
>>there are a couple of UDFs for user agent detection, including
>>isSpider, which also looks for wikimedia-specific bots that ua-parser
>>ignores.
>
> So you know adding UA/spider parsing to refined tables is on our backlog of
> immediate tasks to do. Until then the data on refined tables is unparsed
> (ua-wise) but using the UDFS that Oliver suggested you can benefit from the
> new definition.
>
> On Sun, Mar 15, 2015 at 1:05 PM, Oliver Keyes <[email protected]> wrote:
>>
>> It's not roughly uniform - it varies widely. One of the things I
>> identified in my experimentation with methods for detecting automata
>> is that a lot of "bad-faith" automated traffic - the stuff that is
>> hard to detect even with user agent identification - hits specific
>> pages lots and lots of times, not every page once (although there are
>> some bots that do that). With the WSC data, which is both non-granular
>> and contains no filtering...you're going to have problems.
>>
>> What I would recommend is using the new data in wmf.webrequests, which
>> gives you, as you say, about 2.5 months, and filtering the user agent;
>> there are a couple of UDFs for user agent detection, including
>> isSpider, which also looks for wikimedia-specific bots that ua-parser
>> ignores. There are additional measures and heuristics for identifying
>> traffic that is the result of unbalanced automata, which I'm happy to
>> talk through with you (a mix of burst detection, heuristics around the
>> proportion of traffic to each site version, and concentration
>> measures). The burst detection element, at least, should also be
>> applicable to the WSC data, so if you find a need for a longer
>> timeframe you could always use WSC data but investigate applying that
>> - there are some good frameworks out there for doing so.
>>
>> On 15 March 2015 at 14:47, Leila Zia <[email protected]> wrote:
>> > Hi,
>> >
>> > I'm trying to figure out which of the two pageview definitions we
>> > currently have I can use for a question Bob and I are trying to address.
>> > It would be great if you share your thoughts. If you choose to do so,
>> > please do it by Tuesday, eod, PST.
>> >
>> > More details:
>> >
>> > What are we doing?
>> > We are building an edit recommendation system that identifies the missing
>> > articles in Wikipedia that have a corresponding page in at least one of
>> > the top 50 Wikipedia languages, ranks them, and recommends the ranked
>> > articles to editors who the algorithm assesses as those who may like to
>> > edit the article.
>> >
>> > Where does pageview definition come into play?
>> > When we want to rank missing articles. To do the ranking, we want to
>> > consider the pageviews to the article in the languages the article exists
>> > in, and using this information estimate what the traffic is expected to
>> > be in the language the article is missing in.
>> >
>> > Why does it matter which pageview definition we use?
>> > We would like to use webstatscollector pageview definition since the
>> > hourly data we have based on this definition goes back to roughly
>> > September 2014. If we go with the new pageview definition, we will have
>> > data for the past 2.5 months. The longer period of time we have data for,
>> > the better.
>> >
>> > Why don't you then just use webstatscollector data?
>> > We're inclined to do that but we need to make sure that data works for
>> > the kind of analysis we want to do. Per discussions with Oliver,
>> > webstatscollector data has a lot of pageviews from bots and spiders. The
>> > question is: is the effect of bot/spider traffic, i.e., the number of
>> > pageviews they add to each page, roughly uniform across all pages? If
>> > that is the case, webstatscollector definition will be our choice.
>> >
>> > I appreciate your thoughts on this.
>> >
>> > Best,
>> > Leila
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
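
For readers unfamiliar with the terms, here is a minimal sketch of what "burst detection" and a "concentration measure" could look like on hourly per-page counts. It is an illustration only, not the heuristics referred to in the thread: the function names, the median-plus-MAD threshold, the multiplier `k`, and the choice of a Herfindahl-style index are all assumptions made for this example.

```python
from statistics import median


def burst_hours(hourly_views, k=5.0):
    """Return indices of hours whose count exceeds median + k * MAD.

    Hypothetical rule for illustration; k and the MAD floor of 1.0 are
    arbitrary choices, not values taken from the thread.
    """
    med = median(hourly_views)
    # Median absolute deviation: a spread estimate that a single spike
    # cannot inflate the way it inflates a standard deviation.
    mad = median(abs(v - med) for v in hourly_views)
    threshold = med + k * max(mad, 1.0)
    return [i for i, v in enumerate(hourly_views) if v > threshold]


def concentration(counts):
    """Sum of squared shares (Herfindahl-style index) of a list of counts.

    Equals 1/n when the counts are spread evenly over n buckets and 1.0
    when everything falls into a single bucket.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)


# Made-up hourly counts for one page: steady traffic with one spike.
hourly = [10, 12, 11, 9, 500, 10, 11, 12]
spikes = burst_hours(hourly)
even = concentration([1, 1, 1, 1])
skewed = concentration([100, 0, 0, 0])
```

With these made-up numbers, only the hour holding 500 views is flagged, and the concentration index separates evenly spread traffic from traffic that lands in a single bucket. A page whose views are dominated by a few flagged hours is the kind of case where the "roughly uniform" assumption would fail.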
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
