We've written the draft Hive queries and I'm reviewing them with Otto now. We're currently blocked on Hadoop heap size issues, but I'm sure we'll work through them :).
On 15 December 2014 at 12:10, Toby Negrin <[email protected]> wrote:
>
> Hi Aaron, all --
>
> I haven't seen any discussion on this, which is a sign that we can move
> forward with turning over the draft. Thoughts?
>
> thanks,
>
> -Toby
>
> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <[email protected]>
> wrote:
>
>> Hey folks,
>>
>> As discussions on the new page view definition have been calming down,
>> we're preparing to deliver a draft version to the Devs. I want to make
>> sure that we all know the status and that any substantial concerns are
>> raised before we hand things off on *Friday, Dec 12th.*
>>
>> For this phase, we are delivering the general filter[1]. This is the
>> highest level filter, and exists primarily to distinguish requests worthy
>> of further evaluation. Our plan is to take the definition as it exists on
>> the 12th, and begin generating high-level aggregate numbers based on it. In
>> future iterations, we will be digging into different breakdowns of this
>> metric, and iterating on it to handle any inconsistencies or unexpected
>> results. There are a few differences from Web Stat Collector's (WSC) version
>> of the general filter that we want to call to your attention.
>>
>>    - We include searches -- WSC explicitly excludes them.
>>    - We include Apps traffic -- WSC does not detect Apps traffic.
>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) --
>>      WSC hardcodes "/wiki/".
>>    - We don't include Banner impressions -- WSC includes them.
>>
>> There are also some known issues with the new definition that are worth
>> your notice:
>>
>> 1. *Internal traffic is counted*
>>
>>    - Note that WSC filters some internal traffic by hardcoding a set of
>>      IPs in the definition. We are working on parsing puppet templates
>>      in order to automatically detect which IPs represent internal
>>      traffic. This will be a /better/ solution, but it's not quite ready
>>      yet because parsing puppet is hard.
>>
>> 2. *Spider traffic is counted*
>>
>>    - We will be using the User-agent field to detect and flag
>>      spider-based traffic. This "tag definition" will be delivered in a
>>      subsequent definition. This actually matches WSC, which does not
>>      filter spiders for the high-level metrics.
>>
>> These are problems we're aware of, and will be factoring in as we go
>> forward with our next task: refining the definition using real,
>> hourly-level traffic data. Thanks to everyone who has given feedback and
>> participated in the process thus far, particularly Nemo, Erik, and
>> Christian.
>>
>> [1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>
>> -Aaron & Oliver
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
