Totally!

On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote:
>
> Ah cool, didn’t realize there was a neutral definition. We should call
> that the ‘formal specification’ then.
>
> ...of course, now that I've said that, cosmic irony demands we end up
> implementing in C, or something.
>
> Hm, a UDF that does this rather than a Hive query would probably be
> better. E.g.
>
> SELECT
>     request_qualifier(uri_host),
>     count(*)
> FROM
>     wmf_raw.webrequest
> WHERE
>     is_pageview(uri_host, uri_path, http_status, content_type)
> GROUP BY
>     request_qualifier(uri_host)
> ;
>
> Or something like that.
>
> -Ao
>
> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
>
> It's totally tech-agnostic; the neutral definition is on meta. The hive
> query is just because, since we suspect that's how we'll be generating the
> data, it makes sense to turn the draft def into HQL for exploratory queries
> and testing.
>
> ...of course, now that I've said that, cosmic irony demands we end up
> implementing in C, or something.
>
> On 15 December 2014 at 13:46, Toby Negrin <[email protected]> wrote:
>>
>> I think the hive code is "representative" in that it's an implementation.
>> It's certainly not the only permitted one.
>>
>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]> wrote:
>>
>> We're moving forward to generate Hive queries that will represent the
>> formal specification.
>>
>> Should a specific implementation (e.g. Hive) represent the formal
>> specification? I tend to think it should be tech-agnostic, no?
>>
>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]>
>> wrote:
>>
>> Toby, that's right. We're moving forward to generate Hive queries that
>> will represent the formal specification.
>>
>> -Aaron
>>
>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> We've written the draft Hive queries and I'm reviewing them with Otto
>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work
>>> it through :).
>>>
>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]> wrote:
>>>>
>>>> Hi Aaron, all --
>>>>
>>>> I haven't seen any discussion on this which is a sign that we can
>>>> forward with turning over the draft. Thoughts?
>>>>
>>>> thanks,
>>>>
>>>> -Toby
>>>>
>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> As discussions on the new page view definition have been calming down,
>>>>> we're preparing to deliver a draft version to the Devs. I want to make
>>>>> sure that we all know the status and that any substantial concerns are
>>>>> raised before we hand things off on *Friday, Dec 12th.*
>>>>>
>>>>> For this phase, we are delivering the general filter[1]. This is the
>>>>> highest level filter, and exists primarily to distinguish requests worthy
>>>>> of further evaluation. Our plan is to take the definition as it exists on
>>>>> the 12th, and begin generating high-level aggregate numbers based on it.
>>>>> In future iterations, we will be digging into different breakdowns of
>>>>> this metric, and iterating on it to handle any inconsistencies or
>>>>> unexpected results. There's a few differences from Web Stat Collector's
>>>>> (WSC) version of the general filter that we want to call to your
>>>>> attention to.
>>>>>
>>>>> - We include searches -- WSC explicitly excludes them.
>>>>> - We include Apps traffic -- WSC does not detect Apps traffic
>>>>> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/)
>>>>>   -- WSC hardcodes "/wiki/"
>>>>> - We don't include Banner impressions -- WSC includes them.
>>>>>
>>>>> There are also some known issues with the new definition that are
>>>>> worth your notice:
>>>>>
>>>>> 1. *Internal traffic is counted*
>>>>>
>>>>>    - Note that WSC filters some internal traffic by hardcoding a set
>>>>>      of IPs in the definition. We are working on parsing puppet
>>>>>      templates in order to automatically detect which IPs represent
>>>>>      internal traffic. This will be a /better/ solution, but it's not
>>>>>      quite ready yet because parsing puppet is hard.
>>>>>
>>>>> 2. *Spider traffic is counted*
>>>>>
>>>>>    - We will be using the User-agent field to detect and flag
>>>>>      spider-based traffic. This "tag definition" will be delivered in a
>>>>>      subsequent definition. This actually matches WSC, which does not
>>>>>      filter spider for the high-level metrics.
>>>>>
>>>>> These are problems we're aware of, and will be factoring in as we go
>>>>> forward with our next task: refining the definition using real,
>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>> Christian.
>>>>>
>>>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>
>>>>> -Aaron & Oliver
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
