Ah cool, didn’t realize there was a neutral definition.  We should call that 
the ‘formal specification’ then.

> ...of course, now that I've said that, cosmic irony demands we end up 
> implementing in C, or something.

Hm, a UDF that does this would probably be better than a bare Hive query.  
For example:

  SELECT
    request_qualifier(uri_host),
    count(*)
  FROM
    wmf_raw.webrequest
  WHERE
    is_pageview(uri_host, uri_path, http_status, content_type)
  GROUP BY
    request_qualifier(uri_host)
  ;


Or something like that.
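
To make the shape of that concrete, here is a rough Python sketch of what the two functions and the aggregation might do. Everything in it is an assumption for illustration: the host check, the status and content-type rules, the list of path variants, and the idea that request_qualifier splits mobile from desktop are all placeholders, not the definition on meta. A real Hive UDF would also be written in Java rather than Python.

```python
import re
from collections import Counter

# Illustrative placeholder rules only; the real definition lives on meta.
# The thread mentions /wiki/ plus variants such as /zh-tw/, /zh-cn/, /sr-ec/.
PATH_PATTERN = re.compile(r"^/(wiki|zh-tw|zh-cn|sr-ec)/")
OK_STATUSES = {"200", "304"}        # assumed set of acceptable statuses
OK_CONTENT_TYPES = ("text/html",)   # assumed acceptable content types


def is_pageview(uri_host, uri_path, http_status, content_type):
    """Toy stand-in for the is_pageview UDF in the query above."""
    return (
        uri_host.endswith(".wikipedia.org")
        and PATH_PATTERN.match(uri_path) is not None
        and http_status in OK_STATUSES
        and content_type.startswith(OK_CONTENT_TYPES)
    )


def request_qualifier(uri_host):
    """Hypothetical: label a host as 'mobile' or 'desktop'."""
    return "mobile" if ".m." in uri_host else "desktop"


def count_pageviews(rows):
    """Mirror the query: filter with is_pageview, group by qualifier, count."""
    counts = Counter()
    for row in rows:
        if is_pageview(row["uri_host"], row["uri_path"],
                       row["http_status"], row["content_type"]):
            counts[request_qualifier(row["uri_host"])] += 1
    return counts
```

The point is only that the filter logic sits in one named function, so the query (or any other implementation) calls it instead of repeating the rules inline.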

-Ao






> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
> 
> It's totally tech-agnostic; the neutral definition is on meta. The hive query 
> is just because, since we suspect that's how we'll be generating the data, it 
> makes sense to turn the draft def into HQL for exploratory queries and 
> testing.
> 
> ...of course, now that I've said that, cosmic irony demands we end up 
> implementing in C, or something.
> 
> On 15 December 2014 at 13:46, Toby Negrin <[email protected] 
> <mailto:[email protected]>> wrote:
> I think the hive code is "representative" in that it's an implementation. 
> It's certainly not the only permitted one. 
> 
> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>>>  We're moving forward to generate Hive queries that will represent the 
>>> formal specification.
>> Should a specific implementation (e.g. Hive) represent the formal 
>> specification?  I tend to think it should be tech-agnostic, no?
>> 
>> 
>> 
>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Toby, that's right.  We're moving forward to generate Hive queries that 
>>> will represent the formal specification.  
>>> 
>>> -Aaron
>>> 
>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> We've written the draft Hive queries and I'm reviewing them with Otto now. 
>>> Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it 
>>> through :).
>>> 
>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Hi Aaron, all --
>>> 
>>> I haven't seen any discussion on this, which is a sign that we can move 
>>> forward with turning over the draft. Thoughts?
>>> 
>>> thanks,
>>> 
>>> -Toby
>>> 
>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Hey folks,
>>> 
>>> As discussions on the new page view definition have been calming down, 
>>> we're preparing to deliver a draft version to the Devs.  I want to make 
>>> sure that we all know the status and that any substantial concerns are 
>>> raised before we hand things off on Friday, Dec 12th.
>>> 
>>> For this phase, we are delivering the general filter[1].  This is the 
>>> highest level filter, and exists primarily to distinguish requests worthy 
>>> of further evaluation. Our plan is to take the definition as it exists on 
>>> the 12th, and begin generating high-level aggregate numbers based on it. In 
>>> future iterations, we will be digging into different breakdowns of this 
>>> metric, and iterating on it to handle any inconsistencies or unexpected 
>>> results.  There are a few differences from Web Stat Collector's (WSC) 
>>> version of the general filter that we want to call to your attention:
>>> 
>>> * We include searches -- WSC explicitly excludes them.
>>> * We include Apps traffic -- WSC does not detect Apps traffic.
>>> * We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC 
>>>   hardcodes "/wiki/".
>>> * We don't include Banner impressions -- WSC includes them.
>>> 
>>> There are also some known issues with the new definition that are worth 
>>> your notice:
>>> 
>>> * Internal traffic is counted
>>>   Note that WSC filters some internal traffic by hardcoding a set of IPs in 
>>>   the definition.  We are working on parsing puppet templates in order to 
>>>   automatically detect which IPs represent internal traffic.  This will be a 
>>>   /better/ solution, but it's not quite ready yet because parsing puppet is 
>>>   hard.
>>> * Spider traffic is counted
>>>   We will be using the User-agent field to detect and flag spider-based 
>>>   traffic.  This "tag definition" will be delivered in a subsequent 
>>>   definition.  This actually matches WSC, which does not filter spider for 
>>>   the high-level metrics.
>>> 
>>> These are problems we're aware of, and will be factoring in as we go 
>>> forward with our next task: refining the definition using real, 
>>> hourly-level traffic data. Thanks to everyone who has given feedback and 
>>> participated in the process thus far, particularly Nemo, Erik, and 
>>> Christian.
>>> 
>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>> 
>>> -Aaron & Oliver
>>> 
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>> 
>>> 
>>> -- 
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
