Note that Andrew's example code doesn't run (at least, for me) because it
needs to be:

CREATE TEMPORARY FUNCTION is_pageview as
"org.wikimedia.analytics.refinery.hive.IsPageviewUDF";

Hive gets stupider every time I try to use it ;p

On 15 December 2014 at 20:47, Oliver Keyes <[email protected]> wrote:
>
> Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll
> want the output fired over to limn.
>
> On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote:
>>
>> This needs more testing!  Validation!  Etc.  But woo!
>> https://gerrit.wikimedia.org/r/#/c/180023
>>
>> This let’s you do:
>>
>>
>>
>> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
>>
>> CREATE TEMPORARY FUNCTION is_pageview as
>> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
>>
>>   SELECT
>>     LOWER(uri_host) as uri_host,
>>     count(*) as pageview_count
>>   FROM
>>     wmf_raw.webrequest
>>   WHERE
>>    (webrequest_source = 'text' or webrequest_source = 'mobile')
>>     AND year=2014
>>     AND month=12
>>     AND day=7
>>     AND hour=12
>>     AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
>>   GROUP BY
>>     LOWER(uri_host)
>>   ORDER BY pageview_count desc
>>   LIMIT 10
>> ;
>>
>> …
>>
>> uri_host         pageview_count
>>
>> en.wikipedia.org 6613046
>> en.m.wikipedia.org 3223273
>> ru.wikipedia.org 2119850
>> ja.m.wikipedia.org 1501954
>> ja.wikipedia.org 1411533
>> de.wikipedia.org 1330252
>> zh.wikipedia.org 949228
>> fr.wikipedia.org 939602
>> commons.wikimedia.org 912965
>> de.m.wikipedia.org 664661
>>
>> Time taken: 94.295 seconds, Fetched: 10 row(s)
>>
>>
>>
>> On Dec 15, 2014, at 16:02, Dario Taraborelli <[email protected]>
>> wrote:
>>
>> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving
>> on with the implementation.
>>
>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]> wrote:
>>
>> Totally!
>>
>> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote:
>>>
>>> Ah cool, didn’t realize there was a neutral definition.  We should call
>>> that the ‘formal specification’ then.
>>>
>>> ...of course, now that I've said that, cosmic irony demands we end up
>>> implementing in C, or something.
>>>
>>> Hm, a UDF that does this rather than a Hive query would probably be
>>> better.  E.g.
>>>
>>>   SELECT
>>>     request_qualifier(uri_host),
>>>     count(*)
>>>   FROM
>>>     wmf_raw.webrequest
>>>   WHERE
>>>     is_pageview(uri_host, uri_path, http_status, content_type)
>>>   GROUP BY
>>>     request_qualifier(uri_host)
>>>   ;
>>>
>>>
>>> Or something like that.
>>>
>>> -Ao
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
>>>
>>> It's totally tech-agnostic; the neutral definition is on meta. The hive
>>> query is just because, since we suspect that's how we'll be generating the
>>> data, it makes sense to turn the draft def into HQL for exploratory queries
>>> and testing.
>>>
>>> ...of course, now that I've said that, cosmic irony demands we end up
>>> implementing in C, or something.
>>>
>>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]> wrote:
>>>>
>>>> I think the hive code is "representative" in that it's an
>>>> implementation. It's certainly not the only permitted one.
>>>>
>>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]> wrote:
>>>>
>>>>  We're moving forward to generate Hive queries that will represent the
>>>> formal specification.
>>>>
>>>> Should a specific implementation (e.g. Hive) represent the formal
>>>> specification?  I tend to think it should be tech-agnostic, no?
>>>>
>>>>
>>>>
>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]>
>>>> wrote:
>>>>
>>>> Toby, that's right.  We're moving forward to generate Hive queries that
>>>> will represent the formal specification.
>>>>
>>>> -Aaron
>>>>
>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]>
>>>> wrote:
>>>>
>>>>> We've written the draft Hive queries and I'm reviewing them with Otto
>>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work
>>>>> it through :).
>>>>>
>>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Aaron, all --
>>>>>>
>>>>>> I haven't seen any discussion on this which is a sign that we can
>>>>>> forward with turning over the draft. Thoughts?
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> -Toby
>>>>>>
>>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> As discussions on the new page view definition have been calming
>>>>>>> down, we're preparing to deliver a draft version to the Devs.  I want to
>>>>>>> make sure that we all know the status and that any substantial concerns 
>>>>>>> are
>>>>>>> raised before we hand things off on *Friday, Dec 12th.*
>>>>>>>
>>>>>>> For this phase, we are delivering the general filter[1].  This is
>>>>>>> the highest level filter, and exists primarily to distinguish requests
>>>>>>> worthy of further evaluation. Our plan is to take the definition as it
>>>>>>> exists on the 12th, and begin generating high-level aggregate numbers 
>>>>>>> based
>>>>>>> on it. In future iterations, we will be digging into different 
>>>>>>> breakdowns
>>>>>>> of this metric, and iterating on it to handle any inconsistencies or
>>>>>>> unexpected results.  There's a few differences from Web Stat Collector's
>>>>>>> (WSC) version of the general filter that we want to call to your 
>>>>>>> attention
>>>>>>> to.
>>>>>>>
>>>>>>>    - We include searches -- WSC explicitly excludes them.
>>>>>>>    - We include Apps traffic -- WSC does not detect Apps traffic
>>>>>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/)
>>>>>>>    -- WSC hardcodes "/wiki/"
>>>>>>>    - We don't include Banner impressions -- WSC includes them.
>>>>>>>
>>>>>>> There are also some known issues with the new definition that are
>>>>>>> worth your notice:
>>>>>>>
>>>>>>>
>>>>>>>    1. *Internal traffic is counted*
>>>>>>>
>>>>>>>
>>>>>>>    - Note that WSC filters some internal traffic by hardcoding a
>>>>>>>    set of IPs in the definition.  We are working on parsing puppet 
>>>>>>> templates
>>>>>>>    in order to automatically detect which IPs represent internal 
>>>>>>> traffic.
>>>>>>>    This will be a /better/ solution, but it's not quite ready yet 
>>>>>>> because
>>>>>>>    parsing puppet is hard.
>>>>>>>
>>>>>>>
>>>>>>>    1. *Spider traffic is counted*
>>>>>>>
>>>>>>>
>>>>>>>    - We will be using the User-agent field to detect and flag
>>>>>>>    spider-based traffic.  This "tag definition" will be delivered in a
>>>>>>>    subsequent definition.  This actually matches WSC, which does not 
>>>>>>> filter
>>>>>>>    spider for the high-level metrics.
>>>>>>>
>>>>>>> These are problems we're aware of, and will be factoring in as we go
>>>>>>> forward with our next task: refining the definition using real,
>>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>>>> Christian.
>>>>>>>
>>>>>>> 1.
>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>>
>>>>>>> -Aaron & Oliver
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>  _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>  _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>


-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to