This needs more testing! Validation! Etc. But woo!
https://gerrit.wikimedia.org/r/#/c/180023
<https://gerrit.wikimedia.org/r/#/c/180023>
This let’s you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview as
'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
SELECT
LOWER(uri_host) as uri_host,
count(*) as pageview_count
FROM
wmf_raw.webrequest
WHERE
(webrequest_source = 'text' or webrequest_source = 'mobile')
AND year=2014
AND month=12
AND day=7
AND hour=12
AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
GROUP BY
LOWER(uri_host)
ORDER BY pageview_count desc
LIMIT 10
;
…
uri_host pageview_count
en.wikipedia.org 6613046
en.m.wikipedia.org 3223273
ru.wikipedia.org 2119850
ja.m.wikipedia.org 1501954
ja.wikipedia.org 1411533
de.wikipedia.org 1330252
zh.wikipedia.org 949228
fr.wikipedia.org 939602
commons.wikimedia.org 912965
de.m.wikipedia.org 664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
> On Dec 15, 2014, at 16:02, Dario Taraborelli <[email protected]>
> wrote:
>
> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on
> with the implementation.
>
>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Totally!
>>
>> On 15 December 2014 at 14:22, Andrew Otto <[email protected]
>> <mailto:[email protected]>> wrote:
>> Ah cool, didn’t realize there was a neutral definition. We should call that
>> the ‘formal specification’ then.
>>
>>> ...of course, now that I've said that, cosmic irony demands we end up
>>> implementing in C, or something.
>> Hm, a UDF that does this rather than a Hive query would probably be better.
>> E.g.
>>
>> SELECT
>> request_qualifier(uri_host),
>> count(*)
>> FROM
>> wmf_raw.webrequest
>> WHERE
>> is_pageview(uri_host, uri_path, http_status, content_type)
>> GROUP BY
>> request_qualifier(uri_host)
>> ;
>>
>>
>> Or something like that.
>>
>> -Ao
>>
>>
>>
>>
>>
>>
>>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> It's totally tech-agnostic; the neutral definition is on meta. The hive
>>> query is just because, since we suspect that's how we'll be generating the
>>> data, it makes sense to turn the draft def into HQL for exploratory queries
>>> and testing.
>>>
>>> ...of course, now that I've said that, cosmic irony demands we end up
>>> implementing in C, or something.
>>>
>>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]
>>> <mailto:[email protected]>> wrote:
>>> I think the hive code is "representative" in that it's an implementation.
>>> It's certainly not the only permitted one.
>>>
>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>>> We're moving forward to generate Hive queries that will represent the
>>>>> formal specification.
>>>> Should a specific implementation (e.g. Hive) represent the formal
>>>> specification? I tend to think it should be tech-agnostic, no?
>>>>
>>>>
>>>>
>>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Toby, that's right. We're moving forward to generate Hive queries that
>>>>> will represent the formal specification.
>>>>>
>>>>> -Aaron
>>>>>
>>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> We've written the draft Hive queries and I'm reviewing them with Otto
>>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work
>>>>> it through :).
>>>>>
>>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> Hi Aaron, all --
>>>>>
>>>>> I haven't seen any discussion on this which is a sign that we can forward
>>>>> with turning over the draft. Thoughts?
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Toby
>>>>>
>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> Hey folks,
>>>>>
>>>>> As discussions on the new page view definition have been calming down,
>>>>> we're preparing to deliver a draft version to the Devs. I want to make
>>>>> sure that we all know the status and that any substantial concerns are
>>>>> raised before we hand things off on Friday, Dec 12th.
>>>>>
>>>>> For this phase, we are delivering the general filter[1]. This is the
>>>>> highest level filter, and exists primarily to distinguish requests worthy
>>>>> of further evaluation. Our plan is to take the definition as it exists on
>>>>> the 12th, and begin generating high-level aggregate numbers based on it.
>>>>> In future iterations, we will be digging into different breakdowns of
>>>>> this metric, and iterating on it to handle any inconsistencies or
>>>>> unexpected results. There's a few differences from Web Stat Collector's
>>>>> (WSC) version of the general filter that we want to call to your
>>>>> attention to.
>>>>> We include searches -- WSC explicitly excludes them.
>>>>> We include Apps traffic -- WSC does not detect Apps traffic
>>>>> We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC
>>>>> hardcodes "/wiki/"
>>>>> We don't include Banner impressions -- WSC includes them.
>>>>> There are also some known issues with the new definition that are worth
>>>>> your notice:
>>>>>
>>>>> Internal traffic is counted
>>>>> Note that WSC filters some internal traffic by hardcoding a set of IPs in
>>>>> the definition. We are working on parsing puppet templates in order to
>>>>> automatically detect which IPs represent internal traffic. This will be
>>>>> a /better/ solution, but it's not quite ready yet because parsing puppet
>>>>> is hard.
>>>>> Spider traffic is counted
>>>>> We will be using the User-agent field to detect and flag spider-based
>>>>> traffic. This "tag definition" will be delivered in a subsequent
>>>>> definition. This actually matches WSC, which does not filter spider for
>>>>> the high-level metrics.
>>>>> These are problems we're aware of, and will be factoring in as we go
>>>>> forward with our next task: refining the definition using real,
>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>> Christian.
>>>>>
>>>>> 1. https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>> <https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters>
>>>>>
>>>>> -Aaron & Oliver
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected] <mailto:[email protected]>
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected] <mailto:[email protected]>
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected] <mailto:[email protected]>
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected] <mailto:[email protected]>
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected] <mailto:[email protected]>
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>> _______________________________________________
>> Analytics mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics