Note that Andrew's example code doesn't run (at least, for me) because it needs to be:
CREATE TEMPORARY FUNCTION is_pageview as "org.wikimedia.analytics.refinery.hive.IsPageviewUDF"; Hive gets stupider every time I try to use it ;p On 15 December 2014 at 20:47, Oliver Keyes <[email protected]> wrote: > > Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll > want the output fired over to limn. > > On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote: >> >> This needs more testing! Validation! Etc. But woo! >> https://gerrit.wikimedia.org/r/#/c/180023 >> >> This let’s you do: >> >> >> >> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar; >> >> CREATE TEMPORARY FUNCTION is_pageview as >> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’; >> >> SELECT >> LOWER(uri_host) as uri_host, >> count(*) as pageview_count >> FROM >> wmf_raw.webrequest >> WHERE >> (webrequest_source = 'text' or webrequest_source = 'mobile') >> AND year=2014 >> AND month=12 >> AND day=7 >> AND hour=12 >> AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type) >> GROUP BY >> LOWER(uri_host) >> ORDER BY pageview_count desc >> LIMIT 10 >> ; >> >> … >> >> uri_host pageview_count >> >> en.wikipedia.org 6613046 >> en.m.wikipedia.org 3223273 >> ru.wikipedia.org 2119850 >> ja.m.wikipedia.org 1501954 >> ja.wikipedia.org 1411533 >> de.wikipedia.org 1330252 >> zh.wikipedia.org 949228 >> fr.wikipedia.org 939602 >> commons.wikimedia.org 912965 >> de.m.wikipedia.org 664661 >> >> Time taken: 94.295 seconds, Fetched: 10 row(s) >> >> >> >> On Dec 15, 2014, at 16:02, Dario Taraborelli <[email protected]> >> wrote: >> >> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving >> on with the implementation. >> >> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]> wrote: >> >> Totally! >> >> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote: >>> >>> Ah cool, didn’t realize there was a neutral definition. We should call >>> that the ‘formal specification’ then. >>> >>> ...of course, now that I've said that, cosmic irony demands we end up >>> implementing in C, or something. >>> >>> Hm, a UDF that does this rather than a Hive query would probably be >>> better. E.g. >>> >>> SELECT >>> request_qualifier(uri_host), >>> count(*) >>> FROM >>> wmf_raw.webrequest >>> WHERE >>> is_pageview(uri_host, uri_path, http_status, content_type) >>> GROUP BY >>> request_qualifier(uri_host) >>> ; >>> >>> >>> Or something like that. >>> >>> -Ao >>> >>> >>> >>> >>> >>> >>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote: >>> >>> It's totally tech-agnostic; the neutral definition is on meta. The hive >>> query is just because, since we suspect that's how we'll be generating the >>> data, it makes sense to turn the draft def into HQL for exploratory queries >>> and testing. >>> >>> ...of course, now that I've said that, cosmic irony demands we end up >>> implementing in C, or something. >>> >>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]> wrote: >>>> >>>> I think the hive code is "representative" in that it's an >>>> implementation. It's certainly not the only permitted one. >>>> >>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]> wrote: >>>> >>>> We're moving forward to generate Hive queries that will represent the >>>> formal specification. >>>> >>>> Should a specific implementation (e.g. Hive) represent the formal >>>> specification? I tend to think it should be tech-agnostic, no? >>>> >>>> >>>> >>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]> >>>> wrote: >>>> >>>> Toby, that's right. We're moving forward to generate Hive queries that >>>> will represent the formal specification. >>>> >>>> -Aaron >>>> >>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]> >>>> wrote: >>>> >>>>> We've written the draft Hive queries and I'm reviewing them with Otto >>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work >>>>> it through :). >>>>> >>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]> >>>>> wrote: >>>>>> >>>>>> Hi Aaron, all -- >>>>>> >>>>>> I haven't seen any discussion on this which is a sign that we can >>>>>> forward with turning over the draft. Thoughts? >>>>>> >>>>>> thanks, >>>>>> >>>>>> -Toby >>>>>> >>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hey folks, >>>>>>> >>>>>>> As discussions on the new page view definition have been calming >>>>>>> down, we're preparing to deliver a draft version to the Devs. I want to >>>>>>> make sure that we all know the status and that any substantial concerns >>>>>>> are >>>>>>> raised before we hand things off on *Friday, Dec 12th.* >>>>>>> >>>>>>> For this phase, we are delivering the general filter[1]. This is >>>>>>> the highest level filter, and exists primarily to distinguish requests >>>>>>> worthy of further evaluation. Our plan is to take the definition as it >>>>>>> exists on the 12th, and begin generating high-level aggregate numbers >>>>>>> based >>>>>>> on it. In future iterations, we will be digging into different >>>>>>> breakdowns >>>>>>> of this metric, and iterating on it to handle any inconsistencies or >>>>>>> unexpected results. There's a few differences from Web Stat Collector's >>>>>>> (WSC) version of the general filter that we want to call to your >>>>>>> attention >>>>>>> to. >>>>>>> >>>>>>> - We include searches -- WSC explicitly excludes them. >>>>>>> - We include Apps traffic -- WSC does not detect Apps traffic >>>>>>> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) >>>>>>> -- WSC hardcodes "/wiki/" >>>>>>> - We don't include Banner impressions -- WSC includes them. >>>>>>> >>>>>>> There are also some known issues with the new definition that are >>>>>>> worth your notice: >>>>>>> >>>>>>> >>>>>>> 1. *Internal traffic is counted* >>>>>>> >>>>>>> >>>>>>> - Note that WSC filters some internal traffic by hardcoding a >>>>>>> set of IPs in the definition. We are working on parsing puppet >>>>>>> templates >>>>>>> in order to automatically detect which IPs represent internal >>>>>>> traffic. >>>>>>> This will be a /better/ solution, but it's not quite ready yet >>>>>>> because >>>>>>> parsing puppet is hard. >>>>>>> >>>>>>> >>>>>>> 1. *Spider traffic is counted* >>>>>>> >>>>>>> >>>>>>> - We will be using the User-agent field to detect and flag >>>>>>> spider-based traffic. This "tag definition" will be delivered in a >>>>>>> subsequent definition. This actually matches WSC, which does not >>>>>>> filter >>>>>>> spider for the high-level metrics. >>>>>>> >>>>>>> These are problems we're aware of, and will be factoring in as we go >>>>>>> forward with our next task: refining the definition using real, >>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and >>>>>>> participated in the process thus far, particularly Nemo, Erik, and >>>>>>> Christian. >>>>>>> >>>>>>> 1. >>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters >>>>>>> >>>>>>> -Aaron & Oliver >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Analytics mailing list >>>>>>> [email protected] >>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> [email protected] >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> >>>>>> >>>>> >>>>> -- >>>>> Oliver Keyes >>>>> Research Analyst >>>>> Wikimedia Foundation >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>> >>> -- >>> Oliver Keyes >>> Research Analyst >>> Wikimedia Foundation >>> _______________________________________________ >>> Analytics mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> >>> >>> >>> _______________________________________________ >>> Analytics mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> >>> >> >> -- >> Oliver Keyes >> Research Analyst >> Wikimedia Foundation >> _______________________________________________ >> Analytics mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/analytics >> >> >> _______________________________________________ >> Analytics mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/analytics >> >> >> >> _______________________________________________ >> Analytics mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/analytics >> >> > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation > -- Oliver Keyes Research Analyst Wikimedia Foundation
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
