Note that in Oliver's example, the quotes are double quotes, not single quotes. I didn't see the difference immediately.
-Toby On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes <[email protected]> wrote: > > Note that Andrew's example code doesn't run (at least, for me) because it > needs to be: > > CREATE TEMPORARY FUNCTION is_pageview as > "org.wikimedia.analytics.refinery.hive.IsPageviewUDF"; > > Hive gets stupider every time I try to use it ;p > > On 15 December 2014 at 20:47, Oliver Keyes <[email protected]> wrote: >> >> Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll >> want the output fired over to limn. >> >> On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote: >>> >>> This needs more testing! Validation! Etc. But woo! >>> https://gerrit.wikimedia.org/r/#/c/180023 >>> >>> This let’s you do: >>> >>> >>> >>> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar; >>> >>> CREATE TEMPORARY FUNCTION is_pageview as >>> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’; >>> >>> SELECT >>> LOWER(uri_host) as uri_host, >>> count(*) as pageview_count >>> FROM >>> wmf_raw.webrequest >>> WHERE >>> (webrequest_source = 'text' or webrequest_source = 'mobile') >>> AND year=2014 >>> AND month=12 >>> AND day=7 >>> AND hour=12 >>> AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type) >>> GROUP BY >>> LOWER(uri_host) >>> ORDER BY pageview_count desc >>> LIMIT 10 >>> ; >>> >>> … >>> >>> uri_host pageview_count >>> >>> en.wikipedia.org 6613046 >>> en.m.wikipedia.org 3223273 >>> ru.wikipedia.org 2119850 >>> ja.m.wikipedia.org 1501954 >>> ja.wikipedia.org 1411533 >>> de.wikipedia.org 1330252 >>> zh.wikipedia.org 949228 >>> fr.wikipedia.org 939602 >>> commons.wikimedia.org 912965 >>> de.m.wikipedia.org 664661 >>> >>> Time taken: 94.295 seconds, Fetched: 10 row(s) >>> >>> >>> >>> On Dec 15, 2014, at 16:02, Dario Taraborelli <[email protected]> >>> wrote: >>> >>> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving >>> on with the implementation. >>> >>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]> wrote: >>> >>> Totally! >>> >>> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote: >>>> >>>> Ah cool, didn’t realize there was a neutral definition. We should call >>>> that the ‘formal specification’ then. >>>> >>>> ...of course, now that I've said that, cosmic irony demands we end up >>>> implementing in C, or something. >>>> >>>> Hm, a UDF that does this rather than a Hive query would probably be >>>> better. E.g. >>>> >>>> SELECT >>>> request_qualifier(uri_host), >>>> count(*) >>>> FROM >>>> wmf_raw.webrequest >>>> WHERE >>>> is_pageview(uri_host, uri_path, http_status, content_type) >>>> GROUP BY >>>> request_qualifier(uri_host) >>>> ; >>>> >>>> >>>> Or something like that. >>>> >>>> -Ao >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote: >>>> >>>> It's totally tech-agnostic; the neutral definition is on meta. The hive >>>> query is just because, since we suspect that's how we'll be generating the >>>> data, it makes sense to turn the draft def into HQL for exploratory queries >>>> and testing. >>>> >>>> ...of course, now that I've said that, cosmic irony demands we end up >>>> implementing in C, or something. >>>> >>>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]> >>>> wrote: >>>>> >>>>> I think the hive code is "representative" in that it's an >>>>> implementation. It's certainly not the only permitted one. >>>>> >>>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]> wrote: >>>>> >>>>> We're moving forward to generate Hive queries that will represent the >>>>> formal specification. >>>>> >>>>> Should a specific implementation (e.g. Hive) represent the formal >>>>> specification? I tend to think it should be tech-agnostic, no? >>>>> >>>>> >>>>> >>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]> >>>>> wrote: >>>>> >>>>> Toby, that's right. We're moving forward to generate Hive queries >>>>> that will represent the formal specification. >>>>> >>>>> -Aaron >>>>> >>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]> >>>>> wrote: >>>>> >>>>>> We've written the draft Hive queries and I'm reviewing them with Otto >>>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work >>>>>> it through :). >>>>>> >>>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]> >>>>>> wrote: >>>>>>> >>>>>>> Hi Aaron, all -- >>>>>>> >>>>>>> I haven't seen any discussion on this which is a sign that we can >>>>>>> forward with turning over the draft. Thoughts? >>>>>>> >>>>>>> thanks, >>>>>>> >>>>>>> -Toby >>>>>>> >>>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Hey folks, >>>>>>>> >>>>>>>> As discussions on the new page view definition have been calming >>>>>>>> down, we're preparing to deliver a draft version to the Devs. I want >>>>>>>> to >>>>>>>> make sure that we all know the status and that any substantial >>>>>>>> concerns are >>>>>>>> raised before we hand things off on *Friday, Dec 12th.* >>>>>>>> >>>>>>>> For this phase, we are delivering the general filter[1]. This is >>>>>>>> the highest level filter, and exists primarily to distinguish requests >>>>>>>> worthy of further evaluation. Our plan is to take the definition as it >>>>>>>> exists on the 12th, and begin generating high-level aggregate numbers >>>>>>>> based >>>>>>>> on it. In future iterations, we will be digging into different >>>>>>>> breakdowns >>>>>>>> of this metric, and iterating on it to handle any inconsistencies or >>>>>>>> unexpected results. There's a few differences from Web Stat >>>>>>>> Collector's >>>>>>>> (WSC) version of the general filter that we want to call to your >>>>>>>> attention >>>>>>>> to. >>>>>>>> >>>>>>>> - We include searches -- WSC explicitly excludes them. >>>>>>>> - We include Apps traffic -- WSC does not detect Apps traffic >>>>>>>> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, >>>>>>>> /sr-ec/) -- WSC hardcodes "/wiki/" >>>>>>>> - We don't include Banner impressions -- WSC includes them. >>>>>>>> >>>>>>>> There are also some known issues with the new definition that are >>>>>>>> worth your notice: >>>>>>>> >>>>>>>> >>>>>>>> 1. *Internal traffic is counted* >>>>>>>> >>>>>>>> >>>>>>>> - Note that WSC filters some internal traffic by hardcoding a >>>>>>>> set of IPs in the definition. We are working on parsing puppet >>>>>>>> templates >>>>>>>> in order to automatically detect which IPs represent internal >>>>>>>> traffic. >>>>>>>> This will be a /better/ solution, but it's not quite ready yet >>>>>>>> because >>>>>>>> parsing puppet is hard. >>>>>>>> >>>>>>>> >>>>>>>> 1. *Spider traffic is counted* >>>>>>>> >>>>>>>> >>>>>>>> - We will be using the User-agent field to detect and flag >>>>>>>> spider-based traffic. This "tag definition" will be delivered in a >>>>>>>> subsequent definition. This actually matches WSC, which does not >>>>>>>> filter >>>>>>>> spider for the high-level metrics. >>>>>>>> >>>>>>>> These are problems we're aware of, and will be factoring in as we >>>>>>>> go forward with our next task: refining the definition using real, >>>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback >>>>>>>> and >>>>>>>> participated in the process thus far, particularly Nemo, Erik, and >>>>>>>> Christian. >>>>>>>> >>>>>>>> 1. >>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters >>>>>>>> >>>>>>>> -Aaron & Oliver >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Analytics mailing list >>>>>>>> [email protected] >>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>>>> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Analytics mailing list >>>>>>> [email protected] >>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Oliver Keyes >>>>>> Research Analyst >>>>>> Wikimedia Foundation >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> [email protected] >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>> >>>> -- >>>> Oliver Keyes >>>> Research Analyst >>>> Wikimedia Foundation >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>> >>> -- >>> Oliver Keyes >>> Research Analyst >>> Wikimedia Foundation >>> _______________________________________________ >>> Analytics mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> >>> >>> _______________________________________________ >>> Analytics mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> >>> >>> >>> _______________________________________________ >>> Analytics mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/analytics >>> >>> >> >> -- >> Oliver Keyes >> Research Analyst >> Wikimedia Foundation >> > > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation > > _______________________________________________ > Analytics mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/analytics > >
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
