FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments 'content_type': No matching method for class org.wikimedia.analytics.refinery.hive.IsPageviewUDF with (string, string, string, string, string). Possible choices: _FUNC_(string, string, string, string, string, string)
whut

On 16 December 2014 at 10:08, Andrew Otto <[email protected]> wrote:

> That might be an email copy/paste problem; I see the non-symmetrical quotes
> in my email.
>
> On Dec 16, 2014, at 09:38, Toby Negrin <[email protected]> wrote:
>
> Note that in Oliver's example, the quotes are double quotes, not single
> quotes. I didn't see the difference immediately.
>
> -Toby
>
> On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes <[email protected]>
> wrote:
>>
>> Note that Andrew's example code doesn't run (at least, for me) because it
>> needs to be:
>>
>> CREATE TEMPORARY FUNCTION is_pageview as
>> "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
>>
>> Hive gets stupider every time I try to use it ;p
>>
>> On 15 December 2014 at 20:47, Oliver Keyes <[email protected]> wrote:
>>>
>>> Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll
>>> want the output fired over to limn.
>>>
>>> On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote:
>>>>
>>>> This needs more testing! Validation! Etc. But woo!
>>>> https://gerrit.wikimedia.org/r/#/c/180023
>>>>
>>>> This lets you do:
>>>>
>>>> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
>>>>
>>>> CREATE TEMPORARY FUNCTION is_pageview as
>>>> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
>>>>
>>>> SELECT
>>>>   LOWER(uri_host) as uri_host,
>>>>   count(*) as pageview_count
>>>> FROM
>>>>   wmf_raw.webrequest
>>>> WHERE
>>>>   (webrequest_source = 'text' or webrequest_source = 'mobile')
>>>>   AND year=2014
>>>>   AND month=12
>>>>   AND day=7
>>>>   AND hour=12
>>>>   AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
>>>> GROUP BY
>>>>   LOWER(uri_host)
>>>> ORDER BY pageview_count desc
>>>> LIMIT 10
>>>> ;
>>>>
>>>> …
>>>>
>>>> uri_host                pageview_count
>>>> en.wikipedia.org        6613046
>>>> en.m.wikipedia.org      3223273
>>>> ru.wikipedia.org        2119850
>>>> ja.m.wikipedia.org      1501954
>>>> ja.wikipedia.org        1411533
>>>> de.wikipedia.org        1330252
>>>> zh.wikipedia.org        949228
>>>> fr.wikipedia.org        939602
>>>> commons.wikimedia.org   912965
>>>> de.m.wikipedia.org      664661
>>>>
>>>> Time taken: 94.295 seconds, Fetched: 10 row(s)
>>>>
>>>> On Dec 15, 2014, at 16:02, Dario Taraborelli <
>>>> [email protected]> wrote:
>>>>
>>>> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving
>>>> on with the implementation.
>>>>
>>>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]>
>>>> wrote:
>>>>
>>>> Totally!
>>>>
>>>> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote:
>>>>>
>>>>> Ah cool, didn’t realize there was a neutral definition. We should
>>>>> call that the ‘formal specification’ then.
>>>>>
>>>>> ...of course, now that I've said that, cosmic irony demands we end up
>>>>> implementing in C, or something.
>>>>>
>>>>> Hm, a UDF that does this rather than a Hive query would probably be
>>>>> better. E.g.
>>>>>
>>>>> SELECT
>>>>>   request_qualifier(uri_host),
>>>>>   count(*)
>>>>> FROM
>>>>>   wmf_raw.webrequest
>>>>> WHERE
>>>>>   is_pageview(uri_host, uri_path, http_status, content_type)
>>>>> GROUP BY
>>>>>   request_qualifier(uri_host)
>>>>> ;
>>>>>
>>>>> Or something like that.
>>>>>
>>>>> -Ao
>>>>>
>>>>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
>>>>>
>>>>> It's totally tech-agnostic; the neutral definition is on meta. The
>>>>> hive query is just because, since we suspect that's how we'll be
>>>>> generating the data, it makes sense to turn the draft def into HQL for
>>>>> exploratory queries and testing.
>>>>>
>>>>> ...of course, now that I've said that, cosmic irony demands we end up
>>>>> implementing in C, or something.
>>>>>
>>>>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> I think the hive code is "representative" in that it's an
>>>>>> implementation. It's certainly not the only permitted one.
>>>>>>
>>>>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> We're moving forward to generate Hive queries that will represent
>>>>>> the formal specification.
>>>>>>
>>>>>> Should a specific implementation (e.g. Hive) represent the formal
>>>>>> specification? I tend to think it should be tech-agnostic, no?
>>>>>>
>>>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Toby, that's right. We're moving forward to generate Hive queries
>>>>>> that will represent the formal specification.
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> We've written the draft Hive queries and I'm reviewing them with
>>>>>>> Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure
>>>>>>> we'll work it through :).
>>>>>>>
>>>>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Aaron, all --
>>>>>>>>
>>>>>>>> I haven't seen any discussion on this, which is a sign that we can
>>>>>>>> move forward with turning over the draft. Thoughts?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> -Toby
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey folks,
>>>>>>>>>
>>>>>>>>> As discussions on the new page view definition have been calming
>>>>>>>>> down, we're preparing to deliver a draft version to the Devs. I want
>>>>>>>>> to make sure that we all know the status and that any substantial
>>>>>>>>> concerns are raised before we hand things off on *Friday, Dec 12th.*
>>>>>>>>>
>>>>>>>>> For this phase, we are delivering the general filter[1]. This is
>>>>>>>>> the highest level filter, and exists primarily to distinguish requests
>>>>>>>>> worthy of further evaluation. Our plan is to take the definition as it
>>>>>>>>> exists on the 12th, and begin generating high-level aggregate numbers
>>>>>>>>> based on it. In future iterations, we will be digging into different
>>>>>>>>> breakdowns of this metric, and iterating on it to handle any
>>>>>>>>> inconsistencies or unexpected results. There are a few differences
>>>>>>>>> from Web Stat Collector's (WSC) version of the general filter that we
>>>>>>>>> want to call to your attention.
>>>>>>>>>
>>>>>>>>>    - We include searches -- WSC explicitly excludes them.
>>>>>>>>>    - We include Apps traffic -- WSC does not detect Apps traffic.
>>>>>>>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/,
>>>>>>>>>    /sr-ec/) -- WSC hardcodes "/wiki/".
>>>>>>>>>    - We don't include Banner impressions -- WSC includes them.
>>>>>>>>>
>>>>>>>>> There are also some known issues with the new definition that are
>>>>>>>>> worth your notice:
>>>>>>>>>
>>>>>>>>>    1. *Internal traffic is counted*
>>>>>>>>>
>>>>>>>>>    - Note that WSC filters some internal traffic by hardcoding a
>>>>>>>>>    set of IPs in the definition. We are working on parsing puppet
>>>>>>>>>    templates in order to automatically detect which IPs represent
>>>>>>>>>    internal traffic. This will be a /better/ solution, but it's not
>>>>>>>>>    quite ready yet because parsing puppet is hard.
>>>>>>>>>
>>>>>>>>>    2. *Spider traffic is counted*
>>>>>>>>>
>>>>>>>>>    - We will be using the User-agent field to detect and flag
>>>>>>>>>    spider-based traffic. This "tag definition" will be delivered in a
>>>>>>>>>    subsequent definition. This actually matches WSC, which does not
>>>>>>>>>    filter spiders for the high-level metrics.
>>>>>>>>>
>>>>>>>>> These are problems we're aware of, and will be factoring in as we
>>>>>>>>> go forward with our next task: refining the definition using real,
>>>>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback
>>>>>>>>> and participated in the process thus far, particularly Nemo, Erik, and
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> 1.
>>>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>>>>
>>>>>>>>> -Aaron & Oliver
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Keyes
>>>>>>> Research Analyst
>>>>>>> Wikimedia Foundation

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
