FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments
'content_type': No matching method for class
org.wikimedia.analytics.refinery.hive.IsPageviewUDF with (string, string,
string, string, string). Possible choices: _FUNC_(string, string, string,
string, string, string)

whut

On 16 December 2014 at 10:08, Andrew Otto <[email protected]> wrote:

> That might be a email copy/paste problem; I see the non-symetrical quotes
> in my email.
>
>
> On Dec 16, 2014, at 09:38, Toby Negrin <[email protected]> wrote:
>
> Note that in Oliver's example, the quotes are double quotes, not single
> quotes. I didn't see the difference immediately.
>
> -Toby
>
> On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes <[email protected]>
> wrote:
>>
>> Note that Andrew's example code doesn't run (at least, for me) because it
>> needs to be:
>>
>> CREATE TEMPORARY FUNCTION is_pageview as
>> "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
>>
>> Hive gets stupider every time I try to use it ;p
>>
>> On 15 December 2014 at 20:47, Oliver Keyes <[email protected]> wrote:
>>>
>>> Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll
>>> want the output fired over to limn.
>>>
>>> On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote:
>>>>
>>>> This needs more testing!  Validation!  Etc.  But woo!
>>>> https://gerrit.wikimedia.org/r/#/c/180023
>>>>
>>>> This let’s you do:
>>>>
>>>>
>>>>
>>>> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
>>>>
>>>> CREATE TEMPORARY FUNCTION is_pageview as
>>>> 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
>>>>
>>>>   SELECT
>>>>     LOWER(uri_host) as uri_host,
>>>>     count(*) as pageview_count
>>>>   FROM
>>>>     wmf_raw.webrequest
>>>>   WHERE
>>>>    (webrequest_source = 'text' or webrequest_source = 'mobile')
>>>>     AND year=2014
>>>>     AND month=12
>>>>     AND day=7
>>>>     AND hour=12
>>>>     AND is_pageview(LOWER(uri_host), uri_path, http_status,
>>>> content_type)
>>>>   GROUP BY
>>>>     LOWER(uri_host)
>>>>   ORDER BY pageview_count desc
>>>>   LIMIT 10
>>>> ;
>>>>
>>>> …
>>>>
>>>> uri_host         pageview_count
>>>>
>>>> en.wikipedia.org 6613046
>>>> en.m.wikipedia.org 3223273
>>>> ru.wikipedia.org 2119850
>>>> ja.m.wikipedia.org 1501954
>>>> ja.wikipedia.org 1411533
>>>> de.wikipedia.org 1330252
>>>> zh.wikipedia.org 949228
>>>> fr.wikipedia.org 939602
>>>> commons.wikimedia.org 912965
>>>> de.m.wikipedia.org 664661
>>>>
>>>> Time taken: 94.295 seconds, Fetched: 10 row(s)
>>>>
>>>>
>>>>
>>>> On Dec 15, 2014, at 16:02, Dario Taraborelli <
>>>> [email protected]> wrote:
>>>>
>>>> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving
>>>> on with the implementation.
>>>>
>>>> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]>
>>>> wrote:
>>>>
>>>> Totally!
>>>>
>>>> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote:
>>>>>
>>>>> Ah cool, didn’t realize there was a neutral definition.  We should
>>>>> call that the ‘formal specification’ then.
>>>>>
>>>>> ...of course, now that I've said that, cosmic irony demands we end up
>>>>> implementing in C, or something.
>>>>>
>>>>> Hm, a UDF that does this rather than a Hive query would probably be
>>>>> better.  E.g.
>>>>>
>>>>>   SELECT
>>>>>     request_qualifier(uri_host),
>>>>>     count(*)
>>>>>   FROM
>>>>>     wmf_raw.webrequest
>>>>>   WHERE
>>>>>     is_pageview(uri_host, uri_path, http_status, content_type)
>>>>>   GROUP BY
>>>>>     request_qualifier(uri_host)
>>>>>   ;
>>>>>
>>>>>
>>>>> Or something like that.
>>>>>
>>>>> -Ao
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
>>>>>
>>>>> It's totally tech-agnostic; the neutral definition is on meta. The
>>>>> hive query is just because, since we suspect that's how we'll be 
>>>>> generating
>>>>> the data, it makes sense to turn the draft def into HQL for exploratory
>>>>> queries and testing.
>>>>>
>>>>> ...of course, now that I've said that, cosmic irony demands we end up
>>>>> implementing in C, or something.
>>>>>
>>>>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> I think the hive code is "representative" in that it's an
>>>>>> implementation. It's certainly not the only permitted one.
>>>>>>
>>>>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>  We're moving forward to generate Hive queries that will represent
>>>>>> the formal specification.
>>>>>>
>>>>>> Should a specific implementation (e.g. Hive) represent the formal
>>>>>> specification?  I tend to think it should be tech-agnostic, no?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Toby, that's right.  We're moving forward to generate Hive queries
>>>>>> that will represent the formal specification.
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> We've written the draft Hive queries and I'm reviewing them with
>>>>>>> Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure 
>>>>>>> we'll
>>>>>>> work it through :).
>>>>>>>
>>>>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Aaron, all --
>>>>>>>>
>>>>>>>> I haven't seen any discussion on this which is a sign that we can
>>>>>>>> forward with turning over the draft. Thoughts?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>>
>>>>>>>> -Toby
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey folks,
>>>>>>>>>
>>>>>>>>> As discussions on the new page view definition have been calming
>>>>>>>>> down, we're preparing to deliver a draft version to the Devs.  I want 
>>>>>>>>> to
>>>>>>>>> make sure that we all know the status and that any substantial 
>>>>>>>>> concerns are
>>>>>>>>> raised before we hand things off on *Friday, Dec 12th.*
>>>>>>>>>
>>>>>>>>> For this phase, we are delivering the general filter[1].  This is
>>>>>>>>> the highest level filter, and exists primarily to distinguish requests
>>>>>>>>> worthy of further evaluation. Our plan is to take the definition as it
>>>>>>>>> exists on the 12th, and begin generating high-level aggregate numbers 
>>>>>>>>> based
>>>>>>>>> on it. In future iterations, we will be digging into different 
>>>>>>>>> breakdowns
>>>>>>>>> of this metric, and iterating on it to handle any inconsistencies or
>>>>>>>>> unexpected results.  There's a few differences from Web Stat 
>>>>>>>>> Collector's
>>>>>>>>> (WSC) version of the general filter that we want to call to your 
>>>>>>>>> attention
>>>>>>>>> to.
>>>>>>>>>
>>>>>>>>>    - We include searches -- WSC explicitly excludes them.
>>>>>>>>>    - We include Apps traffic -- WSC does not detect Apps traffic
>>>>>>>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/,
>>>>>>>>>    /sr-ec/) -- WSC hardcodes "/wiki/"
>>>>>>>>>    - We don't include Banner impressions -- WSC includes them.
>>>>>>>>>
>>>>>>>>> There are also some known issues with the new definition that are
>>>>>>>>> worth your notice:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. *Internal traffic is counted*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Note that WSC filters some internal traffic by hardcoding a
>>>>>>>>>    set of IPs in the definition.  We are working on parsing puppet 
>>>>>>>>> templates
>>>>>>>>>    in order to automatically detect which IPs represent internal 
>>>>>>>>> traffic.
>>>>>>>>>    This will be a /better/ solution, but it's not quite ready yet 
>>>>>>>>> because
>>>>>>>>>    parsing puppet is hard.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. *Spider traffic is counted*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - We will be using the User-agent field to detect and flag
>>>>>>>>>    spider-based traffic.  This "tag definition" will be delivered in a
>>>>>>>>>    subsequent definition.  This actually matches WSC, which does not 
>>>>>>>>> filter
>>>>>>>>>    spider for the high-level metrics.
>>>>>>>>>
>>>>>>>>> These are problems we're aware of, and will be factoring in as we
>>>>>>>>> go forward with our next task: refining the definition using real,
>>>>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback 
>>>>>>>>> and
>>>>>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> 1.
>>>>>>>>> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>>>>
>>>>>>>>> -Aaron & Oliver
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>>
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Analytics mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Keyes
>>>>>>> Research Analyst
>>>>>>> Wikimedia Foundation
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>>>>>  _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>  _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to