Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want the output fired over to limn.
On 15 December 2014 at 19:01, Andrew Otto <[email protected]> wrote:
>
> This needs more testing! Validation! Etc. But woo!
> https://gerrit.wikimedia.org/r/#/c/180023
>
> This lets you do:
>
> ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
>
> CREATE TEMPORARY FUNCTION is_pageview as
>   'org.wikimedia.analytics.refinery.hive.IsPageviewUDF';
>
> SELECT
>   LOWER(uri_host) as uri_host,
>   count(*) as pageview_count
> FROM
>   wmf_raw.webrequest
> WHERE
>   (webrequest_source = 'text' or webrequest_source = 'mobile')
>   AND year=2014
>   AND month=12
>   AND day=7
>   AND hour=12
>   AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
> GROUP BY
>   LOWER(uri_host)
> ORDER BY pageview_count desc
> LIMIT 10
> ;
>
> …
>
> uri_host                 pageview_count
>
> en.wikipedia.org         6613046
> en.m.wikipedia.org       3223273
> ru.wikipedia.org         2119850
> ja.m.wikipedia.org       1501954
> ja.wikipedia.org         1411533
> de.wikipedia.org         1330252
> zh.wikipedia.org         949228
> fr.wikipedia.org         939602
> commons.wikimedia.org    912965
> de.m.wikipedia.org       664661
>
> Time taken: 94.295 seconds, Fetched: 10 row(s)
>
>
> On Dec 15, 2014, at 16:02, Dario Taraborelli <[email protected]>
> wrote:
>
> Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on
> with the implementation.
>
> On Dec 15, 2014, at 11:32 AM, Oliver Keyes <[email protected]> wrote:
>
> Totally!
>
> On 15 December 2014 at 14:22, Andrew Otto <[email protected]> wrote:
>>
>> Ah cool, didn’t realize there was a neutral definition. We should call
>> that the ‘formal specification’ then.
>>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
>>
>> Hm, a UDF that does this rather than a Hive query would probably be
>> better. E.g.
>>
>> SELECT
>>   request_qualifier(uri_host),
>>   count(*)
>> FROM
>>   wmf_raw.webrequest
>> WHERE
>>   is_pageview(uri_host, uri_path, http_status, content_type)
>> GROUP BY
>>   request_qualifier(uri_host)
>> ;
>>
>>
>> Or something like that.
>>
>> -Ao
>>
>>
>> On Dec 15, 2014, at 14:07, Oliver Keyes <[email protected]> wrote:
>>
>> It's totally tech-agnostic; the neutral definition is on meta. The hive
>> query is just because, since we suspect that's how we'll be generating the
>> data, it makes sense to turn the draft def into HQL for exploratory queries
>> and testing.
>>
>> ...of course, now that I've said that, cosmic irony demands we end up
>> implementing in C, or something.
>>
>> On 15 December 2014 at 13:46, Toby Negrin <[email protected]> wrote:
>>>
>>> I think the hive code is "representative" in that it's an
>>> implementation. It's certainly not the only permitted one.
>>>
>>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <[email protected]> wrote:
>>>
>>> We're moving forward to generate Hive queries that will represent the
>>> formal specification.
>>>
>>> Should a specific implementation (e.g. Hive) represent the formal
>>> specification? I tend to think it should be tech-agnostic, no?
>>>
>>>
>>> On Dec 15, 2014, at 12:15, Aaron Halfaker <[email protected]>
>>> wrote:
>>>
>>> Toby, that's right. We're moving forward to generate Hive queries that
>>> will represent the formal specification.
>>>
>>> -Aaron
>>>
>>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <[email protected]>
>>> wrote:
>>>
>>>> We've written the draft Hive queries and I'm reviewing them with Otto
>>>> now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work
>>>> it through :).
>>>>
>>>> On 15 December 2014 at 12:10, Toby Negrin <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi Aaron, all --
>>>>>
>>>>> I haven't seen any discussion on this, which is a sign that we can
>>>>> move forward with turning over the draft. Thoughts?
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Toby
>>>>>
>>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> As discussions on the new page view definition have been calming
>>>>>> down, we're preparing to deliver a draft version to the Devs. I want to
>>>>>> make sure that we all know the status and that any substantial concerns
>>>>>> are raised before we hand things off on *Friday, Dec 12th.*
>>>>>>
>>>>>> For this phase, we are delivering the general filter[1]. This is the
>>>>>> highest level filter, and exists primarily to distinguish requests worthy
>>>>>> of further evaluation. Our plan is to take the definition as it exists on
>>>>>> the 12th, and begin generating high-level aggregate numbers based on it.
>>>>>> In future iterations, we will be digging into different breakdowns of
>>>>>> this metric, and iterating on it to handle any inconsistencies or
>>>>>> unexpected results. There are a few differences from Web Stat
>>>>>> Collector's (WSC) version of the general filter that we want to call to
>>>>>> your attention.
>>>>>>
>>>>>>    - We include searches -- WSC explicitly excludes them.
>>>>>>    - We include Apps traffic -- WSC does not detect Apps traffic.
>>>>>>    - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/)
>>>>>>      -- WSC hardcodes "/wiki/".
>>>>>>    - We don't include Banner impressions -- WSC includes them.
>>>>>>
>>>>>> There are also some known issues with the new definition that are
>>>>>> worth your notice:
>>>>>>
>>>>>>    1. *Internal traffic is counted*
>>>>>>
>>>>>>       - Note that WSC filters some internal traffic by hardcoding a set
>>>>>>         of IPs in the definition. We are working on parsing puppet
>>>>>>         templates in order to automatically detect which IPs represent
>>>>>>         internal traffic. This will be a /better/ solution, but it's not
>>>>>>         quite ready yet because parsing puppet is hard.
>>>>>>
>>>>>>    2. *Spider traffic is counted*
>>>>>>
>>>>>>       - We will be using the User-agent field to detect and flag
>>>>>>         spider-based traffic. This "tag definition" will be delivered in
>>>>>>         a subsequent definition. This actually matches WSC, which does
>>>>>>         not filter spiders for the high-level metrics.
>>>>>>
>>>>>> These are problems we're aware of, and will be factoring in as we go
>>>>>> forward with our next task: refining the definition using real,
>>>>>> hourly-level traffic data. Thanks to everyone who has given feedback and
>>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>>> Christian.
>>>>>>
>>>>>> [1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>>
>>>>>> -Aaron & Oliver
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>
>>>> --
>>>> Oliver Keyes
>>>> Research Analyst
>>>> Wikimedia Foundation
>>>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
