Great!

On Wed, Jan 7, 2015 at 5:49 PM, Andrew Otto <[email protected]> wrote:

> For Hadoop Streaming, it’ll be a little annoying.  This new data is in
> Parquet.  Hadoop Streaming is still using the old MapReduce 1 API, and most
> of the officially supported Parquet input formats are for MapReduce 2 API,
> so by default Parquet and Hadoop Streaming are incompatible.
>
> However!  Some guy already ran into this problem and wrote this:
>
>
> https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java
>
>
>
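For anyone who wants to try the Hadoop Streaming route from Python, here is a rough sketch of what a mapper could look like. It is not taken from the linked project, and it rests on assumptions I have not verified: that the JSON input format hands the mapper one JSON object per line on stdin (possibly preceded by a key and a tab), and that the pre-classified flag appears under a key named "is_pageview" (the thread calls the field isPageview, so check the table schema for the real name). The helper names are made up for illustration.

```python
import json
import sys

# Assumption: name of the pre-classified flag in each record.
# Check the webrequest table schema for the actual column name.
PAGEVIEW_FIELD = "is_pageview"


def parse_record(line):
    """Return the JSON object found in one input line, or None."""
    # Assumption: a line is either bare JSON or "key<TAB>json", so
    # parsing starts at the first opening brace.
    start = line.find("{")
    if start < 0:
        return None
    try:
        record = json.loads(line[start:])
    except ValueError:
        # Malformed JSON: skip the line instead of failing the task.
        return None
    return record if isinstance(record, dict) else None


def filter_pageviews(lines):
    """Yield only the records whose pageview flag is the boolean true."""
    for line in lines:
        record = parse_record(line)
        if record is not None and record.get(PAGEVIEW_FIELD) is True:
            yield record


def main():
    # Hadoop Streaming feeds input records on stdin and collects
    # whatever the mapper writes to stdout, one record per line.
    for record in filter_pageviews(sys.stdin):
        sys.stdout.write(json.dumps(record, sort_keys=True) + "\n")
```

To run it as a mapper, call main() at the bottom of the script. Keeping the filtering in filter_pageviews() means it can be tried locally on a few sample lines before going anywhere near the cluster.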
> On Jan 7, 2015, at 18:40, Nuria Ruiz <[email protected]> wrote:
>
> I am not sure if this is quite what you are asking but just in case:
>
> For streaming it is probably easier for you to use the newly created
> webrequest tables:
>
>
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
>
> Those include an isPageview field so requests are pre-classified. You will
> need to wait a bit as data for those tables is being populated starting
> today.
>
>
>
> On Wed, Jan 7, 2015 at 3:35 PM, Aaron Halfaker <[email protected]>
> wrote:
>
>> Cool!  Let's say I want to review the filters and apply them in a python
>> script.  What should I reference?
>>
>> On Wed, Jan 7, 2015 at 5:13 PM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> I'm pleased to say we now have the prototype pageviews definition as a
>>> UDF!
>>>
>>> For those with cluster access:
>>>
>>> CREATE TEMPORARY FUNCTION pageview as
>>> 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
>>>
>>> ...and then just apply it. It outputs a boolean, so you can easily go
>>> WHERE pageview(fields) and treat it as a conditional. Great
>>> success!
>>>
>>> What this means for the definition is twofold; it means it's a lot
>>> easier to test its accuracy, and it means that it's a lot easier to
>>> make sure we're all using the same definition going forward. Once we
>>> have the legacy definition as a UDF, refining and testing will proceed
>>> at great speed, although I encourage anyone with time on their hands
>>> who wants to help out to do some testing of their own :)
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
