Great!

On Wed, Jan 7, 2015 at 5:49 PM, Andrew Otto <[email protected]> wrote:
> I am not sure if this is quite what you are asking but just in case:
>
> For streaming it is probably easier for you to use the newly created
> webrequest tables:
>
> For Hadoop Streaming, it’ll be a little annoying. This new data is in
> Parquet. Hadoop Streaming is still using the old MapReduce 1 API, and most
> of the officially supported Parquet input formats are for the MapReduce 2
> API, so by default Parquet and Hadoop Streaming are incompatible.
>
> However! Some guy already ran into this problem and wrote this:
>
> https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java
>
> On Jan 7, 2015, at 18:40, Nuria Ruiz <[email protected]> wrote:
>
> I am not sure if this is quite what you are asking but just in case:
>
> For streaming it is probably easier for you to use the newly created
> webrequest tables:
>
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
>
> Those include an isPageview field so requests are pre-classified. You will
> need to wait a bit as data for those tables is being populated starting
> today.
>
> On Wed, Jan 7, 2015 at 3:35 PM, Aaron Halfaker <[email protected]>
> wrote:
>
>> Cool! Let's say I want to review the filters and apply them in a Python
>> script. What should I reference?
>>
>> On Wed, Jan 7, 2015 at 5:13 PM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> I'm pleased to say we now have the prototype pageviews definition as a
>>> UDF!
>>>
>>> For those with cluster access:
>>>
>>> CREATE TEMPORARY FUNCTION pageview as
>>> 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
>>>
>>> ...and then just apply it. It outputs a boolean, so you can easily go
>>> WHERE pageview(fields) and treat it as a conditional. Great
>>> success!
>>>
>>> What this means for the definition is twofold; it means it's a lot
>>> easier to test its accuracy, and it means that it's a lot easier to
>>> make sure we're all using the same definition going forward. Once we
>>> have the legacy definition as a UDF, refining and testing will proceed
>>> at great speed, although I encourage anyone with time on their hands
>>> who wants to help out to do some testing of their own :)
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
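
For the Python question in the thread, here is a minimal sketch of what a Hadoop Streaming mapper could look like. It assumes the input format linked above hands each record to the mapper as one JSON object per line, and that the pre-classified flag shows up under the name isPageview as a JSON boolean; I have not verified either against the cluster, and the function name filter_pageviews is made up for illustration. It does not re-implement the pageview definition, it only keeps records that were already classified.

```python
import json
import sys


def filter_pageviews(lines, flag_field="isPageview"):
    """Yield the input lines whose JSON record has a true pageview flag.

    Assumption: one JSON object per line, with a boolean field named
    `flag_field`. Both are guesses about the record layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get(flag_field) is True:
            yield line


if __name__ == "__main__":
    # Hadoop Streaming feeds records on stdin and collects stdout.
    for kept in filter_pageviews(sys.stdin):
        print(kept)
```

The filter is a plain generator so it can be tried on a handful of sample lines locally before being wired into a streaming job.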
