I’ve added an example to https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to use the UAParserUDF and the Hive get_json_object function to work with a user_agent_map.
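To give a flavor of the pattern documented there, it goes roughly like this (a sketch only, not copy-paste ready: `my_el_table` and the `json` column are hypothetical placeholders for wherever your EventLogging data lives, and the JSON path assumes the raw event is stored as a JSON string with a `userAgent` field):

```sql
-- Sketch: table and column names below are hypothetical placeholders.
ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;

CREATE TEMPORARY FUNCTION ua_parser AS 'org.wikimedia.analytics.refinery.hive.UAParserUDF';

-- Pull the raw user agent string out of the JSON event, then parse it:
SELECT ua_parser(get_json_object(json, '$.userAgent'))
FROM my_el_table
LIMIT 10;
```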
Unfortunately, we can't manage tables in Hive for every EventLogging schema/revision the way we do in MySQL, so you have to create your own table. It *should* be possible to specify the schema and use org.apache.hive.hcatalog.data.JsonSerDe, but I haven't tried this. Hope that helps!

On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <[email protected]> wrote:

> Just a heads up:
>
> user_agent field is a PII field (privacy sensitive), and as such it is
> purged after 90 days. If there were a user_agent_map field, it would need
> to be purged after 90 days as well.
>
> A more permanent option might be to detect the browser family on the
> JavaScript client, e.g. with duck-typing [1], and send it as part of the
> explicit schema. The browser family by itself is not identifying enough to
> be considered PII, and could be kept indefinitely.
>
> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>
> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <[email protected]> wrote:
>
>> It's not just a question of which value to choose, but also how to sort.
>> It would be nice to be able to choose sorting in alphabetical order vs.
>> numerical order. It would also be nice to assign a default sort to any
>> item label that is taken from the Wikipedia {{DEFAULTSORT}} template
>> (though that won't work for items without a Wikipedia article).
>>
>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <[email protected]> wrote:
>>
>>> The problem with working on EL data in Hive is that the schemas for the
>>> tables can change at any point, in backwards-incompatible ways. And
>>> maintaining tables dynamically is harder here than in the MySQL world
>>> (where EL just tries to insert, and creates the table on failure). So,
>>> while it's relatively easy to use ua-parser (see below), you can't
>>> easily access EL data in Hive tables. However, we do have all EL data
>>> in Hadoop, so you can access it with Spark.
>>> Andrew's about to answer with more details on that. I just thought this
>>> might be useful if you sqoop EL data from MySQL or otherwise import it
>>> into a Hive table:
>>>
>>> From stat1002, start hive, then:
>>>
>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>
>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>
>>> select ua_parser('Wikimedia Bot');
>>>
>>> On Thu, Sep 15, 2016 at 1:06 AM, Federico Leva (Nemo) <[email protected]> wrote:
>>>
>>>> Tilman Bayer, 15/09/2016 01:21:
>>>>
>>>>> This came up recently with the Reading web team, for the purpose of
>>>>> investigating whether certain issues are caused by certain browsers
>>>>> only. But I imagine it has arisen in other places as well.
>>>>
>>>> Definitely. https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
>>>>
>>>> Nemo
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
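P.S. In case it helps, a JsonSerDe table along the lines I mentioned above might look something like this. This is untested (as I said, I haven't tried it myself): the table name, event fields, and LOCATION are all hypothetical placeholders, and you may need to add the hive-hcatalog-core JAR first for the SerDe class to resolve.

```sql
-- Untested sketch: table name, struct fields, and LOCATION are hypothetical.
-- The SerDe class may require the hive-hcatalog-core JAR to be on the path.
CREATE EXTERNAL TABLE my_el_events (
  `uuid`      string,
  `timestamp` string,
  `event`     struct<userAgent: string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/path/to/your/el/data';
```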
