I’ve added an example to
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to
use the UAParserUDF and the Hive get_json_object function to work with a
user_agent_map.
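
For anyone who wants the gist without clicking through, the pattern is roughly this (a sketch only — the table and column names below are placeholders, not a real schema):

```sql
-- Sketch: pull the userAgent field out of a raw JSON event string and
-- parse it with the UAParserUDF registered as ua_parser (registration
-- shown further down in this thread).
-- `my_el_table` and its `json` column are hypothetical names.
SELECT
  get_json_object(json, '$.userAgent')            AS raw_user_agent,
  ua_parser(get_json_object(json, '$.userAgent')) AS user_agent_map
FROM my_el_table
LIMIT 10;
```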

Unfortunately we can’t manage tables in Hive for every EventLogging
schema/revision like we do in MySQL.  So, you have to create your own
table. It *should* be possible to specify the schema and use
the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
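
If anyone wants to try, an untested sketch might look like the following. The column names and LOCATION are placeholders — you'd fill in the actual schema fields and HDFS path, and you may need to ADD JAR the hive-hcatalog-core jar first:

```sql
-- Untested sketch of a JsonSerDe-backed external table over EventLogging
-- JSON. All column names and the LOCATION path are hypothetical.
CREATE EXTERNAL TABLE my_el_schema (
  uuid        STRING,
  `timestamp` STRING,  -- backticked: `timestamp` is a reserved word in Hive
  userAgent   STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/path/to/eventlogging/data';
```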

Hope that helps!

On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <[email protected]>
wrote:

> Just a heads up:
>
> user_agent field is a PII field (privacy sensitive), and as such it is
> purged after 90 days. If there would be a user_agent_map field, it should
> be purged after 90 days as well.
>
> Another, more permanent option might be to detect the browser family on the
> JavaScript client (e.g. with duck-typing[1]) and send it as part of the
> explicit schema. The browser family by itself is not identifying enough to
> be considered PII, and could be kept indefinitely.
>
> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>
> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <[email protected]> wrote:
>
>> It's not just a question of which value to choose, but also how to sort.
>> It would be nice to be able to choose sorting in alphabetical order vs
>> numerical order. It would also be nice to assign a default sort to any item
>> label that is taken from the Wikipedia {{DEFAULTSORT}} template (though
>> that won't work for items without a Wikipedia article).
>>
>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <[email protected]> wrote:
>>
>>> The problem with working on EL data in Hive is that the schemas for the
>>> tables can change at any point, in backwards-incompatible ways.  And
>>> maintaining tables dynamically is harder here than in the MySQL world
>>> (where EL just tries to insert, and creates the table on failure).  So,
>>> while it's relatively easy to use ua-parser (see below), you can't easily
>>> access EL data in Hive tables.  However, we do have all EL data in Hadoop,
>>> so you can access it with Spark.  Andrew's about to answer with more
>>> details on that.  I just thought this might be useful if you sqoop EL data
>>> from MySQL or otherwise import it into a Hive table:
>>>
>>>
>>> From stat1002, start Hive, then:
>>>
>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>
>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>
>>> select ua_parser('Wikimedia Bot');
>>>
>>> On Thu, Sep 15, 2016 at 1:06 AM, Federico Leva (Nemo) <[email protected]> wrote:
>>>
>>>> Tilman Bayer, 15/09/2016 01:21:
>>>>
>>>>> This came up recently with the Reading web team, for the purpose of
>>>>> investigating whether certain issues are caused by certain browsers
>>>>> only. But I imagine it has arisen in other places as well.
>>>>>
>>>>
>>>> Definitely. https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
>>>>
>>>> Nemo
>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
>
>