I think we could also consider doing the parsing in EL/MySQL, so that the
user agent is never stored raw in tables but is always parsed. We could use
the Python ua-parser library, and the results should be identical to the
ones we have in Hive.
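
A rough sketch of what that could look like, replacing the raw string with
a parsed map before the event reaches a table. The field names user_agent
and user_agent_map and the helper replace_raw_user_agent are assumptions
for illustration, not existing EventLogging code, and the parser is passed
in so the sketch does not depend on any particular library:

```python
import json


def replace_raw_user_agent(event, parse_ua):
    """Return a copy of `event` with the raw UA swapped for a parsed map.

    `event` is a dict; `parse_ua` is any callable that turns a UA string
    into a dict. Both field names below are hypothetical.
    """
    cleaned = dict(event)  # shallow copy, leave the caller's event alone
    raw = cleaned.pop("user_agent", None)
    if raw is not None:
        # Store the parsed result as a JSON string so it fits one column.
        cleaned["user_agent_map"] = json.dumps(parse_ua(raw), sort_keys=True)
    return cleaned


# With the ua-parser package installed, its parser could be plugged in:
#   from ua_parser import user_agent_parser
#   replace_raw_user_agent(event, user_agent_parser.Parse)
```

Events without a user_agent field pass through unchanged, and the raw
string never appears in the returned event.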

Thanks,

Nuria


On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto <[email protected]> wrote:

> I’ve added an example to
> https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive
> on how to use the UAParserUDF and the Hive get_json_object function to
> work with a user_agent_map.
>
> Unfortunately we can’t manage tables in Hive for every EventLogging
> schema/revision like we do in MySQL.  So, you have to create your own
> table. It *should* be possible to specify the schema and use
> the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
>
> Hope that helps!
>
> On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> Just a heads up:
>>
>> The user_agent field is a PII field (privacy sensitive), and as such it
>> is purged after 90 days. If there were a user_agent_map field, it should
>> be purged after 90 days as well.
>>
>> Another, more permanent option might be to detect the browser family on
>> the JavaScript client with, e.g., duck-typing[1] and send it as part of
>> the explicit schema. The browser family by itself is not identifying
>> enough to be considered PII, and could be kept indefinitely.
>>
>> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>>
>> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <[email protected]> wrote:
>>
>>> It's not just a question of which value to choose, but also how to sort.
>>> It would be nice to be able to choose sorting in alphabetical order vs
>>> numerical order. It would also be nice to assign a default sort to any item
>>> label that is taken from the Wikipedia {{DEFAULTSORT}} template (though
>>> that won't work for items without a Wikipedia article).
>>>
>>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <
>>> [email protected]> wrote:
>>>
>>>> The problem with working on EL data in Hive is that the schemas for the
>>>> tables can change at any point, in backwards-incompatible ways.  And
>>>> maintaining tables dynamically is harder here than in the MySQL world
>>>> (where EL just tries to insert, and creates the table on failure).  So,
>>>> while it's relatively easy to use ua-parser (see below), you can't easily
>>>> access EL data in Hive tables.  However, we do have all EL data in Hadoop,
>>>> so you can access it with Spark.  Andrew's about to answer with more
>>>> details on that. I just thought this might be useful if you sqoop EL data
>>>> from MySQL or otherwise import it into a Hive table:
>>>>
>>>>
>>>> from stat1002, start hive, then:
>>>>
>>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>>
>>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>>
>>>> select ua_parser('Wikimedia Bot');
>>>>
>>>> On Thu, Sep 15, 2016 at 1:06 AM, Federico Leva (Nemo) <
>>>> [email protected]> wrote:
>>>>
>>>>> Tilman Bayer, 15/09/2016 01:21:
>>>>>
>>>>>> This came up recently with the Reading web team, for the purpose of
>>>>>> investigating whether certain issues are caused by certain browsers
>>>>>> only. But I imagine it has arisen in other places as well.
>>>>>>
>>>>>
>>>>> Definitely. https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
>>>>>
>>>>> Nemo
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>
>> --
>> *Marcel Ruiz Forns*
>> Analytics Developer
>> Wikimedia Foundation