>I am hoping we can recover the garbled usernames from the raw JSON logs,
Please keep in mind that we only have logs from the last 90 days.
That said, we should be able to recover the user names from the logs as
UTF-8. Note that the encoding issue is not limited to user names: it affects
any string field that can hold a non-ASCII value, in every event logging
schema, not just this one.
See, for example, the returnTo field in the following record from the logs:
{"clientValidated": true, "event": {"campaign": "", "displayMobile": true,
"isSelfMade": true, "returnTo":
"\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a",
"token": "", "userBuckets": "", "userId": 725222, "userName":
"<removed>"}, "recvFrom": "mw1087", "revision": 5487345, "schema":
"ServerSideAccountCreation", "seqId": 53258317, "timestamp": 1389610463,
"uuid": "013953cf77a2585e983b491f2d4a2388", "webHost": "ar.wikipedia.org",
"wiki": "arwiki"}
Encoding in Python 2 is a notorious pain and hard to get right, so fixing
this will mean not just "restoring" records from the logs but also changing
database connection args, bindings and database types. Not a huge deal, but
I just want to point out that fixing the issue goes beyond repopulating the
records.
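To make both halves of this concrete, here is a minimal sketch (Python 3, to
sidestep the Python 2 pitfalls). It uses a shortened copy of the record above.
The helper name extract_strings is made up for illustration, and the failure
mode shown (UTF-8 bytes read back as Latin-1) is only an assumption about how
the values got garbled, not something confirmed for our data.

```python
import json

# Shortened copy of the log record above; only the fields needed here.
RAW_LINE = (
    '{"event": {"returnTo": '
    '"\\u062e\\u0627\\u0635:\\u0645\\u0631\\u0641\\u0648\\u0639\\u0627\\u062a", '
    '"userId": 725222, "userName": "<removed>"}, '
    '"schema": "ServerSideAccountCreation", "wiki": "arwiki"}'
)


def extract_strings(raw_line):
    """Return {field: text} for every string field of the event.

    Hypothetical helper, not part of any existing tooling.
    """
    record = json.loads(raw_line)  # \uXXXX escapes become real Unicode text
    return {k: v for k, v in record["event"].items() if isinstance(v, str)}


strings = extract_strings(RAW_LINE)
return_to = strings["returnTo"]

# ASSUMED failure mode: UTF-8 bytes interpreted as Latin-1 on the way
# to or from the database.
garbled = return_to.encode("utf-8").decode("latin-1")

# That kind of garbling can be undone, provided no bytes were dropped.
recovered = garbled.encode("latin-1").decode("utf-8")

# A lossy path (here: forcing ASCII with replacement) cannot be undone,
# and then the raw logs are the only place the value survives.
lossy = return_to.encode("ascii", errors="replace").decode("ascii")
```

If the stored values look like `garbled`, they might be repairable in place.
If they look like `lossy`, they can only be repopulated from the logs.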
On Fri, Jun 6, 2014 at 1:39 AM, Dario Taraborelli <
[email protected]> wrote:
> I am hoping we can recover the garbled usernames from the raw JSON logs,
> but you’re correct about username changes. For project level counts,
> though, they should not dramatically affect the accuracy of new
> registration numbers.
>
> On Jun 5, 2014, at 3:51 PM, Aaron Halfaker <[email protected]>
> wrote:
>
> Regretfully, looking up a user in Centralauth requires the use of a
> username. Then again, you'd need to join with a user table (with user_id)
> anyway since users can be renamed after they create their account and that
> name change won't be reflected in ServerSideAccountCreation.
>
>
> On Thu, Jun 5, 2014 at 5:47 PM, Steven Walling <[email protected]>
> wrote:
>
>>
>> On Thu, Jun 5, 2014 at 1:24 PM, Dario Taraborelli <
>> [email protected]> wrote:
>>
>>>
>>> • Use event_userId whenever possible
>>
>>
>> This is really a best practice everyone should follow in all analysis.
>> Unless you're qualitatively interested in the contents of usernames, any
>> analysis that uses unique names instead of ids should probably be treated
>> as highly suspect.
>>
>>
>> --
>> Steven Walling,
>> Product Manager
>> https://wikimediafoundation.org/
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>