If you have ever used the ServerSideAccountCreation log to query cross-wiki 
account registrations and relied on the event_userName field, please be aware 
of two issues we recently discovered. 

• Non-ASCII characters in usernames are garbled and replaced with question 
marks. To mention just the most frequent examples, we have 25K account creation 
events with username “???” and 21K registrations with username “????”. [1] 
Counting distinct usernames will therefore underreport the actual number of 
accounts created, especially for projects with a large proportion of non-ASCII 
usernames. 

• There’s a large number of new users registering with the same username on 
multiple projects, which seems to violate the principle that all new accounts 
are unified by default. These users don’t have a record in 
centralauth.globaluser and as a result they are treated as non-unified 
accounts. [2]

For these reasons, and until these issues are addressed, you should not 
assume that there is exactly one event per newly registered user globally.
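
To make the first issue concrete, here is a minimal Python sketch of how 
counting by username can undercount. The rows are invented for illustration; 
only the field names event_userId and event_userName come from the log, and 
the wiki column is an assumption about how the project is identified.

```python
# Hypothetical sample rows: (wiki, event_userId, event_userName).
# The usernames are shown as they would look after non-ASCII characters
# have been replaced with question marks.
SAMPLE_EVENTS = [
    ("ruwiki", 101, "???"),
    ("ruwiki", 102, "???"),
    ("jawiki", 55, "????"),
    ("enwiki", 9001, "Alice"),
]


def count_by_username(events):
    """Count distinct usernames; garbled names collapse into one value."""
    return len({name for _wiki, _user_id, name in events})


def count_by_user_id(events):
    """Count distinct (wiki, user id) pairs; unaffected by garbled names."""
    return len({(wiki, user_id) for wiki, user_id, _name in events})


by_name = count_by_username(SAMPLE_EVENTS)
by_id = count_by_user_id(SAMPLE_EVENTS)
print(by_name, by_id)
```

In this toy sample the two accounts logged as “???” are merged when counting 
by name but stay separate when counting by user ID.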

How to avoid these problems:

• Use event_userId whenever possible

• When querying across projects, JOIN globaluser so that you don’t count the 
same user multiple times. The new analytics-store allows you to do that for any 
MediaWiki DB or EventLogging log, which is pretty awesome.
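
The deduplication idea in the second point can be sketched in Python as 
follows. This is only an illustration under stated assumptions: the 
global_ids dictionary stands in for whatever the JOIN against globaluser 
returns, and how that JOIN is keyed is not specified here. All names and 
values below are hypothetical.

```python
def count_unique_users(events, global_ids):
    """Count registrations, merging events that map to one global account.

    events:     iterable of (wiki, user_id) pairs, one per registration event
    global_ids: dict mapping (wiki, user_id) to a global account id
                (hypothetical stand-in for the result of joining globaluser)
    """
    seen = set()
    for wiki, user_id in events:
        global_id = global_ids.get((wiki, user_id))
        if global_id is not None:
            # Unified account: one key regardless of the wiki.
            seen.add(("global", global_id))
        else:
            # No global record: the account can only be identified per wiki.
            seen.add(("local", wiki, user_id))
    return len(seen)


# Hypothetical data: the enwiki and dewiki events belong to one global
# account; the other two events have no global record.
events = [("enwiki", 1), ("dewiki", 7), ("frwiki", 3), ("eswiki", 4)]
global_ids = {("enwiki", 1): 500, ("dewiki", 7): 500}

unique_users = count_unique_users(events, global_ids)
print(unique_users)
```

Note that events without a global record are still counted once per wiki, so 
accounts affected by the second issue may remain overcounted until it is 
fixed.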

Dario

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=66101
_______________________________________________
Analytics mailing list
https://lists.wikimedia.org/mailman/listinfo/analytics
