>________________________________
> From: Dario Taraborelli <[email protected]>
>To: Felipe Ortega <[email protected]>; A mailing list for the 
>Analytics Team at WMF and everybody who has an interest in Wikipedia and 
>analytics. <[email protected]> 
>CC: Aaron Halfaker <[email protected]>; Wikimedia Labs 
><[email protected]> 
>Sent: Friday, February 14, 2014 18:48
>Subject: Re: [Analytics] [Labs-l]  User registration date on DB replicas
> 
>
>
>Felipe, for some context on the work the team is doing on standardizing user 
>class definitions and supportive analysis, check out: 
>https://meta.wikimedia.org/wiki/Research:Newly_registered_user
>

Thanks a lot, Dario. This simplifies things a lot, as I already have the 
logging table imported for all Wikipedias in the study.

BTW, regarding the graphs at the end of that page, I instantly recognized 
the plots from the stl() function in R. Did you use s.window = 'periodic' in 
the call? The loess method is fine for a first approximation, but the (daily?) 
time series are fairly noisy in this case, and the fit may be quite sensitive 
to the selected window span. The residuals show some noticeable patterns, e.g. 
in the case of Spanish, which is not a good sign.
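
One quick way to check whether residuals still carry structure is to look at 
their autocorrelation at the suspected period. The following is only a minimal 
sketch: the helper name, the lag of 7 days, and the sample residuals are 
illustrative assumptions, not values taken from the plots on that page.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0  # a constant series has no structure to measure
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Made-up residuals that repeat every 7 days (8 weeks of daily values).
residuals = [1, 0, 0, 0, 0, 0, -1] * 8
weekly_lag = autocorrelation(residuals, 7)
```

A value close to 1 at lag 7 would suggest that a weekly pattern was left in 
the residuals instead of being captured by the seasonal component.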


I'm also adding a comment on the talk page regarding a 4th kind of entry for 
log_type='newusers' in the logging table. At least in German (maybe also in 
other DBs), there are > 80K entries with log_action='newusers' (yes, the same 
value as log_type). It shouldn't make a great difference, but I mention it 
mostly for completeness of the case description.
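
For anyone who wants to check this on their own copy of the logging table, a 
breakdown by log_action could look roughly like the sketch below. It uses an 
in-memory SQLite table with made-up rows as a stand-in for the real database; 
only the column names log_type and log_action come from the logging table, 
everything else is an assumption for illustration.

```python
import sqlite3

# Stand-in for the real logging table, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logging (log_type TEXT, log_action TEXT)")
conn.executemany(
    "INSERT INTO logging (log_type, log_action) VALUES (?, ?)",
    [
        ("newusers", "create"),
        ("newusers", "create"),
        ("newusers", "create2"),
        ("newusers", "autocreate"),
        ("newusers", "newusers"),
        ("newusers", "newusers"),
        ("delete", "delete"),  # unrelated log type, filtered out below
    ],
)

# Count the entries of each log_action within log_type='newusers'.
action_counts = conn.execute(
    """
    SELECT log_action, COUNT(*)
    FROM logging
    WHERE log_type = 'newusers'
    GROUP BY log_action
    ORDER BY log_action
    """
).fetchall()
conn.close()
```

Any log_action value beyond the ones already documented shows up as an extra 
row in the result.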

Best,
Felipe.


>
>On Feb 14, 2014, at 9:27 AM, Felipe Ortega <[email protected]> wrote:
>
>Hello all.
>>
>>@Tim: By "feature" I mean having values for column user.user_registration 
>>filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has 
>>suggested, I don't see any reason for this info not being available, as it is 
>>already public from Special:ListUsers.
>>
>>@Aaron: Thanks a lot. I believe that is a fairly decent approximation. In 
>>fact, I suspect that daily or weekly aggregates would be enough for 
>>time-series characterization. My actual goal is to compare trends between 
>>different languages, and eventually to correlate them with other known 
>>activity metrics.
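
As a rough illustration of the daily and weekly aggregates mentioned above, 
the sketch below counts registrations per calendar day and per ISO week. It 
assumes 14-digit timestamp strings (YYYYMMDDHHMMSS) and skips missing values; 
the function names and sample data are made up for the example.

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamps, e.g. 20040102030405


def daily_counts(timestamps):
    """Count registrations per calendar day, ignoring missing timestamps."""
    return Counter(
        datetime.strptime(ts, TIMESTAMP_FORMAT).date()
        for ts in timestamps
        if ts is not None
    )


def weekly_counts(timestamps):
    """Count registrations per (ISO year, ISO week), ignoring missing ones."""
    counts = Counter()
    for ts in timestamps:
        if ts is None:
            continue
        iso = datetime.strptime(ts, TIMESTAMP_FORMAT).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


# Made-up sample: two registrations on one day, one a few days later, one NULL.
sample = ["20040102030405", "20040102030405", "20040105120000", None]
per_day = daily_counts(sample)
per_week = weekly_counts(sample)
```

Series built this way for each language can then be compared directly, since 
they share the same day or week keys.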
>>
>>Best regards,
>>Felipe.
>>
>>
>>
>>
>>
>>
>>On Friday, February 14, 2014 at 16:00, Aaron Halfaker 
>><[email protected]> wrote:
>> 
>>I have a dataset containing estimated registration dates for editors who 
>>registered before Dec. 2005.  My method assumes that user_id is monotonically 
>>increasing and sets the lowest upper-bound available.  
>>>
>>>
>>>For example, let's assume the following rows:
>>>
>>>
>>>    user_id    first_edit
>>>    12345      20040102030405  
>>>    12344      NULL
>>>    12343      20040102050102
>>>
>>>
>>>Since an editor couldn't have saved a revision before registering their 
>>>account, we can assume that user 12345 registered their account on or 
>>>before 20040102030405.  If user_id is monotonically increasing, we also 
>>>know that user 12344 must have registered on or before 20040102030405, 
>>>which lets us fill in a NULL.  Similarly, we have a first_edit timestamp 
>>>for user 12343, but that edit happened pretty late.  We can actually just 
>>>continue to propagate the 20040102030405 timestamp to this user too.
>>>
>>>
>>>After performing this approximation, we'd have the following rows:
>>>
>>>
>>>    user_id    first_edit        user_registration_approx
>>>    12345      20040102030405    20040102030405
>>>    12344      NULL              20040102030405
>>>    12343      20040102050102    20040102030405
>>>
>>>
>>>In effect, this is similar to the approximation discussed in 
>>>https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to 
>>>interpolate probable registration timings on users.  In practice we're 
>>>talking about a difference of seconds, so I haven't bothered with the extra 
>>>work.  
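
The propagation described above can be sketched as follows. This is only an 
illustration of the idea under the stated assumption that user_id is 
monotonically increasing, not the script actually used to build the dataset; 
the function name and the tuple layout are made up for the example.

```python
def approximate_registration(rows):
    """Propagate the lowest first_edit seen so far down to lower user_ids.

    rows: iterable of (user_id, first_edit), where first_edit is a 14-digit
    timestamp string or None. Returns (user_id, first_edit, approx) tuples
    ordered by descending user_id.
    """
    result = []
    lowest_bound = None  # lowest first_edit among users with an id >= current
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        # Fixed-width timestamps compare correctly as plain strings.
        if first_edit is not None and (
            lowest_bound is None or first_edit < lowest_bound
        ):
            lowest_bound = first_edit
        result.append((user_id, first_edit, lowest_bound))
    return result


rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
approx = approximate_registration(rows)
```

With the three example rows, every user ends up with 20040102030405 as the 
approximate registration time. Users with the highest ids and no edits at all 
keep None, because no upper bound is available for them.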
>>>
>>>
>>>I'm generating a datafile for English now that I should be able to share by 
>>>the end of the day, with the following columns:
>>>    * user_id
>>>    * registration_type  (see 
>>>https://meta.wikimedia.org/wiki/Research:Attached_user and 
>>>https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
>>>    * user_registration (from user table)
>>>    * first_edit (lowest timestamp from "revision" and "archive" for user_id)
>>>    * registration_approx (my approximation based on the method described 
>>>above)
>>>-Aaron
>>>
>>>
>>>
>>>On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <[email protected]> 
>>>wrote:
>>>
>>>Felipe Ortega, 14/02/2014 12:05:
>>>>
>>>>
>>>>Thanks a lot. Then, I look forward to the confirmation and
>>>>>implementation of this feature. In case it's better to open a new issue
>>>>>on bugzilla or any other action on my side (lend a hand with value
>>>>>reviewing/testing) just let me know.
>>>>>
>>>>
>>>>You could help assess the correctness of and/or code the guesstimate method 
>>>>proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 , for the 
>>>>script to fill further blanks.
>>>>
>>>>
>>>>Nemo
>>>>
>>>>_______________________________________________
>>>>Labs-l mailing list
>>>>[email protected]
>>>>https://lists.wikimedia.org/mailman/listinfo/labs-l
>>>>
>>>
>>>
>>>_______________________________________________
>>Analytics mailing list
>>[email protected]
>>https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
>
