>________________________________
> From: Dario Taraborelli <[email protected]>
>To: Felipe Ortega <[email protected]>; A mailing list for the
>Analytics Team at WMF and everybody who has an interest in Wikipedia and
>analytics. <[email protected]>
>CC: Aaron Halfaker <[email protected]>; Wikimedia Labs
><[email protected]>
>Sent: Friday, 14 February 2014 18:48
>Subject: Re: [Analytics] [Labs-l] User registration date on DB replicas
>
>Felipe, for some context on the work the team is doing on standardizing user
>class definitions and supportive analysis, check out:
>https://meta.wikimedia.org/wiki/Research:Newly_registered_user
>
Thanks a lot, Dario. This simplifies things a lot, as I already have the logging table imported for all Wikipedias in the study.

BTW, regarding the graphs at the end of that page, I instantly recognized the plots from the stl() function in R. Did you use s.window = 'periodic' in the call? The loess method is fine for a first approximation, but the (daily?) time series are fairly noisy in this case, and it may be quite sensitive to the selected window span. The residuals show some noticeable patterns, e.g. in the case of Spanish (not a good thing).

I'm also adding a comment on the talk page regarding a 4th type of entry for log_type='newusers' in logging. At least in German (maybe also in other DBs), there are > 80K entries with log_action='newusers' (yes, same as log_type). It shouldn't make a great difference; I mention it mostly for completeness in the case description.

Best,
Felipe.

>
>On Feb 14, 2014, at 9:27 AM, Felipe Ortega <[email protected]> wrote:
>
>Hello all.
>>
>>@Tim: By "feature" I mean having values for column user.user_registration
>>filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has
>>suggested, I don't see any reason for this info not being available, as it is
>>already public from Special:ListUsers.
>>
>>@Aaron: Thanks a lot. I believe that is a fairly decent approximation. In
>>fact, I suspect that daily or weekly aggregates would be enough for
>>time-series characterization. My actual goal is comparing trends between
>>different languages, and eventually correlation with other known activity
>>metrics.
>>
>>Best regards,
>>Felipe.
>>
>>
>>On Friday, 14 February 2014 16:00, Aaron Halfaker
>><[email protected]> wrote:
>>
>>I have a dataset containing estimated registration dates for editors who
>>registered before Dec. 2005. My method assumes that user_id is monotonically
>>increasing and sets the lowest upper-bound available.
>>>
>>>
>>>For example.
>>>Let's assume the following rows:
>>>
>>>   user_id   first_edit
>>>   12345     20040102030405
>>>   12344     NULL
>>>   12343     20040102050102
>>>
>>>Since an editor couldn't have saved a revision before registering their
>>>account, we can assume that user 12345 registered their account on or before
>>>20040102030405. If user_id is monotonically increasing, we also know that
>>>user 12344 must have registered on or before 20040102030405, which lets us
>>>fill in a NULL. Similarly, we have a first_edit timestamp for user 12343,
>>>but that edit happened pretty late. We can actually just continue to
>>>propagate the 20040102030405 timestamp to this user too.
>>>
>>>After performing this approximation, we'd have the following rows:
>>>
>>>   user_id   first_edit       user_registration_approx
>>>   12345     20040102030405   20040102030405
>>>   12344     NULL             20040102030405
>>>   12343     20040102050102   20040102030405
>>>
>>>In effect, this is similar to the approximation discussed in
>>>https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to
>>>interpolate probable registration timings on users. In practice we're
>>>talking about a difference of seconds, so I haven't bothered with the extra
>>>work.
>>>
>>>I'm generating a datafile for English now that I should be able to share by
>>>the end of the day:
>>> * user_id
>>> * registration_type (see
>>>https://meta.wikimedia.org/wiki/Research:Attached_user and
>>>https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
>>> * user_registration (from user table)
>>> * first_edit (lowest timestamp from "revision" and "archive" for user_id)
>>> * registration_approx (my approximation based on the method described
>>>above)
>>>-Aaron
>>>
>>>
>>>On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <[email protected]>
>>>wrote:
>>>
>>>Felipe Ortega, 14/02/2014 12:05:
>>>>
>>>>Thanks a lot. Then, I look forward to the confirmation and
>>>>>implementation of this feature.
>>>>>In case it's better to open a new issue
>>>>>on bugzilla or any other action on my side (lend a hand with value
>>>>>reviewing/testing) just let me know.
>>>>>
>>>>You could help assess the correctness of and/or code the guesstimate
>>>>method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 ,
>>>>for the script to fill further blanks.
>>>>
>>>>Nemo
>>>>
>>>>_______________________________________________
>>>>Labs-l mailing list
>>>>[email protected]
>>>>https://lists.wikimedia.org/mailman/listinfo/labs-l
>>>>
>>>
>>>_______________________________________________
>>Analytics mailing list
>>[email protected]
>>https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
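
For anyone who wants to check the approximation on their own data, here is a minimal sketch of the lowest-upper-bound propagation Aaron describes in the quoted message. It is only an illustration, not his actual script: the function name approximate_registration and the (user_id, first_edit) tuple layout are assumptions of mine, and it relies on the same premise that user_id increases monotonically with registration time. Timestamps are taken to be fixed-width MediaWiki strings (YYYYMMDDHHMMSS), so plain string comparison orders them correctly.

```python
from typing import Dict, Iterable, Optional, Tuple


def approximate_registration(
    rows: Iterable[Tuple[int, Optional[str]]],
) -> Dict[int, Optional[str]]:
    """Map each user_id to the lowest known upper bound on its registration time.

    rows: (user_id, first_edit) pairs, where first_edit is a
    YYYYMMDDHHMMSS string or None when the user never edited.
    """
    approx: Dict[int, Optional[str]] = {}
    lowest_bound: Optional[str] = None
    # Walk from the highest user_id down, so every bound found so far
    # also applies to all accounts that registered earlier (lower ids).
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        if first_edit is not None and (
            lowest_bound is None or first_edit < lowest_bound
        ):
            lowest_bound = first_edit
        approx[user_id] = lowest_bound
    return approx


# The three example rows from the quoted message.
rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
approx = approximate_registration(rows)
for user_id in sorted(approx, reverse=True):
    print(user_id, approx[user_id])
```

With the three example rows, every user ends up with 20040102030405, as in the second table of the quoted message. Accounts at the very top of the id range that have no edits of their own and no later account with an edit stay None, since no upper bound exists for them.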
