Felipe, for some context on the work the team is doing to standardize user class definitions and the supporting analysis, check out: https://meta.wikimedia.org/wiki/Research:Newly_registered_user
On Feb 14, 2014, at 9:27 AM, Felipe Ortega <[email protected]> wrote:

> Hello all.
>
> @Tim: By "feature" I mean having values for the user.user_registration
> column filled in on the DB replicas accessible from Tool-Labs, if possible.
> As Oliver has suggested, I don't see any reason for this info not being
> available, as it is already public from Special:ListUsers.
>
> @Aaron: Thanks a lot. I believe that is a fairly decent approximation. In
> fact, I suspect that daily or weekly aggregates would be enough for
> time-series characterization. My actual goal is comparing trends between
> different languages, and eventually correlating them with other known
> activity metrics.
>
> Best regards,
> Felipe.
>
>
> On Friday, February 14, 2014 at 16:00, Aaron Halfaker
> <[email protected]> wrote:
> I have a dataset containing estimated registration dates for editors who
> registered before Dec. 2005. My method assumes that user_id is monotonically
> increasing and sets the lowest upper-bound available.
>
> For example, let's assume the following rows:
>
>   user_id  first_edit
>   12345    20040102030405
>   12344    NULL
>   12343    20040102050102
>
> Since an editor couldn't have saved a revision before registering their
> account, we can assume that user 12345 registered their account on or before
> 20040102030405. If user_id is monotonically increasing, we also know that
> user 12344 must have registered on or before 20040102030405, which lets us
> fill in a NULL. Similarly, we have a first_edit timestamp for user 12343,
> but that edit happened pretty late. We can actually just continue to
> propagate the 20040102030405 timestamp to this user too.
>
> After performing this approximation, we'd have the following rows:
>
>   user_id  first_edit      user_registration_approx
>   12345    20040102030405  20040102030405
>   12344    NULL            20040102030405
>   12343    20040102050102  20040102030405
>
> In effect, this is similar to the approximation discussed in
> https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to
> interpolate probable registration timings on users. In practice we're
> talking about a difference of seconds, so I haven't bothered with the extra
> work.
>
> I'm generating a datafile for English now that I should be able to share by
> the end of the day:
>   user_id
>   registration_type (see
>     https://meta.wikimedia.org/wiki/Research:Attached_user and
>     https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
>   user_registration (from user table)
>   first_edit (lowest timestamp from "revision" and "archive" for user_id)
>   registration_approx (my approximation based on the method described above)
>
> -Aaron
>
>
> On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <[email protected]>
> wrote:
> Felipe Ortega, 14/02/2014 12:05:
>
> Thanks a lot. Then, I look forward to the confirmation and
> implementation of this feature. In case it's better to open a new issue
> on bugzilla or any other action on my side (lend a hand with value
> reviewing/testing) just let me know.
>
> You could help assess the correctness of and/or code the guesstimate method
> proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 , for the
> script to fill further blanks.
>
>
> Nemo
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
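
For anyone who wants to try the approximation quoted above on their own data, here is a minimal sketch of how it could be written. It is only an illustration, not the script used to build the datafile: the function name approximate_registration and the (user_id, first_edit) tuple layout are assumptions made for this example. It walks users from the highest user_id down and carries along the lowest first_edit timestamp seen so far, which is the "lowest upper-bound available" for each user if user_id really is monotonically increasing.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical helper for illustration; not part of any Wikimedia tool.
def approximate_registration(
    rows: Iterable[Tuple[int, Optional[str]]]
) -> Dict[int, Optional[str]]:
    """Map user_id -> approximate registration timestamp.

    Each row is (user_id, first_edit), where first_edit is a 14-digit
    YYYYMMDDHHMMSS string or None when the user has no known edits.
    Assumes user_id increases monotonically with registration time.
    """
    approx: Dict[int, Optional[str]] = {}
    lowest_bound: Optional[str] = None
    # Visit users from the highest user_id to the lowest.
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        # Fixed-width digit strings compare correctly as plain strings.
        if first_edit is not None and (
            lowest_bound is None or first_edit < lowest_bound
        ):
            lowest_bound = first_edit
        # Stays None until some user with an equal or higher id has an edit.
        approx[user_id] = lowest_bound
    return approx


rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registration(rows)
for user_id in sorted(result, reverse=True):
    print(user_id, result[user_id])
```

On the three example rows this gives 20040102030405 for all three users, matching the table in the quoted message. Users with no edits at or above their user_id get no estimate, since no upper bound is available for them.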
