OK, so the dataset I described above will be located here within a few minutes: http://stat1001.wikimedia.org/public-datasets/analytics/new_user_info.enwiki.tsv
However, there's an issue I didn't foresee. It looks like some rows in the archive table have dubious timestamps, and they are causing problems with relying on first_edit. I think I'm going to take another pass where I disregard archive edits to see if it ends up producing a more sane result.

-Aaron

On Fri, Feb 14, 2014 at 11:48 AM, Dario Taraborelli <[email protected]> wrote:

> Felipe, for some context on the work the team is doing on standardizing
> user class definitions and supportive analysis, check out:
> https://meta.wikimedia.org/wiki/Research:Newly_registered_user
>
> On Feb 14, 2014, at 9:27 AM, Felipe Ortega <[email protected]>
> wrote:
>
> Hello all.
>
> @Tim: By "feature" I mean having values for column user.user_registration
> filled for DB replicas accessible from Tool-Labs, if possible. As Oliver
> has suggested, I don't see any reason for this info not being available, as
> it is already public from Special:ListUsers.
>
> @Aaron: Thanks a lot. I believe that is a fairly decent approximation. In
> fact, I suspect that daily or weekly aggregates would be enough for
> time-series characterization. My actual goal is comparing trends between
> different languages, and eventually correlation with other known activity
> metrics.
>
> Best regards,
> Felipe.
>
> On Friday, 14 February 2014 at 16:00, Aaron Halfaker <[email protected]>
> wrote:
>
> I have a dataset containing estimated registration dates for editors who
> registered before Dec. 2005. My method assumes that user_id is
> monotonically increasing and sets the lowest upper-bound available.
>
> For example, let's assume the following rows:
>
> user_id  first_edit
> 12345    20040102030405
> 12344    NULL
> 12343    20040102050102
>
> Since an editor couldn't have saved a revision before registering their
> account, we can assume that user 12345 registered their account on or
> before 20040102030405. If user_id is monotonically increasing, we also
> know that user 12344 must have registered on or before 20040102030405,
> which lets us fill in a NULL. Similarly, we have a first_edit timestamp
> for user 12343, but that edit happened pretty late. We can actually just
> continue to propagate the 20040102030405 timestamp to this user too.
>
> After performing this approximation, we'd have the following rows:
>
> user_id  first_edit      user_registration_approx
> 12345    20040102030405  20040102030405
> 12344    NULL            20040102030405
> 12343    20040102050102  20040102030405
>
> In effect, this is similar to the approximation discussed in
> https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying
> to interpolate probable registration timings on users. In practice we're
> talking about a difference of seconds, so I haven't bothered with the extra
> work.
>
> I'm generating a datafile for English now that I should be able to share
> by the end of the day:
>
> - user_id
> - registration_type (see
>   https://meta.wikimedia.org/wiki/Research:Attached_user and
>   https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
> - user_registration (from user table)
> - first_edit (lowest timestamp from "revision" and "archive" for
>   user_id)
> - registration_approx (my approximation based on the method described
>   above)
>
> -Aaron
>
> On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo)
> <[email protected]> wrote:
>
> Felipe Ortega, 14/02/2014 12:05:
>
> Thanks a lot. Then, I look forward to the confirmation and
> implementation of this feature. In case it's better to open a new issue
> on bugzilla or any other action on my side (lend a hand with value
> reviewing/testing) just let me know.
>
> You could help assess the correctness of and/or code the guesstimate
> method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 ,
> for the script to fill further blanks.
>
> Nemo
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
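
For anyone who wants to experiment with the approximation described in the quoted message, here is a minimal Python sketch. It is an illustration, not the script used to build the dataset. The function name approximate_registration and the (user_id, first_edit) tuple layout are assumptions made for the example. It also assumes that first_edit is either None (for NULL) or a 14-digit YYYYMMDDHHMMSS string, so that plain string comparison orders timestamps correctly.

```python
from typing import Dict, Iterable, Optional, Tuple


def approximate_registration(
    rows: Iterable[Tuple[int, Optional[str]]]
) -> Dict[int, Optional[str]]:
    """Map each user_id to the lowest first_edit seen at that id or any higher id.

    Hypothetical helper: assumes user_id is monotonically increasing with
    registration time, so an edit by a later-registered user is an upper
    bound on the registration time of every earlier-registered user.
    """
    approx: Dict[int, Optional[str]] = {}
    lowest: Optional[str] = None  # tightest upper bound found so far

    # Walk from the highest user_id down, carrying the lowest timestamp along.
    for user_id, first_edit in sorted(rows, key=lambda row: row[0], reverse=True):
        if first_edit is not None and (lowest is None or first_edit < lowest):
            lowest = first_edit
        # Stays None if no user at this id or above has ever edited.
        approx[user_id] = lowest

    return approx


if __name__ == "__main__":
    example = [
        (12345, "20040102030405"),
        (12344, None),
        (12343, "20040102050102"),
    ]
    for user_id, bound in sorted(approximate_registration(example).items(), reverse=True):
        print(user_id, bound)
```

Run on the three example rows from the quoted message, this assigns 20040102030405 to all three users, matching the table there. Because the running minimum is carried downward, a single erroneously early first_edit would propagate to every lower user_id, which is one way dubious timestamps could distort the result.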
