Hi, many thanks for your reply, the Cross Join bit is very useful to know!
Thanks, Jelmer On Fri, Jul 15, 2011 at 20:36, Matthew Dowle <[email protected]> wrote: > Hi, > > Welcome! > > Yes, that's safe to assume. Others have asked that in the past, so I've > now added this text to ?data.table (under 'by'): "The order of the rows > within each group is preserved." > > The tmp.keys operation is a common one, so there's a short cut for it. > We call it a Cross Join. If you run the CJ() bit on it's own you'll find > it's the same as tmp.keys : > >> DT[CJ(id.vec,time.vec)] > id time value > [1,] 1 1 7 > [2,] 1 2 6 > [3,] 1 3 NA > [4,] 1 4 5 > [5,] 1 5 4 > [6,] 2 1 NA > [7,] 2 2 3 > [8,] 2 3 2 > [9,] 2 4 1 > [10,] 2 5 NA >> > > Rather than filling the gaps with NA, you can roll the prevailing > observation forward : > >> DT[CJ(id.vec,time.vec),roll=TRUE] > id time value > [1,] 1 1 7 > [2,] 1 2 6 > [3,] 1 3 6 > [4,] 1 4 5 > [5,] 1 5 4 > [6,] 2 1 NA # correct not to be 4 (new group) > [7,] 2 2 3 > [8,] 2 3 2 > [9,] 2 4 1 > [10,] 2 5 1 >> > > and that's useful because, now, instead of lagging in the way you did > it, you can just subtract one day from your dates or times, and find the > prevailing record then. It is subtly different of course, it depends on > what you want and what assumptions you make about the data. > >> DT[CJ(id.vec,time.vec-1),roll=TRUE] > id time value > [1,] 1 0 NA > [2,] 1 1 7 > [3,] 1 2 6 > [4,] 1 3 6 > [5,] 1 4 5 > [6,] 2 0 NA > [7,] 2 1 NA > [8,] 2 2 3 > [9,] 2 3 2 > [10,] 2 4 1 >> > > On your final line I didn't see why you chained it into two queries, but > fortunately I tested before posting and it seems there is a bug : > >> DT[ tmp.keys, # removed your "][," here > + list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ] > Error in abs(j) : Non-numeric argument to mathematical function > In addition: Warning message: > In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL' >> > > That should work fine, but doesn't. I'll raise it as a bug ... > > So, not bad going for a first post! > > Thanks, > Matthew > > > On Fri, 2011-07-15 at 11:33 +0200, Jelmer Ypma wrote: >> Dear all, >> >> I found out about the data.table package yesterday, and I've been very >> impressed with the speed so far. >> >> I'm currently trying to calculate lagged values of a variable for some >> panel data set. The approach I take is the one below, but I'm not sure >> whether that is correct, and whether it is safe to do in this way >> (e.g. if data is re-ordered). >> >> Hopefully, one of you will be able to guide me in the right direction. >> >> Many thanks in advance, >> Jelmer >> >> # create some data, with missing periods >> DT <- data.table( 'id'=c(1,1,1,1,2,2,2), 'time'=c(1,2,4,5,2,3,4), >> 'value'=c(7,6,5,4,3,2,1) ) >> setkey( DT, id, time ) >> >> # extract id and time values to construct data.table with keys, >> including missing periods >> id.vec <- unique( DT$id ) >> time.vec <- unique( DT$time ) >> num.periods <- max( time.vec ) - min( time.vec ) + 1 >> tmp.keys <- data.table( 'id'=rep( id.vec, each=num.periods ), >> 'time'=rep( time.vec, length(id.vec) ) ) >> setkey( tmp.keys, id, time ) >> >> # show data >> DT # original data >> DT[ tmp.keys ] # fill in missing >> periods with NA >> >> # shift 'value' column by id to get lagged 'value' >> DT[ tmp.keys ][, >> list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ] >> _______________________________________________ >> datatable-help mailing list >> [email protected] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
