Hi Steve and Matthew,

Very helpful solutions indeed! Thanks a lot.

I played around a little with all your valuable suggestions. To me it seems that the simplest one-step solution that handles ties the way I had hoped for is:

DT[, cmx := rank(p, ties.method="max"), by=grp]

--Philip

On Nov 19, 2012, at 11:54 PM, Matthew Dowle wrote:

> On 18.11.2012 20:03, Steve Lianoglou wrote:
>> Hi,
>>
>> On Sun, Nov 18, 2012 at 11:19 AM, Philip de Witt Hamer
>> <[email protected]> wrote:
>>> Dear all,
>>>
>>> data.table is great! thanks for this life(time)saving package.
>>>
>>> Now, I run into a difficult nut to crack using ':='.
>>> I'd like to do a calculation using column information conditional on another
>>> column
>>>
>>> first some jumbo data:
>>>
>>> library(data.table)
>>> DT <- data.table(
>>>   1:50,
>>>   rep(1:5,each=10),
>>>   runif(50,0,1)
>>> )
>>> setnames(DT, 1:3, c("id","grp","p"))
>>>
>>> id's are unique
>>> grp's speaks for itself
>>> think of p's as e.g. p-values
>>>
>>> next, if I want to obtain the nr of p values at least as extreme as the p of
>>> each row from the whole set, this seems to work well:
>>>
>>> DT[,c1 := sum(DT[,p] <= p), by=id]
>>>
>>> but then, I would like to get the nr of p values at least as extreme as the
>>> p of each row for the subset with identical grp, I am having a hard time,
>>> because these attempts fail:
>>>
>>> DT[,c2 := sum(DT[grp,p] <= p),by=id]
>>> DT[,c3 := sum(DT[DT[,grp]==grp,p] <= p), by=id]
>>
>> You will want to group by "grp".
>>
>> This gets you pretty close -- it fails the "ties" criterion:
>>
>> DT[, cg := rank(p) - 1, by=grp]
>>
>> If you *really* want to keep the ties criterion, perhaps here's a way
>> to do so by avoiding a for loop:
>>
>> DT[, cgo := rowSums(outer(p, p, '-') > 0), by=grp]
>>
>> The problem is that if your groups are very large, the `outer` call
>> might chew lots of RAM, since you'll be creating a p x p matrix (per
>> group).
>>
>> Does that get you where you need to be?
>>
>> -steve
>
>
> Grouping by grp feels right to me, too. How about :
>
> setkey(DT,grp,p)
>
> and then using the ordered p within each group :
>
> DT[,c1:=seq_len(.N),by=grp]
> DT[,c1:=max(c1),by='grp,p'] # to deal with ties
>
> NB: data.table grouping of numerics is machine tolerance aware. So
> this ties treatment is more like sum(DT[,p] <= p+tol) which may or
> may not be what you need. tol = .Machine$double.eps ^ 0.5.
>
> Or, staying with the self join approach, one trick for the scoping
> issue you hit is :
>
> DT[,c3:={i=list(grp);sum(DT[i,p]<=p)},by=id]
>
> Where the DT[i,...] part relies on the fact that single name i is evaluated
> in calling scope.
>
> Or another way in one step is :
>
> DT[,c3:=sum(DT[eval(.(grp)),p]<=p),by=id]
>
> which uses the feature that eval() is already like what ..() will do in
> future.
>
> But grouping by grp should be much faster and cleaner, if possible.
>
> Matthew
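
For anyone reading along later, here is a minimal sketch that checks the one-step rank(p, ties.method="max") approach against a brute-force count of "p values at least as extreme" within each group. The seed, the rounding of p (only there to force ties), and the column names cmx and ref are assumptions made for illustration; they are not part of the original example data.

```r
library(data.table)

set.seed(1)  # assumed seed, only so the sketch is reproducible
DT <- data.table(
  id  = 1:50,
  grp = rep(1:5, each = 10),
  p   = round(runif(50, 0, 1), 1)  # rounded to force ties (illustrative only)
)

# One-step approach: rank within each group, tied values all get the max rank
DT[, cmx := rank(p, ties.method = "max"), by = grp]

# Brute-force reference: for each row, count group members with p <= this row's p
DT[, ref := sapply(p, function(x) sum(p <= x)), by = grp]

# Compare the two columns row by row
all_equal <- DT[, all(cmx == ref)]
print(all_equal)
```

The brute-force column does the same O(n^2) work per group as the outer() suggestion, so it is only meant as a check on small data, not as a replacement for the grouped rank().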
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
