Hi Matthew,

Thanks for the suggestions. The tapply in the code below transforms the table from long to wide format, with wdpaint as columns and pnvid as rows. The main reason I use it is that the result includes all combinations of the two variables, including those with 0 observations, so every permutation yields a table of the same shape. The code you are suggesting does indeed seem to be the same as ordering the table.
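To illustrate the difference, here is a minimal sketch in base R. It assumes a made-up long-format count table b (toy values, not the SPFn data) in which one combination of V1 and V2 never occurs. As far as I can tell, tapply leaves the unobserved combination as NA rather than 0, so the sketch sets it to 0 explicitly:

```r
# Toy long-format counts (hypothetical values, not the SPFn data):
# the combination V1 = 2, V2 = 1 does not occur.
b <- data.frame(V1 = c(1, 1, 2),
                V2 = c(1, 2, 2),
                N  = c(3, 2, 5))

# Wide format: V2 (pnvid) as rows, V1 (wdpaint) as columns.
wide <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)

# Every combination gets a cell; the unobserved one is NA,
# so replace it with 0 to get a proper count.
wide[is.na(wide)] <- 0

# Ordering the long table keeps only the observed combinations.
ordered_counts <- b[order(b$V2, b$V1), "N"]

length(wide)            # one cell per combination of V2 and V1
length(ordered_counts)  # one value per observed combination only
```

Because the wide tables always have the same dimensions, they can be stacked with rbind across permutations, whereas the ordered vectors can differ in length from one shuffle to the next.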
Cheers,

Paulo

On Fri, Jun 22, 2012 at 4:55 PM, Matthew Dowle <[email protected]> wrote:

>
> Great. Thanks for keeping the list updated.
>
> One thing I don't quite see, instead of :
>
> for (i in 1:12) {
>   a3 <- a1[,V1:=sample(a2,replace=F)]
>   b <- a3[,.N,by=list(V1,V2)]
>   c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> }
>
> why not :
>
> for (i in 1:12){
>   a3 <- a1[,V1:=sample(a2,replace=F)]
>   b <- a3[,.N,by=list(V1,V2)]
>   b2 <- b[,sum(N),by=list(V2,V1)]
>   c[[i]] <- b2$V1
> }
>
> Idea being to save the tapply and the 2 as.factor. Further, I'm not sure
> that sum() will be summing anything will it? Isn't b2 the same as
> b[order(V2,V1)], and if so that will be faster still?
>
> Matthew
>
>
> > I got some very useful further feedback from Matthew. Let me summarize
> > some key points from his suggestions concerning the code below:
> >
> > The following code is still fairly slow (although faster than using
> > table or tapply):
> >
> > a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
> > b <- a[,.N,by=list(V1,V2)]
> > c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> > for(i in 1:11){
> >   a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
> >   b <- a[,.N,by=list(V1,V2)]
> >   c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))
> > }
> >
> > As pointed out by Matthew, the rbind at the end of the loop will be
> > growing memory use and is generally inefficient. How badly it is
> > impacting performance will depend on the data size though. So step 1 is
> > to get that outside the loop (a useful link he provided is
> > http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r).
> >
> > Based on a hint in R-inferno
> > (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code
> > as follows:
> >
> > c <- vector('list', 12)
> > a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
> > a2 <- as.integer(SPFn$wdpaint)
> > for(i in 1:12){
> >   a3 <- a1[,V1:=sample(a2,replace=F)]
> >   b <- a3[,.N,by=list(V1,V2)]
> >   c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> > }
> > c <- do.call('rbind', c)
> >
> > This did improve the run time, but only a little bit (16.0 instead of
> > 16.4 seconds). The next step was to profile the code, to see which part
> > is taking the most time. This can be done with Rprof(). The results
> > showed that ordernumtol, a data.table function which sorts numeric
> > ('double' floating point) columns, was taking a lot of time. As it turns
> > out, SPFn$wdpaint and SPFn$pnvid were both numeric. Changing these to
> > integer does speed up the code a lot.
> >
> > c <- vector('list', 12)
> > a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
> > a2 <- as.integer(SPFn$wdpaint)
> > for(i in 1:12){
> >   a3 <- a1[,V1:=sample(a2,replace=F)]
> >   b <- a3[,.N,by=list(V1,V2)]
> >   c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> > }
> > c <- do.call('rbind', c)
> >
> > The second code took 16.0 seconds. The last attempt took only 2.4
> > seconds! That is a serious (> 6x) improvement. And it shows I really
> > need to be much more careful about my variables...
> > I checked and it also makes a smaller, but still very significant
> > difference when using table (3x) or tapply (2x).
> >
> > Big thanks to Matthew Dowle for all his help.. and any further
> > suggestions for improvements are obviously welcome.
> >
> > Cheers,
> >
> > Paulo
> >
> >
> > On 06/19/2012 04:24 PM, Matthew Dowle wrote:
> >> The shuffling can form a different number of groups can't it?
> > YES, obvious..
> > I was half asleep I guess
> >>
> >> table(c(1,1,2,2), c(3,3,4,4)) # 2 groups
> >> table(c(2,2,1,1), c(3,3,4,4)) # 2 groups
> >> table(c(2,1,2,1), c(3,3,4,4)) # 4 groups
> >>
> >>
> >>> Thanks Matthew
> >>>
> >>> I am not sure I understand the code (actually, I am sure I do not :-( ).
> >>> More specifically, I would expect the two expressions below to yield
> >>> tables of the same dimension (basically all combinations of wdpaint
> >>> and pnvid):
> >>>
> >>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
> >>> dim(aa)
> >>>> 254 3
> >>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
> >>> dim(bb)
> >>>> 170 3
> >>> What I am looking for is creating a cross table of pnvid and wdpaint,
> >>> i.e., the frequency or number of occurrences of each combination of
> >>> pnvid and wdpaint. Shuffling wdpaint should give in that case a
> >>> different frequency distribution, like in the example below:
> >>>
> >>> table(c(1,1,2,2), c(3,3,4,4))
> >>> table(c(2,2,1,1), c(3,3,4,4))
> >>>
> >>> Basically what I want to do is run X permutations on a data set which
> >>> I will then use to create a confidence interval on the frequency
> >>> distribution of sample points over wdpaint and pnvid
> >>>
> >>> Cheers,
> >>>
> >>> Paulo
> >>>
> >>>
> >>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle
> >>> <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Welcome to the list.
> >>>>
> >>>> Rather than picking a column and calling length() on it, .N is a
> >>>> little more convenient (and faster if that column isn't otherwise
> >>>> used, as in this example). Search ?data.table for the string ".N" to
> >>>> find out more.
> >>>>
> >>>> And to group by expressions of column names, wrap with list(). So,
> >>>>
> >>>> SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
> >>>>
> >>>> But that won't calculate any different statistics, just return the
> >>>> groups in a different order.
> >>>> Seems like just an example, rather than the real task, iiuc, which
> >>>> is fine of course.
> >>>>
> >>>> Matthew
> >>>>
> >>>>
> >>>>> Hi, I am new to this package and not sure how to implement the
> >>>>> sample() function with data.table.
> >>>>>
> >>>>> I have a data frame SPF with three columns cat, pnvid and wdpaint.
> >>>>> The pnvid variable has values 1:3, the wdpaint variable has values
> >>>>> 1:10. I am interested in the count of all combinations of wdpaint
> >>>>> and pnvid in my data set, which can be calculated using table or
> >>>>> tapply (I use the latter in the example code below).
> >>>>>
> >>>>> Normally I would use something like:
> >>>>>
> >>>>> c <- tapply(SPF$cat, list(as.factor(SPF$pnvid), as.factor(SPF$wdpaint)),
> >>>>>             function(x) length(x))
> >>>>>
> >>>>> If I understand correctly, I would use the below when working with
> >>>>> data tables:
> >>>>>
> >>>>> f <- SPF[,length(cat),by="wdpaint,pnvid"]
> >>>>>
> >>>>> But what if I want to reshuffle the column wdpaint first? When using
> >>>>> tapply, it would be something along the lines of:
> >>>>>
> >>>>> a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint, replace=F)))
> >>>>> c <- tapply(SPF$cat, a, function(x) length(x))
> >>>>>
> >>>>>
> >>>>> But how to do this with data.table?
> >>>>>
> >>>>> Paulo
> >>>>> _______________________________________________
> >>>>> datatable-help mailing list
> >>>>> [email protected]
> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
