Hi Dennis, I didn't see your post before I sent my latest reply.
Nice detective work! For what it's worth, from what I understand your "punchline"/kewpie-prize solution is so much faster because it avoids building the .SD data.table within each group. I'll let Matthew leave a more detailed comment, since he's (obviously) much more intimately familiar w/ the inner voodoo of data.table. But as a last comment -- if the speed differences are so drastic because of the cost of creating the .SD data.table, maybe we should think about taking some "inspiration" from plyr and define a similar `colwise` function -- which would operate across each "column" of supposedly-build .SD object applying a function to each of them w/o actually building an .SD object itself. -steve On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <[email protected]> wrote: > Hi: > > <A bunch snipped because I get the archives in digest form> > > Re Prof. Voelkel's recent posts: > > (1) Quoting does not work well in data.table; this is mentioned in > several of the FAQs. Apropos to this discussion, some of the relevant > ones include 1.2, 1.6 and 2.1; there may be others :) > > (2) Steve's response seems to be the right way to go (although see > below), but I thought I'd up the stakes a little and assume that Prof. > Voelkel has a large number of variables, only a subset of which he may > want summarized in a particular go. To that end, I created the > following toy data frame cum data.table; this is as much for my own > edification as anyone else's (which explains the eventual length of > this post...I got curious :) > > This goes against the advice given in the first example of the > data.table wiki, but if you have, say, 100 variables to select out of > a possible 1000, it doesn't make sense to list them individually as > recommended on the wiki. (But see below...) > > library('data.table') > set.seed(1043) > m <- matrix(rpois(240, 10), nrow = 6) > colnames(m) <- paste('A', 1:40, sep = '') > m <- as.data.frame(m) > dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x') > dim(dt2) > # [1] 6 42 ...so far, so good > > # Subset of variables for which sums are desired > vars <- paste('A', c(1, 4, 10, 15, 31), sep = '') > > # One approach: use the select = argument of subset() to restrict > # the variables under consideration: > dt2[, lapply(subset(.SD, select = vars), sum), by = 'x'] > x A1 A4 A10 A15 A31 > [1,] 1 18 21 22 22 24 > [2,] 2 20 13 27 23 21 > [3,] 3 22 15 16 23 15 > > # Use the with = FALSE construct of data.table to do the same: > dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'] > x y A1 A4 A10 A15 A31 > [1,] 1 1 11 13 12 11 16 > [2,] 1 2 7 8 10 11 8 > [3,] 2 1 10 4 16 7 11 > [4,] 2 3 10 9 11 16 10 > [5,] 3 2 11 8 7 11 7 > [6,] 3 3 11 7 9 12 8 > > # For this example, it is the same (apart from the key variables) as > dt2[, vars, with = FALSE] > > Not bad for this small example, but what happens in a much larger data table? > > To find out, I created a 10000 x 1000 matrix that I converted into a > data table, added two grouping variables of 100 levels each and then > tried both approaches above again. Performance isn't bad when > summarizing over one variable, but there is a definite hit when two > variables are summarized. [It makes some sense since one is grouping > over 10000 level combinations rather than 100, but once again, keep > reading.] Curiously, it makes no difference if there is one key > variable or two, which made me wonder what the preferred approach is > in this circumstance. > > m <- matrix(rpois(10000000, 10), nrow = 10000) > m <- as.data.table(m) > m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100)) > setkey(m, 'x') > dim(m) > # [1] 10000 1002 > > # Randomly select 150 variables from the 1000 > vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '') > length(vars) > # [1] 150 > key(m) > # [1] "x" >> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x']) > user system elapsed > 0.75 0.00 0.75 >> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x']) > user system elapsed > 0.64 0.00 0.64 >> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y']) > user system elapsed > 53.65 0.00 53.85 >> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']) > user system elapsed > 44.21 0.01 44.35 > > m2 <- data.table(m, key = 'x, y') > rm(m) > key(m2) > # [1] "x" "y" >> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y']) > user system elapsed > 53.54 0.00 53.73 >> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']) > user system elapsed > 44.30 0.04 44.60 > > The first question in the wiki > (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says > to use the columns directly rather than to rely on .SD. I wanted to > know how to pass new names to the summaries instead of overwriting the > original variable names. For the fun of it, I tried the following: > > select <- sample(1:1000, 150, replace = FALSE) > vars <- paste('A', select, sep = '') > outvars <- paste('S', select, sep = '') > > # Create a long expression of the form 'list(..., Sn = sum(An), ...)', > # n a subscript from 1 to 150. > expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep = > ''), sep = '=', collapse = ','), > ')', sep = '') > u <- m2[, eval(parse(text = expr)), by = 'x'] >> dim(u) > # [1] 100 151 seems reasonable... > > This seemed to run rather fast, so I decided to time it: > >> system.time(m2[, eval(parse(text = expr)), by = 'x']) > user system elapsed > 0.03 0.00 0.03 >> system.time(m2[, eval(parse(text = expr)), by = 'x, y']) > user system elapsed > 1.05 0.00 1.04 > > I've got to admit, this is not the approach I would have taken > normally, is certainly not intuitively obvious to me and flouts the > usual advice to avoid the eval(parse(text = )) mantra, but the data > don't lie :) Please tell me there's a more code-efficient way to do > this (the new variable names included), because my 'solution' was a > complete kludge and accidental kewpie prize. > > Cheers, > Dennis > >> Message: 1 >> Date: Thu, 14 Jul 2011 16:36:11 -0400 >> From: Joseph Voelkel <[email protected]> >> Subject: [datatable-help] Skipping some Vi names >> To: "[email protected]" >> <[email protected]> >> Message-ID: >> <[email protected]> >> Content-Type: text/plain; charset="us-ascii" >> >> I don't use data.table too much (though I probably should use it more...). >> >> I was surprised at the results below. It appears that the name V1 gets >> assigned to the first result, but then the keys ("in the background") are >> assigned the next set of Vi names, creating a gap in the names depending on >> the number of keys. I would like to see the Vi names appear in their >> natural, sequential, order. Not a show stopper, but it's annoying. (I have >> over 40 Vi's and it'd be good to have them numbered more rationally.) Thanks. >> >>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x") >>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"] >> x V1 V3 V4 >> [1,] 1 5 17 29 >> [2,] 2 7 19 31 >> [3,] 3 9 21 33 >>> key(dt)<-c("x","y") >>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"] >> x y V1 V4 V5 >> [1,] 1 1 1 7 13 >> [2,] 1 2 4 10 16 >> [3,] 2 1 2 8 14 >> [4,] 2 3 5 11 17 >> [5,] 3 2 3 9 15 >> [6,] 3 3 6 12 18 >> >> >> >> Joseph G. Voelkel, Ph.D. >> Professor, Center for Quality and Applied Statistics >> Kate Gleason College of Engineering >> Rochester Institute of Technology >> V 585-475-2231 >> F 585-475-5959 >> [email protected] >> > _______________________________________________ > datatable-help mailing list > [email protected] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
