And I just posted something that nicely used colwise() in conjunction with a vector of variable names on R-help just a few minutes ago. A look at the colwise() help page shows that there are several ways to input a variable list for use with colwise(): a bquoted, comma-separated, unquoted string of variables (e.g., .(A, B, C)), a one-sided formula interface or a vector of (quoted) variable names, which was my particular concern. I was ready to recant my assertion, but you guys are too quick and sharp for me today :) So much good stuff going on in the R-related lists the past two days to which I can either contribute or learn from...
Back to the point, an efficient variable selection mechanism in conjunction with a processing function that could optionally take a user-contributed (anonymous) function would be a welcome feature in data.table. Cheers, Dennis On Fri, Jul 15, 2011 at 11:29 AM, Chris Neff <[email protected]> wrote: > Just chiming in to say something similar to colwise from plyr would be quite > nice. You could just carry around a vector of variable names, then do > something DT[ ,colwise( f, var_names), by=by_names ]. > > On 15 July 2011 14:06, Dennis Murphy <[email protected]> wrote: >> >> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou >> <[email protected]> wrote: >> > Hi Dennis, >> > >> > I didn't see your post before I sent my latest reply. >> > >> > Nice detective work! >> >> Thanks, Steve. I just followed my nose and the docs, which I have >> conveniently kept in a small binder for such occasions :) Like JV, I >> don't use data.table every day, so some of its idiosyncracies get >> cobwebbed in the hard drive over time. The wiki entries helped a lot. >> >> > >> > For what it's worth, from what I understand your >> > "punchline"/kewpie-prize solution is so much faster because it avoids >> > building the .SD data.table within each group. >> >> That was my deduction from having read the first entry in the wiki. I >> still can't believe I got that thing to work :) >> >> > >> > I'll let Matthew leave a more detailed comment, since he's (obviously) >> > much more intimately familiar w/ the inner voodoo of data.table. But >> > as a last comment -- if the speed differences are so drastic because >> > of the cost of creating the .SD data.table, maybe we should think >> > about taking some "inspiration" from plyr and define a similar >> > `colwise` function -- which would operate across each "column" of >> > supposedly-build .SD object applying a function to each of them w/o >> > actually building an .SD object itself. >> >> Your clairvoyance skills are clearly operating today :) More >> seriously, this is what I would consider an 'obvious' "big-data" >> problem - I could easily see situations arising in finance and genomic >> applications where a fairly large subset of variables of the same >> type, but not necessarily all of them, need to be summarized in a >> particular way. The colwise() functions would be problematic as well >> in the scenario described in my eariler post, but I haven't tried >> ddply() to verify that assertion so I could be mistaken. >> >> It would be *really* helpful to have a convenient, fast mechanism in >> data.table that allows one to substitute a (possibly large) vector of >> variable names into a function. Alas, I don't have any bright ideas >> about how to program it. Fortunately, there are some nice functions in >> R to select variable subsets efficiently in data frames (e.g., the >> grep() family of functions, regular expressions, %in% and so on), but >> I don't know how that would translate easily to data.table() since the >> internals are so different. >> >> Looking forward to the team's take on this... >> >> Dennis >> >> > >> > -steve >> > >> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <[email protected]> >> > wrote: >> >> Hi: >> >> >> >> <A bunch snipped because I get the archives in digest form> >> >> >> >> Re Prof. Voelkel's recent posts: >> >> >> >> (1) Quoting does not work well in data.table; this is mentioned in >> >> several of the FAQs. Apropos to this discussion, some of the relevant >> >> ones include 1.2, 1.6 and 2.1; there may be others :) >> >> >> >> (2) Steve's response seems to be the right way to go (although see >> >> below), but I thought I'd up the stakes a little and assume that Prof. >> >> Voelkel has a large number of variables, only a subset of which he may >> >> want summarized in a particular go. To that end, I created the >> >> following toy data frame cum data.table; this is as much for my own >> >> edification as anyone else's (which explains the eventual length of >> >> this post...I got curious :) >> >> >> >> This goes against the advice given in the first example of the >> >> data.table wiki, but if you have, say, 100 variables to select out of >> >> a possible 1000, it doesn't make sense to list them individually as >> >> recommended on the wiki. (But see below...) >> >> >> >> library('data.table') >> >> set.seed(1043) >> >> m <- matrix(rpois(240, 10), nrow = 6) >> >> colnames(m) <- paste('A', 1:40, sep = '') >> >> m <- as.data.frame(m) >> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = >> >> 'x') >> >> dim(dt2) >> >> # [1] 6 42 ...so far, so good >> >> >> >> # Subset of variables for which sums are desired >> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '') >> >> >> >> # One approach: use the select = argument of subset() to restrict >> >> # the variables under consideration: >> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x'] >> >> x A1 A4 A10 A15 A31 >> >> [1,] 1 18 21 22 22 24 >> >> [2,] 2 20 13 27 23 21 >> >> [3,] 3 22 15 16 23 15 >> >> >> >> # Use the with = FALSE construct of data.table to do the same: >> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'] >> >> x y A1 A4 A10 A15 A31 >> >> [1,] 1 1 11 13 12 11 16 >> >> [2,] 1 2 7 8 10 11 8 >> >> [3,] 2 1 10 4 16 7 11 >> >> [4,] 2 3 10 9 11 16 10 >> >> [5,] 3 2 11 8 7 11 7 >> >> [6,] 3 3 11 7 9 12 8 >> >> >> >> # For this example, it is the same (apart from the key variables) as >> >> dt2[, vars, with = FALSE] >> >> >> >> Not bad for this small example, but what happens in a much larger data >> >> table? >> >> >> >> To find out, I created a 10000 x 1000 matrix that I converted into a >> >> data table, added two grouping variables of 100 levels each and then >> >> tried both approaches above again. Performance isn't bad when >> >> summarizing over one variable, but there is a definite hit when two >> >> variables are summarized. [It makes some sense since one is grouping >> >> over 10000 level combinations rather than 100, but once again, keep >> >> reading.] Curiously, it makes no difference if there is one key >> >> variable or two, which made me wonder what the preferred approach is >> >> in this circumstance. >> >> >> >> m <- matrix(rpois(10000000, 10), nrow = 10000) >> >> m <- as.data.table(m) >> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100)) >> >> setkey(m, 'x') >> >> dim(m) >> >> # [1] 10000 1002 >> >> >> >> # Randomly select 150 variables from the 1000 >> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '') >> >> length(vars) >> >> # [1] 150 >> >> key(m) >> >> # [1] "x" >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x']) >> >> user system elapsed >> >> 0.75 0.00 0.75 >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x']) >> >> user system elapsed >> >> 0.64 0.00 0.64 >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y']) >> >> user system elapsed >> >> 53.65 0.00 53.85 >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']) >> >> user system elapsed >> >> 44.21 0.01 44.35 >> >> >> >> m2 <- data.table(m, key = 'x, y') >> >> rm(m) >> >> key(m2) >> >> # [1] "x" "y" >> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, >> >>> y']) >> >> user system elapsed >> >> 53.54 0.00 53.73 >> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']) >> >> user system elapsed >> >> 44.30 0.04 44.60 >> >> >> >> The first question in the wiki >> >> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says >> >> to use the columns directly rather than to rely on .SD. I wanted to >> >> know how to pass new names to the summaries instead of overwriting the >> >> original variable names. For the fun of it, I tried the following: >> >> >> >> select <- sample(1:1000, 150, replace = FALSE) >> >> vars <- paste('A', select, sep = '') >> >> outvars <- paste('S', select, sep = '') >> >> >> >> # Create a long expression of the form 'list(..., Sn = sum(An), ...)', >> >> # n a subscript from 1 to 150. >> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep = >> >> ''), sep = '=', collapse = ','), >> >> ')', sep = '') >> >> u <- m2[, eval(parse(text = expr)), by = 'x'] >> >>> dim(u) >> >> # [1] 100 151 seems reasonable... >> >> >> >> This seemed to run rather fast, so I decided to time it: >> >> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x']) >> >> user system elapsed >> >> 0.03 0.00 0.03 >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y']) >> >> user system elapsed >> >> 1.05 0.00 1.04 >> >> >> >> I've got to admit, this is not the approach I would have taken >> >> normally, is certainly not intuitively obvious to me and flouts the >> >> usual advice to avoid the eval(parse(text = )) mantra, but the data >> >> don't lie :) Please tell me there's a more code-efficient way to do >> >> this (the new variable names included), because my 'solution' was a >> >> complete kludge and accidental kewpie prize. >> >> >> >> Cheers, >> >> Dennis >> >> >> >>> Message: 1 >> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400 >> >>> From: Joseph Voelkel <[email protected]> >> >>> Subject: [datatable-help] Skipping some Vi names >> >>> To: "[email protected]" >> >>> <[email protected]> >> >>> Message-ID: >> >>> >> >>> <[email protected]> >> >>> Content-Type: text/plain; charset="us-ascii" >> >>> >> >>> I don't use data.table too much (though I probably should use it >> >>> more...). >> >>> >> >>> I was surprised at the results below. It appears that the name V1 gets >> >>> assigned to the first result, but then the keys ("in the background") are >> >>> assigned the next set of Vi names, creating a gap in the names depending >> >>> on >> >>> the number of keys. I would like to see the Vi names appear in their >> >>> natural, sequential, order. Not a show stopper, but it's annoying. (I >> >>> have >> >>> over 40 Vi's and it'd be good to have them numbered more rationally.) >> >>> Thanks. >> >>> >> >>>> >> >>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x") >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"] >> >>> x V1 V3 V4 >> >>> [1,] 1 5 17 29 >> >>> [2,] 2 7 19 31 >> >>> [3,] 3 9 21 33 >> >>>> key(dt)<-c("x","y") >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"] >> >>> x y V1 V4 V5 >> >>> [1,] 1 1 1 7 13 >> >>> [2,] 1 2 4 10 16 >> >>> [3,] 2 1 2 8 14 >> >>> [4,] 2 3 5 11 17 >> >>> [5,] 3 2 3 9 15 >> >>> [6,] 3 3 6 12 18 >> >>> >> >>> >> >>> >> >>> Joseph G. Voelkel, Ph.D. >> >>> Professor, Center for Quality and Applied Statistics >> >>> Kate Gleason College of Engineering >> >>> Rochester Institute of Technology >> >>> V 585-475-2231 >> >>> F 585-475-5959 >> >>> [email protected] >> >>> >> >> _______________________________________________ >> >> datatable-help mailing list >> >> [email protected] >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > >> > >> > >> > -- >> > Steve Lianoglou >> > Graduate Student: Computational Systems Biology >> > | Memorial Sloan-Kettering Cancer Center >> > | Weill Medical College of Cornell University >> > Contact Info: http://cbio.mskcc.org/~lianos/contact >> > >> _______________________________________________ >> datatable-help mailing list >> [email protected] >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
