Excellent, everyone, thanks so much. .SD was the feature I did not know about; it makes it easy to pass in the column names as strings, which is what I wanted. I was also able to get a large speedup by implementing the function f using an idea from the wiki, as:

f <- function(x) length(.Internal(unique(x, FALSE, FALSE)))

Thanks again; very useful package!

On Wed, Sep 28, 2011 at 11:58 AM, Matthew Dowle <[email protected]> wrote:
> items 1 and 5 on the wiki are relevant here, for speed comparisons :
> http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table
>
> "Matthew Dowle" <[email protected]> wrote in message
> news:[email protected]...
>>
>> Something like this :
>>
>>> DT = as.data.table(testData)
>>> f = function(x)length(unique(x))
>>> vars = "dx"
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]
>>      dx
>> 44.2212
>>> vars = c("dx","rx")
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>>      dx      rx
>> 44.2212 48.7814
>>> vars = c("dx","rx","clinic")
>>> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1]  # same again, just different vars
>>      dx      rx  clinic
>> 44.2212 48.7814  9.9331
>>>
>>
>> Chris' suggestion of parse(text=paste(...)) is another way you could do it
>> (and may be more efficient).
>>
>> Matthew
>>
>>
>> "Matthew Dowle" <[email protected]> wrote in message
>> news:[email protected]...
>>> Hi,
>>>
>>> Welcome.
>>> Just to check you've found .SD, [,lapply(.SD,sum),by=...], and .SDcols?
>>> .SD consist of all columns other than the grouping columns, which seems
>>> similar to what this line is doing? :
>>>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>>>
>>> Matthew
>>>
>>>
>>> "Erik Iverson" <[email protected]> wrote in message
>>> news:cakzgw12zwppt3psqjcdh_smdoqolajruv7cv64uywb8pxk1...@mail.gmail.com...
>>> Hello,
>>>
>>> Thank you for providing the data.table package, I think it will be
>>> very useful to me going forward. I have a question about passing
>>> around expressions, and have come up with an example to show what I'm
>>> after.
>>>
>>> library(data.table)
>>> ## test data
>>> N <- 500000
>>> set.seed(100)
>>> testData <- data.frame(id = c(sample(1:10000, N, replace = TRUE)),
>>>                        clinic = c(sample(1:10, N, replace = TRUE)),
>>>                        dx = c(sample(1:200, N, replace = TRUE)),
>>>                        rx = c(sample(1:1000, N, replace = TRUE)))
>>>
>>> ## want to know mean number of dx per ID
>>> mean(tapply(testData$dx, testData$id,
>>>             function(x) length(unique(x)))) ## 44.2212
>>>
>>> ## in my real use case, I want to run this with different 'by'
>>> ## variables, so let's write a function and try to use data.table,
>>> ## call the function uniqueSummary1
>>>
>>> uniqueSummary1 <- function(df, key) {
>>>   DT <- data.table(df)
>>>   key(DT) <- key
>>>
>>>   summaryDT <- DT[, list(length(unique(dx)),
>>>                          length(unique(rx))), by = key]
>>>
>>>   mean(summaryDT[,list(V1, V2)])
>>>
>>> }
>>>
>>> ## agrees with tapply
>>> uniqueSummary1(df = testData, key = c("id"))
>>>
>>> ## The above works great, but isn't general, since in my real use
>>> ## case, I won't know dx and rx are the variables of interest. I want
>>> ## to be able to pass them in as arguments. This is exactly what FAQ
>>> ## 1.6 is, so let's use that solution to define uniqueSummary2
>>>
>>> uniqueSummary2 <- function(df, key, vars) {
>>>   DT <- data.table(df)
>>>   key(DT) <- key
>>>
>>>   sList <- substitute(vars)
>>>   summaryDT <- DT[, eval(sList), by = key]
>>>   ncols <- ncol(summaryDT)
>>>
>>>   mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE])
>>> }
>>>
>>> uniqueSummary2(df = testData, key = c("id"),
>>>                vars = list(length(unique(dx)),
>>>                            length(unique(rx)),
>>>                            length(unique(clinic))))
>>>
>>> ## uniqueSummary2 is better, but relies on me repeating the
>>> ## "length(unique())" bit several times. Ideally, I'd just like to
>>> ## pass in a list of QUOTED vars to summarize, like the following
>>> ## hypothetical call to my yet-unwritten uniqueSummary3 function:
>>>
>>> uniqueSummary3(df = testData, key = c("id"),
>>>                vars = c("dx", "rx", "clinic"))
>>>
>>> I assume I can somehow construct the expression for the j index inside
>>> my function, based on the 'vars' character vector, but am stuck on
>>> how. Any ideas?
>>>
>>> Thanks so much,
>>> Erik
>
>
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
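
For reference, the uniqueSummary3 function asked about in the original question can be sketched with the .SD / .SDcols approach suggested in the thread. This is only a minimal sketch and not code posted by anyone in the thread. The argument names follow the question, f is the plain length(unique(x)) version rather than the .Internal variant, the small toy data set is made up so the example runs quickly, and colMeans() is swapped in for calling mean() on a whole table, on the assumption that a current R version is in use (where mean() no longer averages a data frame column by column).

```r
library(data.table)

# Hypothetical helper, named after the unwritten function in the question.
#   df   : a data.frame
#   key  : character vector naming the grouping column(s)
#   vars : character vector naming the columns to summarize
uniqueSummary3 <- function(df, key, vars) {
  DT <- as.data.table(df)
  f <- function(x) length(unique(x))
  # For each group, .SD holds only the columns listed in .SDcols,
  # so lapply() applies f to exactly the requested variables.
  counts <- DT[, lapply(.SD, f), by = key, .SDcols = vars]
  # Average the per-group counts, leaving out the grouping column(s).
  colMeans(counts[, vars, with = FALSE])
}

# Small made-up data set (an assumption, much smaller than testData).
set.seed(1)
toy <- data.frame(id     = sample(1:20,  500, replace = TRUE),
                  clinic = sample(1:10,  500, replace = TRUE),
                  dx     = sample(1:200, 500, replace = TRUE),
                  rx     = sample(1:100, 500, replace = TRUE))

res <- uniqueSummary3(df = toy, key = "id", vars = c("dx", "rx", "clinic"))
print(res)
```

Because vars is an ordinary character vector, the same call works for one column or several, and the dx entry should agree with the tapply calculation from the question when both are run on the same data.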
