Items 1 and 5 on the wiki are relevant here for speed comparisons:
http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table
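
For the uniqueSummary3 call asked about in the quoted question below, a minimal sketch along the lines of the .SDcols approach might look like the following. It assumes a data.table version that supports .SDcols and a character vector for `by`. The helper name nUnique and the toy data are made up here for illustration, and nothing in this sketch has been timed against the parse(text=paste(...)) route.

```r
library(data.table)

# Hypothetical helper, named after the call in the quoted question.
# Returns, for each column in `vars`, the mean number of distinct
# values per group defined by `key`.
uniqueSummary3 <- function(df, key, vars) {
  DT <- as.data.table(df)
  nUnique <- function(x) length(unique(x))  # illustrative name

  # .SDcols limits .SD to the columns named in `vars`, so the same
  # j expression works for any character vector of column names.
  summaryDT <- DT[, lapply(.SD, nUnique), by = key, .SDcols = vars]

  # Keep only the summary columns (drop the grouping columns), then
  # average each one across groups.
  sapply(summaryDT[, vars, with = FALSE], mean)
}

# Small made-up data set, just to show the call shape.
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  dx = c(1, 1, 2, 3, 3),
                  rx = c(1, 2, 3, 4, 4))
uniqueSummary3(df = toy, key = "id", vars = c("dx", "rx"))
```

Because the column names arrive as a plain character vector, no substitute() or eval() is needed in the caller, which is the main reason to prefer this shape over building the j expression by hand.
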
"Matthew Dowle" <[email protected]> wrote in message news:[email protected]... > > Something like this : > >> DT = as.data.table(testData) >> f = function(x)length(unique(x)) >> vars = "dx" >> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1] > dx > 44.2212 >> vars = c("dx","rx") >> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1] # same again, just >> different vars > dx rx > 44.2212 48.7814 >> vars = c("dx","rx","clinic") >> mean(DT[,lapply(.SD,f),by="id",.SDcols=vars])[-1] # same again, just >> different vars > dx rx clinic > 44.2212 48.7814 9.9331 >> > > Chris' suggestion of parse(text=paste(...)) is another way you could do it > (and may be more efficient). > > Matthew > > > "Matthew Dowle" <[email protected]> wrote in message > news:[email protected]... >> Hi, >> >> Welcome. >> Just to check you've found .SD, [,lapply(.SD,sum),by=...], and .SDcols? >> .SD consist of all columns other than the grouping columns, which seems >> similar >> to what this line is doing? : >>> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE]) >> >> Matthew >> >> >> "Erik Iverson" <[email protected]> wrote in message >> news:cakzgw12zwppt3psqjcdh_smdoqolajruv7cv64uywb8pxk1...@mail.gmail.com... >> Hello, >> >> Thank you for providing the data.table package, I think it will be >> very useful to me going forward. I have a question about passing >> around expressions, and have come up with an example to show what I'm >> after. 
>> >> library(data.table) >> ## test data >> N <- 500000 >> set.seed(100) >> testData <- data.frame(id = c(sample(1:10000, N, replace = TRUE)), >> clinic = c(sample(1:10, N, replace = TRUE)), >> dx = c(sample(1:200, N, replace = TRUE)), >> rx = c(sample(1:1000, N, replace = TRUE))) >> >> ## want to know mean number of dx per ID >> mean(tapply(testData$dx, testData$id, >> function(x) length(unique(x)))) ## 44.2212 >> >> ## in my real use case, I want to run this with different 'by' >> ## variables, so let's write a function and try to use data.table, >> ## call the function uniqueSummary1 >> >> uniqueSummary1 <- function(df, key) { >> DT <- data.table(df) >> key(DT) <- key >> >> summaryDT <- DT[, list(length(unique(dx)), >> length(unique(rx))), by = key] >> >> mean(summaryDT[,list(V1, V2)]) >> >> } >> >> ## agrees with tapply >> uniqueSummary1(df = testData, key = c("id")) >> >> ## The above works great, but isn't general, since in my real use >> ## case, I won't know dx and rx are the variables of interest. I want >> ## to be able to pass them in as arguments. This is exactly what FAQ >> ## 1.6 is, so let's use that solution to define uniqueSummary2 >> >> uniqueSummary2 <- function(df, key, vars) { >> DT <- data.table(df) >> key(DT) <- key >> >> sList <- substitute(vars) >> summaryDT <- DT[, eval(sList), by = key] >> ncols <- ncol(summaryDT) >> >> mean(summaryDT[,(ncols-length(sList) + 2):ncols, with = FALSE]) >> } >> >> uniqueSummary2(df = testData, key = c("id"), >> vars = list(length(unique(dx)), >> length(unique(rx)), >> length(unique(clinic)))) >> >> ## uniqueSummary2 is better, but relies on me repeating the >> ## "length(unique())" bit several times. 
Ideally, I'd just like to >> ## pass in a list of QUOTED vars to summarize, like the following >> ## hypothetical call to my yet-unwritten uniqueSummary3 function: >> >> uniqueSummary3(df = testData, key = c("id"), >> vars = c("dx", "rx", "clinic")) >> >> I assume I can somehow construct the expression for the j index inside >> my function, based on the 'vars' character vector, but am stuck on >> how. Any ideas? >> >> Thanks so much, >> Erik _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
