Well, yes, I agree. In fact, I had meant to mention the alternative you suggested, but then it slipped my mind.
I did point him to the datatable-faq. :)

On Thu, Jan 17, 2013 at 10:37 PM, Matthew Dowle <[email protected]> wrote:
>
> Akhil,
>
> Kind of, but defining :
>
>     my.func <- function (d) {
>         sum(sqrt(d[["y"]]))
>     }
>
> followed by
>
>     x.dt[ , my.func(.SD), by=x]
>
> isn't very data.table'ish. In fact the
> advice is to avoid .SD if possible, for speed.
>
> We'd forget my.func, and just do :
>
>     x.dt[, sum(sqrt(y)), by=x]
>
> That is how we recommend it to be used, and
> allows data.table to optimize the query (which
> use of .SD may prevent).
>
> David - have you read the introduction vignette and have
> you worked through example(data.table) at the prompt?
>
> Matthew
>
>
>
> On 17.01.2013 16:53, Akhil Behl wrote:
>>
>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> the exact function you were throwing at ddply earlier. There are other
>> special names like .SD that you can find in the data.table FAQs.
>>
>> Let's see:
>> R> require(plyr)
>> Loading required package: plyr
>>
>> R> require(data.table)
>> Loading required package: data.table
>> data.table 1.8.7 For help type: help("data.table")
>>
>> R> x.df <- data.frame(x=letters[1:2], y=1:10)
>> R> x.dt <- data.table(x.df)
>> R>
>> R> my.func <- function (d) { # Define a function on the subset
>> +     sum(sqrt(d[["y"]]))
>> + }
>> R>
>> R> # The plyr way:
>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> R>
>> R> # The data.table way:
>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> R>
>> R> ans.plyr
>>   x       V1
>> 1 a 10.61387
>> 2 b 11.85441
>>
>> R> ans.dt
>>    x       V1
>> 1: a 10.61387
>> 2: b 11.85441
>>
>> For more help, try this on an R prompt:
>>
>> R> vignette('datatable-faq')
>>
>> --
>> ASB.
>>
>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <[email protected]>
>> wrote:
>>>
>>> Hi,
>>>
>>> I've been looking all around the web without a clear answer to this
>>> trivial problem. I'm sure I'm not looking where I should:
>>>
>>> in fact, I want to replace my use of ddply from the plyr package by
>>> data.table. One of my main uses is to group a big data.frame by a group of
>>> variables and do something on this sub data.frame:
>>>
>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>>> ----> d is a data.frame again
>>>
>>> and it's slow on big data.frames.
>>>
>>>
>>> However, I don't really understand how to redo the same thing with a
>>> data.table. Basically if "j" in a data.table is equivalent to the select
>>> clause in SQL, then how do I do SELECT * FROM etc...
>>>
>>> I want to be able to pass a function like in ddply that will receive not
>>> only a few columns but the full subset that is selected by the "by"
>>> clause.
>>>
>>> Thanks...
>>> Best,
>>> David
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [email protected]
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
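
For reference, the two forms discussed in this thread can be put side by side. This is only a minimal sketch, assuming the data.table package is installed. It reuses the toy data and my.func from the messages above; the names ans.sd and ans.direct are made up here for illustration.

```r
library(data.table)

# Same toy data as in the thread: groups "a" and "b" alternate over y = 1..10.
x.dt <- data.table(x = rep(c("a", "b"), times = 5), y = 1:10)

# Form 1: hand the whole per-group subset (.SD) to a function, ddply-style.
my.func <- function(d) {
  sum(sqrt(d[["y"]]))
}
ans.sd <- x.dt[, my.func(.SD), by = x]

# Form 2: refer to the column directly in j, as recommended in the thread.
ans.direct <- x.dt[, sum(sqrt(y)), by = x]

# Both forms should agree group by group.
stopifnot(isTRUE(all.equal(ans.sd$V1, ans.direct$V1)))

print(ans.sd)
print(ans.direct)
```

Both calls should give the same per-group sums as the output quoted above (about 10.61387 for a and 11.85441 for b). The .SD form is the one to reach for when the function really needs the full subset, while the direct form is the one the thread recommends whenever only specific columns are involved.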
