Glad it's all clear. Given the follow-up head() examples, yes, .SD is there
for just that purpose. Something like this :

    DT[, head(.SD,2), by=colA]

is idiomatic in data.table. That's like a "select top 2 * from" in SQL,
but by group. Also things like :

    DT[, .SD[1:2], by=colA]   # similar, provided all groups have at least 2 rows
    DT[, .SD[-1], by=colA]    # all but the first row of each group
    DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]

It's when you don't use all the data in .SD that it's wasteful to use it
(since data.table needs to populate it for each group before running j).
So in the row-subset examples above, something like this can be a lot
faster :

    w = DT[, head(.I,5), by=colA][[2]]   # top 5 row numbers of each group
    DT[w]                                # select those rows

selects the same rows but is much faster than

    DT[, head(.SD,5), by=colA]

especially if each group has a lot more than 5 rows.

Hope that adds some colour.

On 17.01.2013 17:33, David Bellot wrote:

> indeed, it makes sense now, as what is passed to the function is indeed
> a data.table and not a data.frame.
>
> Thanks guys for your help. Now I'm a convinced data.table user.
> Best,
> David
>
> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <[email protected]> wrote:
>
>> Hey David,
>>
>> I thought your problem may have been a typo, but I realized that it is
>> in fact a subtle difference between the way data.table and data.frame
>> work.
>>
>> One must provide unquoted names in the `j' expression for a
>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>> will evaluate to just "y" and hence the error).
>>
>> There are tricks around it like using with=FALSE, or using the
>> data.frame notation x.dt[["y"]]. But once again, you will find such
>> examples and explanations of idiomatic data.table expressions in the
>> vignettes.
>>
>> --
>> ASB.
>>
>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <[email protected]> wrote:
>> > Hi Matthew,
>> >
>> > I read indeed the introduction but I wasn't sure about the way to write it.
>> > Hence my question.
>> >
>> > In fact, I do agree if the function would sum(sqrt(y)), but in my case, I
>> > would like to do something like
>> >
>> > f <- [...]
>> >
>> > It's a small example for the sake of simplicity, just to illustrate that I
>> > really want to have access to the full sub data.frame (the d variable) and
>> > not just one column.
>> >
>> > Best,
>> > David
>> >
>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle <[email protected]>
>> > wrote:
>> >>
>> >> Akhil,
>> >>
>> >> Kind of, but defining :
>> >>
>> >> my.func <- function(d) {
>> >>   sum(sqrt(d[["y"]]))
>> >> }
>> >>
>> >> followed by
>> >>
>> >> x.dt[ , my.func(.SD), by=x]
>> >>
>> >> isn't very data.table'ish. In fact the
>> >> advice is to avoid .SD if possible, for speed.
>> >>
>> >> We'd forget my.func, and just do :
>> >>
>> >> x.dt[, sum(sqrt(y)), by=x]
>> >>
>> >> That is how we recommend it to be used, and
>> >> allows data.table to optimize the query (which
>> >> use of .SD may prevent).
>> >>
>> >> David - have you read the introduction vignette and have
>> >> you worked through example(data.table) at the prompt?
>> >>
>> >> Matthew
>> >>
>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>> >>>
>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> >>> the exact function you were throwing at ddply earlier. There are other
>> >>> special names like .SD that you can find in the data.table FAQs.
>> >>>
>> >>> Let's see:
>> >>> R> require(plyr)
>> >>> Loading required package: plyr
>> >>>
>> >>> R> require(data.table)
>> >>> Loading required package: data.table
>> >>> data.table 1.8.7 For help type: help("data.table")
>> >>>
>> >>> R> x.df <- [...]
>> >>> R> x.dt <- [...]
>> >>> R>
>> >>> R> my.func <- function(d) {
>> >>> +   sum(sqrt(d[["y"]]))
>> >>> + }
>> >>> R>
>> >>> R> # The plyr way:
>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> >>> R>
>> >>> R> # The data.table way:
>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> >>> R>
>> >>> R> ans.plyr
>> >>>   x       V1
>> >>> 1 a 10.61387
>> >>> 2 b 11.85441
>> >>>
>> >>> R> ans.dt
>> >>>    x       V1
>> >>> 1: a 10.61387
>> >>> 2: b 11.85441
>> >>>
>> >>> For more help, try this on an R prompt:
>> >>>
>> >>> R> vignette('datatable-faq')
>> >>>
>> >>> --
>> >>> ASB.
>> >>>
>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I've been looking all around the web without a clear answer to this
>> >>>> trivial
>> >>>> problem. I'm sure I'm not looking where I should:
>> >>>>
>> >>>> in fact, I want to replace my use of ddply from the plyr package by
>> >>>> data.table. One of my main use is to group a big data.frame by a group
>> >>>> of
>> >>>> variable and do something on this sub data.frame:
>> >>>>
>> >>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>> >>>> ----> d is a data.frame again
>> >>>>
>> >>>> and it's slow on big data.frame.
>> >>>>
>> >>>>
>> >>>> However, I don't really understand how to redo the same thing with a
>> >>>> data.table. Basically if "j" in a data.table is equivalent to the select
>> >>>> clause in SQL, then how do I do SELECT * FROM etc...
>> >>>>
>> >>>> I want to be able to pass a function like in ddply that will receive not
>> >>>> only a few columns but the full subset that is selected by the "by"
>> >>>> clause.
>> >>>>
>> >>>> Thanks...
>> >>>> Best,
>> >>>> David
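
For anyone who wants to try the .I approach from the top of this message next to the .SD form, here is a minimal sketch. The table DT and its columns colA and val are made up for illustration (3 groups of 4 rows); they are not data from this thread, and how much faster the second form is will depend on your data and data.table version.

```r
library(data.table)

# Hypothetical example data: 3 groups of 4 rows each
DT <- data.table(colA = rep(c("a", "b", "c"), each = 4),
                 val  = 1:12)

# .SD form: populate .SD for every group, keep its first 2 rows
viaSD <- DT[, head(.SD, 2), by = colA]

# .I form: collect the first 2 row numbers of each group,
# then subset the table once with those row numbers
w <- DT[, head(.I, 2), by = colA][[2]]
viaI <- DT[w]

# Both results hold the same rows
identical(viaSD$val, viaI$val)
```

The .I form only ever builds a vector of row numbers per group, which is why it avoids the cost of populating .SD when only a few rows of each group are wanted.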
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
