That .I example is quite interesting. May I ask: suppose I wanted to get the 5 row numbers for each subset (say 5 of them) and save them in a list instead of a data.table (kind of like dlply), to be able to use the lapply idiom later on. Is there a way to do that?
Thanks.
--
ASB.

PS: Is this question hijacking the thread? Sorry if it is.

On Fri, Jan 18, 2013 at 12:01 AM, Matthew Dowle wrote:
>
> Glad all clear. Given the follow up head() examples, yes, .SD is there
> for just that purpose. Something like this :
>
> DT[, head(.SD,2), by=colA]
>
> is idiomatic in data.table. That's like a "select top 2 * from" in SQL, but
> by group.
>
> Also things like :
>
> DT[, .SD[1:2], by=colA]   # similar provided all groups have at least 2 rows
>
> DT[, .SD[-1], by=colA]    # all but the first
>
> DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]
>
> It's when you don't use all the data in .SD that it's wasteful to use it
> (since data.table needs to populate it for each group before running j).
>
> So in the subset of rows of .SD examples above, something like this can
> be a lot faster :
>
> w = DT[,head(.I,5),by=colA][[2]]   # top 5 row numbers of each group
>
> DT[w]   # select those rows
>
> is the same but much faster than
>
> DT[, head(.SD,5), by=colA]
>
> especially if each of the groups has a lot more rows than 5.
>
> Hope that adds some colour.
>
> On 17.01.2013 17:33, David Bellot wrote:
>
> indeed, it makes sense now, as what is passed to the function is indeed a
> data.table and not a data.frame.
>
> Thanks guys for your help. Now I'm a convinced data.table user.
> Best,
> David
>
> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl wrote:
>>
>> Hey David,
>>
>> I thought your problem may have been a typo, but I realized that it is
>> in fact a subtle difference between the way data.table and data.frame
>> work.
>>
>> One must provide unquoted names in the `j' expression for a
>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>> will evaluate to just "y" and hence the error).
>>
>> There are tricks around it like using with=FALSE, or using the
>> data.frame notation x.dt[["y"]]. But once again, you will find such
>> examples and explanations of idiomatic data.table expressions in the
>> vignettes.
>>
>> --
>> ASB.
>>
>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot wrote:
>> > Hi Matthew,
>> >
>> > I read indeed the introduction but I wasn't sure about the way to write
>> > it. Hence my question.
>> >
>> > In fact, I do agree if the function would sum(sqrt(y)), but in my case,
>> > I would like to do something like
>> >
>> > f
>> >
>> > It's a small example for the sake of simplicity, just to illustrate that
>> > I really want to have access to the full sub data.frame (the d variable)
>> > and not just one column.
>> >
>> > Best,
>> > David
>> >
>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle wrote:
>> >>
>> >> Akhil,
>> >>
>> >> Kind of, but defining :
>> >>
>> >> my.func <- function(d) {
>> >>     sum(sqrt(d[["y"]]))
>> >> }
>> >>
>> >> followed by
>> >>
>> >> x.dt[ , my.func(.SD), by=x]
>> >>
>> >> isn't very data.table'ish. In fact the
>> >> advice is to avoid .SD if possible, for speed.
>> >>
>> >> We'd forget my.func, and just do :
>> >>
>> >> x.dt[, sum(sqrt(y)), by=x]
>> >>
>> >> That is how we recommend it to be used, and
>> >> allows data.table to optimize the query (which
>> >> use of .SD may prevent).
>> >>
>> >> David - have you read the introduction vignette and have
>> >> you worked through example(data.table) at the prompt?
>> >>
>> >> Matthew
>> >>
>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>> >>>
>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put in
>> >>> the exact function you were throwing at ddply earlier. There are other
>> >>> special names like .SD that you can find in the data.table FAQs.
>> >>>
>> >>> Let's see:
>> >>> R> require(plyr)
>> >>> Loading required package: plyr
>> >>>
>> >>> R> require(data.table)
>> >>> Loading required package: data.table
>> >>> data.table 1.8.7 For help type: help("data.table")
>> >>>
>> >>> R> x.df
>> >>> R> x.dt
>> >>> R>
>> >>> R> my.func <- function(d) {
>> >>> +     sum(sqrt(d[["y"]]))
>> >>> + }
>> >>> R>
>> >>> R> # The plyr way:
>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>> >>> R>
>> >>> R> # The data.table way:
>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>> >>> R>
>> >>> R> ans.plyr
>> >>>   x       V1
>> >>> 1 a 10.61387
>> >>> 2 b 11.85441
>> >>>
>> >>> R> ans.dt
>> >>>    x       V1
>> >>> 1: a 10.61387
>> >>> 2: b 11.85441
>> >>>
>> >>> For more help, try this on an R prompt:
>> >>>
>> >>> R> vignette('datatable-faq')
>> >>>
>> >>> --
>> >>> ASB.
>> >>>
>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I've been looking all around the web without a clear answer to this
>> >>>> trivial problem. I'm sure I'm not looking where I should:
>> >>>>
>> >>>> in fact, I want to replace my use of ddply from the plyr package by
>> >>>> data.table. One of my main uses is to group a big data.frame by a
>> >>>> group of variables and do something on this sub data.frame:
>> >>>>
>> >>>> ddply( my_df, my_grouping_var, function (d) { do something with d } )
>> >>>> ----> d is a data.frame again
>> >>>>
>> >>>> and it's slow on big data.frame.
>> >>>>
>> >>>> However, I don't really understand how to redo the same thing with a
>> >>>> data.table. Basically if "j" in a data.table is equivalent to the
>> >>>> select clause in SQL, then how do I do SELECT * FROM etc...
>> >>>>
>> >>>> I want to be able to pass a function like in ddply that will receive
>> >>>> not only a few columns but the full subset that is selected by the
>> >>>> "by" clause.
>> >>>>
>> >>>> Thanks...
>> >>>> Best,
>> >>>> David
>> >>>>
>> >>>> _______________________________________________
>> >>>> datatable-help mailing list
>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
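
For reference, a minimal sketch of what the question at the top of this message is after: the top 5 row numbers of each group kept in a plain list, one element per group, ready for lapply. This is only one possible approach, not an answer given in the thread. The table DT, its columns colA and val, and the names w, rows and res are made up for illustration; the grouping call itself follows Matthew's head(.I, 5) example, and the list is built with base R's split().

```r
library(data.table)

# Hypothetical example data: 5 groups (a..e) of 10 rows each
DT <- data.table(colA = rep(letters[1:5], each = 10), val = 1:50)

# Top 5 row numbers of each group, as in the head(.I, 5) example above;
# the result has the grouping column colA and the row numbers in V1
w <- DT[, head(.I, 5), by = colA]

# Turn the row numbers into a named list, one integer vector per group
rows <- split(w$V1, w$colA)

# The lapply idiom: one small data.table per group
res <- lapply(rows, function(i) DT[i])
```

Here rows is an ordinary named list, so anything that works on lists (lapply, sapply, Map) can be applied to it, and the .SD machinery is not populated for each group.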
