Yes, use a list column, if I understand correctly:

    DT[, list(list(head(.I, 5))), by=colA]

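As a rough sketch of how that list column could be turned into a plain list for lapply (which is what the question asks for), something like the following should work. The table DT, its columns colA and val, and the names rows, row.list and uniq are all made up for illustration; only head(.I, 5) with by= comes from the thread. The same pattern holds vectors of different lengths per group, such as the unique values of a column.

```r
library(data.table)

# Hypothetical example data: three groups of 7, 3 and 6 rows
DT <- data.table(
  colA = rep(c("a", "b", "c"), times = c(7, 3, 6)),
  val  = c(1, 1, 2, 3, 3, 4, 5,  9, 9, 8,  6, 7, 7, 7, 6, 5)
)

# One row per group; V1 is a list column holding up to 5 row numbers each
rows <- DT[, list(list(head(.I, 5))), by = colA]

# Pull the list column out as a named plain list, ready for lapply
row.list <- setNames(rows$V1, rows$colA)
subsets  <- lapply(row.list, function(idx) DT[idx])

# Same idea with cells of varying length: unique values of val per group
uniq <- DT[, list(vals = list(unique(val))), by = colA]
```

Groups with fewer than 5 rows simply contribute a shorter vector, so no padding is needed.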
More useful, perhaps, is returning the unique items of a column by group, where the length of the vector in each cell varies.

> That .I example is quite interesting. May I ask:
>
> Suppose I wanted to get the 5 row numbers for each subset (say 5 of
> them) and save them in a list instead of a data.table (kind of like
> dlply) to be able to use the lapply idiom later on. Is there a way to
> do that?
>
> Thanks.
>
> --
> ASB.
>
> PS: Is this question hijacking the thread? Sorry, if it is.
>
> On Fri, Jan 18, 2013 at 12:01 AM, Matthew Dowle <[email protected]>
> wrote:
>>
>> Glad all clear. Given the follow up head() examples, yes, .SD is there
>> for just that purpose. Something like this :
>>
>>     DT[, head(.SD,2), by=colA]
>>
>> is idiomatic in data.table. That's like a "select top 2 * from" in SQL,
>> but by group.
>>
>> Also things like :
>>
>>     DT[, .SD[1:2], by=colA]   # similar provided all groups have at least 2 rows
>>     DT[, .SD[-1], by=colA]    # all but the first
>>     DT[, someFunctionThatWantsADataFrame(..., data=.SD), by=colA]
>>
>> It's when you don't use all the data in .SD that it's wasteful to use it
>> (since data.table needs to populate it for each group before running j).
>>
>> So in the subset of rows of .SD examples above, something like this can
>> be a lot faster :
>>
>>     w = DT[,head(.I,5),by=colA][[2]]   # top 5 row numbers of each group
>>     DT[w]                              # select those rows
>>
>> is the same but much faster than
>>
>>     DT[, head(.SD,5), by=colA]
>>
>> especially if each of the groups has a lot more rows than 5.
>>
>> Hope that adds some colour.
>>
>> On 17.01.2013 17:33, David Bellot wrote:
>>
>> indeed, it makes sense now, as what is passed to the function is indeed
>> a data.table and not a data.frame.
>>
>> Thanks guys for your help. Now I'm a convinced data.table user.
>> Best,
>> David
>>
>> On Thu, Jan 17, 2013 at 5:25 PM, Akhil Behl <[email protected]> wrote:
>>>
>>> Hey David,
>>>
>>> I thought your problem may have been a typo, but I realized that it is
>>> in fact a subtle difference between the way data.table and data.frame
>>> work.
>>>
>>> One must provide unquoted names in the `j' expression for a
>>> data.table, i.e. one can say x.dt[ , y] but not x.dt[ , "y"] (which
>>> will evaluate to just "y" and hence the error).
>>>
>>> There are tricks around it like using with=FALSE, or using the
>>> data.frame notation x.dt[["y"]]. But once again, you will find such
>>> examples and explanations of idiomatic data.table expressions in the
>>> vignettes.
>>>
>>> --
>>> ASB.
>>>
>>> On Thu, Jan 17, 2013 at 10:42 PM, David Bellot <[email protected]>
>>> wrote:
>>> > Hi Matthew,
>>> >
>>> > I did read the introduction but I wasn't sure about the way to
>>> > write it. Hence my question.
>>> >
>>> > In fact, I do agree if the function were just sum(sqrt(y)), but in my
>>> > case, I would like to do something like
>>> >
>>> > f
>>> >
>>> > It's a small example for the sake of simplicity, just to illustrate
>>> > that I really want to have access to the full sub data.frame (the d
>>> > variable) and not just one column.
>>> >
>>> > Best,
>>> > David
>>> >
>>> > On Thu, Jan 17, 2013 at 5:07 PM, Matthew Dowle
>>> > <[email protected]> wrote:
>>> >>
>>> >> Akhil,
>>> >>
>>> >> Kind of, but defining :
>>> >>
>>> >>     my.func <- function(d) {
>>> >>         sum(sqrt(d[["y"]]))
>>> >>     }
>>> >>
>>> >> followed by
>>> >>
>>> >>     x.dt[ , my.func(.SD), by=x]
>>> >>
>>> >> isn't very data.table'ish. In fact the
>>> >> advice is to avoid .SD if possible, for speed.
>>> >>
>>> >> We'd forget my.func, and just do :
>>> >>
>>> >>     x.dt[, sum(sqrt(y)), by=x]
>>> >>
>>> >> That is how we recommend it to be used, and
>>> >> allows data.table to optimize the query (which
>>> >> use of .SD may prevent).
>>> >>
>>> >> David - have you read the introduction vignette and have
>>> >> you worked through example(data.table) at the prompt?
>>> >>
>>> >> Matthew
>>> >>
>>> >> On 17.01.2013 16:53, Akhil Behl wrote:
>>> >>>
>>> >>> If I am not wrong, you are looking for `.SD'. In fact you can put
>>> >>> in the exact function you were throwing at ddply earlier. There are
>>> >>> other special names like .SD that you can find in the data.table
>>> >>> FAQs.
>>> >>>
>>> >>> Let's see:
>>> >>> R> require(plyr)
>>> >>> Loading required package: plyr
>>> >>>
>>> >>> R> require(data.table)
>>> >>> Loading required package: data.table
>>> >>> data.table 1.8.7 For help type: help("data.table")
>>> >>>
>>> >>> R> x.df
>>> >>> R> x.dt
>>> >>> R>
>>> >>> R> my.func <- function(d) {
>>> >>> +    sum(sqrt(d[["y"]]))
>>> >>> + }
>>> >>> R>
>>> >>> R> # The plyr way:
>>> >>> R> ddply(x.df, "x", my.func) -> ans.plyr
>>> >>> R>
>>> >>> R> # The data.table way:
>>> >>> R> x.dt[ , my.func(.SD), by=x] -> ans.dt
>>> >>> R>
>>> >>> R> ans.plyr
>>> >>>   x       V1
>>> >>> 1 a 10.61387
>>> >>> 2 b 11.85441
>>> >>>
>>> >>> R> ans.dt
>>> >>>    x       V1
>>> >>> 1: a 10.61387
>>> >>> 2: b 11.85441
>>> >>>
>>> >>> For more help, try this on an R prompt:
>>> >>>
>>> >>> R> vignette('datatable-faq')
>>> >>>
>>> >>> --
>>> >>> ASB.
>>> >>>
>>> >>> On Thu, Jan 17, 2013 at 9:49 PM, David Bellot
>>> >>> <[email protected]> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I've been looking all around the web without a clear answer to
>>> >>>> this trivial problem. I'm sure I'm not looking where I should:
>>> >>>>
>>> >>>> in fact, I want to replace my use of ddply from the plyr package
>>> >>>> by data.table.
>>> >>>> One of my main uses is to group a big data.frame by a
>>> >>>> group of variables and do something on this sub data.frame:
>>> >>>>
>>> >>>>     ddply( my_df, my_grouping_var, function (d) { do something with d } )
>>> >>>>     ----> d is a data.frame again
>>> >>>>
>>> >>>> and it's slow on a big data.frame.
>>> >>>>
>>> >>>> However, I don't really understand how to redo the same thing with
>>> >>>> a data.table. Basically if "j" in a data.table is equivalent to the
>>> >>>> select clause in SQL, then how do I do SELECT * FROM etc...
>>> >>>>
>>> >>>> I want to be able to pass a function like in ddply that will
>>> >>>> receive not only a few columns but the full subset that is selected
>>> >>>> by the "by" clause.
>>> >>>>
>>> >>>> Thanks...
>>> >>>> Best,
>>> >>>> David
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> datatable-help mailing list
>>> >>>> [email protected]
>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
