Matthew, Yes, that clarifies things. It makes more sense, especially with the option of being able to use `.SDcols` to include it. And thanks for the change in documentation as well; adds more clarity.
Best, Arun On Sunday, May 12, 2013 at 2:45 PM, Matthew Dowle wrote: > On 12.05.2013 12:54, Arunkumar Srinivasan wrote: > > I just realised that I sent it only to MatthewDowle. So, sending it again. > > Sorry @Matthew for the double email. > > Matthew, > > >> .BY is available to j already for that reason, does that work? .BY isn't > > >> a column of .SD because i) it's the same value for every row of .SD i.e. > > >> .BY[[1]] is length 1 and contains this particular group (replicating the > > >> same value would be wasteful) > > DT[, print(.BY), by = list(grp = x %/% 2)] > > $grp > > [1] 0 > > $grp > > [1] 1 > > $grp > > [1] 2 > > > > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x" > > y > > 1: 6 > > y > > 1: 7 > > 2: 8 > > y > > 1: 9 > > 2: 10 > > > > My question is not as to why the BY column is not available in .SD. Rather, > > since .BY does not have column "x" in it (rather the result of x%/% 2), why > > does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" > > should be available to .SD is my point. > > > > > > > > Oh I see now. Yes data.table inspects the expressions used in 'by' and > considers any columns used there as grouping columns and excludes those from > .SD. An example is a date column containing daily observations. DT[, > lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column. > In ?data.table I've just changed : > .SD is a data.table containing the Subset of x's Data for each group, > excluding the group column(s). > to > .SD is a data.table containing the Subset of x's Data for each group, > excluding any columns used in 'by' (or 'keyby'). > Further answer below ... > > >> but more significantly ii) it is often a character group name where > > >> running an aggregation function like sum() would trip up on it. > > Again, I don't think so because, I am not asking for .BY columns to be in > > .SD. > > DT[, grp := x%/% = 2] > > DT[, lapply(.SD, sum), by=grp] > > must be equal to: > > DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be > > available to .SD as it's not the grouping column > > > > > > > > This makes sense in this case because x can be sum()-ed, but isn't true in > general like the month(date) case above. > In these cases you can use .SDcols to include all columns, even the ones used > by by : > > DT[, lapply(.SD, sum), by=list(grp=x%/%2)] > grp y > 1: 0 6 > 2: 1 15 > 3: 2 19 > > DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)] > grp x y > 1: 0 1 6 > 2: 1 5 15 > 3: 2 9 19 > > DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)] > x y > 1: 1 6 > x y > 1: 2 7 > 2: 3 8 > x y > 1: 4 9 > 2: 5 10 > > Arun > > > > > > > > >
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
