Re: [datatable-help] columns in .SD with grouping ad-hoc using "by"

Arunkumar Srinivasan Sun, 12 May 2013 06:14:53 -0700

Matthew, 

Yes, that clarifies things. It makes more sense, especially with the option of 
being able to use `.SDcols` to include it. And thanks for the change in 
documentation as well; adds more clarity.


Best,
Arun


On Sunday, May 12, 2013 at 2:45 PM, Matthew Dowle wrote:

> On 12.05.2013 12:54, Arunkumar Srinivasan wrote:
> > I just realised that I sent it only to MatthewDowle. So, sending it again. 
> > Sorry @Matthew for the double email.
> > Matthew,
> > >> .BY is available to j already for that reason, does that work? .BY isn't 
> > >> a column of .SD because i) it's the same value for every row of .SD i.e. 
> > >> .BY[[1]] is length 1 and contains this particular group (replicating the 
> > >> same value would be wasteful)
> > DT[, print(.BY), by = list(grp = x %/% 2)]
> > $grp
> > [1] 0
> > $grp
> > [1] 1
> > $grp
> > [1] 2
> > 
> > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x"
> >    y
> > 1: 6
> >    y
> > 1: 7
> > 2: 8
> >     y
> > 1:  9
> > 2: 10
> > 
> > My question is not as to why the BY column is not available in .SD. Rather, 
> > since .BY does not have column "x" in it (rather the result of x%/% 2), why 
> > does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" 
> > should be available to .SD is my point.
> > 
> > 
> > 
> 
> Oh I see now.  Yes data.table inspects the expressions used in 'by' and 
> considers any columns used there as grouping columns and excludes those from 
> .SD.  An example is a date column containing daily observations.  DT[, 
> lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column.
> In ?data.table I've just changed :
> .SD is a data.table containing the Subset of x's Data for each group, 
> excluding the group column(s).
> to
> .SD is a data.table containing the Subset of x's Data for each group, 
> excluding any columns used in 'by' (or 'keyby').
> Further answer below ...
> > >> but more significantly  ii) it is often a character group name where 
> > >> running an aggregation function like sum() would trip up on it.
> > Again, I don't think so because, I am not asking for .BY columns to be in 
> > .SD.
> > DT[, grp := x%/% = 2]
> > DT[, lapply(.SD, sum), by=grp]
> > must be equal to:
> > DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be 
> > available to .SD as it's not the grouping column
> > 
> > 
> > 
> 
> This makes sense in this case because x can be sum()-ed,  but isn't true in 
> general like the month(date) case above.
> In these cases you can use .SDcols to include all columns, even the ones used 
> by by :
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
> grp y
> 1: 0 6
> 2: 1 15
> 3: 2 19
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)]
> grp x y
> 1: 0 1 6
> 2: 1 5 15
> 3: 2 9 19
> > DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)] 
> x y
> 1: 1 6
> x y
> 1: 2 7
> 2: 3 8
> x y
> 1: 4 9
> 2: 5 10 
> > Arun
> > 
> > 
> 
>  
>  
> 
>

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] columns in .SD with grouping ad-hoc using "by"

Reply via email to