Hi Steve,
Now that you've brought this back up, what do you think you would
prefer? For example, using my (admittedly contrived) original example:
result <- some.big.data.table[, by=list(colA, colB), {
    ## Sometimes I want to know what the current values of
    ## colA and colB are in here to get some more info. Maybe
    ## we can have .BY:
    xref <- more.data[J(.BY[1], .BY[2]), mult='all'] ## or something
    ## ...
}]
Should it be `J(.BY[1], .BY[2])`, or is something like `J(colA, colB)`
more natural, do you think?
'J(colA, colB)' is perfect if you know the column names in advance. That
is not true in my case. I created a minimal example of a possible
application of a '.BY' construct:
> dt <- data.table(x=c(0,1,0,1), y=c(1,0,1,0))
> dt
     x y
[1,] 0 1
[2,] 1 0
[3,] 0 1
[4,] 1 0
From this table, I want the row sum for each group, i.e. "select x + y
from dt group by x, y" in SQL. This would be:
> setkey(dt, x, y)
> dt[,sum(x[1], y[1]), by=list(x,y)]
     x y V1
[1,] 0 1  1
[2,] 1 0  1
But what if dt can have an arbitrary number of (grouping) columns with
arbitrary names? If the grouping columns are given as
groupCols <- c("x", "y")
then the following is possible:
> expr <- parse(text = sprintf("sum(%s)", paste(groupCols, "[1]",
+                sep="", collapse=", ")))
> dt[,eval(expr), by=groupCols]
     x y V1
[1,] 0 1  1
[2,] 1 0  1
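(For reference, expr here is just the parsed call built from groupCols:
> expr
expression(sum(x[1], y[1]))
so eval(expr) in j reproduces the hand-written sum(x[1], y[1]) for
whatever names groupCols happens to contain.)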
Now, this is certainly uglier than
> dt[, sum(.BY), by = groupCols]
My actual application is that I apply decision tree models (rpart) to a
large number of binary patterns. In order to save computation time, I
classify each distinct pattern only once. So what I basically do is
group by all attributes and apply the model once to each group, roughly
as in the sketch below.
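A simplified, self-contained toy version of that pattern (the model, the
column names a/b/cls/attrCols and the join back onto the full table are
made up here purely for illustration, not taken from my real code):

library(data.table)
library(rpart)

## Toy training data and a small classification tree.
set.seed(1)
train <- data.frame(a   = rbinom(200, 1, 0.5),
                    b   = rbinom(200, 1, 0.5),
                    cls = factor(sample(c("yes", "no"), 200, replace = TRUE)))
fit <- rpart(cls ~ a + b, data = train, method = "class")

## Large table of binary patterns to classify.
new.data <- data.table(a = rbinom(1e5, 1, 0.5),
                       b = rbinom(1e5, 1, 0.5))
attrCols <- c("a", "b")
setkeyv(new.data, attrCols)

## Group by all attributes and apply the model once per distinct pattern.
scores <- new.data[, list(pred = as.character(
                              predict(fit,
                                      newdata = data.frame(a = a[1], b = b[1]),
                                      type = "class"))),
                   by = attrCols]

## Join the per-pattern predictions back onto the full table.
setkeyv(scores, attrCols)
result <- scores[new.data]

This is exactly the place where a '.BY' (or similar) construct would let
me build the newdata row without spelling out the attribute columns
inside j.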
Andreas