Ok, good idea. Naturally that leads to allowing expressions of column names in mult, too, I guess. Hm. That can't easily be done at that point in the code, so perhaps better to leave that to .SD[i]. Maybe at some point .SD subsetting could be made more efficient internally to evaluate .SD's i first and then create .SD. Also, I don't know if there's scope to add .SDrows, but hopefully mult covers that.
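A minimal sketch of the .SD[i] route, and of the "evaluate i first" idea done by hand, on a hypothetical toy table (DT, Year and fundID are made up for illustration). It assumes a data.table version that provides the .I symbol. This is only the usual user-level workaround, not the internal optimisation discussed above:

```r
library(data.table)

# Hypothetical toy table: three funds in each of three years.
DT <- data.table(Year = rep(1:3, each = 3L), fundID = 1:9)

# 2nd row of each group via .SD: simple, but .SD is built for every group.
second_sd <- DT[, .SD[2L], by = Year]

# Same rows via .I: collect the row numbers per group first,
# then subset DT once with those row numbers.
second_i <- DT[DT[, .I[2L], by = Year]$V1]

# One random row per group, using the same pattern.
random_i <- DT[DT[, .I[sample(.N, 1L)], by = Year]$V1]
```

The .I form selects the same rows as .SD[2L] here; it is longer to type, which is the kind of thing an integer mult would shorten.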
FR added :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1735&group_id=240&atid=978

On Sun, 2012-01-08 at 14:14 -0800, Steven C. Bagley wrote:
> I like it. It is starting to look the same as the standard vector indexing
> operations, and it might be useful to think through all of what exists there
> already. For example, what about allowing logical vectors as well in the
> standard way, so mult=c(TRUE,FALSE) would select odd numbered rows?
>
> --Steve
>
> On Jan 8, 2012, at 11:25 AM, Matthew Dowle wrote:
>
> > How about allowing mult to be integer (or an expression that evaluates
> > to integer) :
> >
> >     DT[X, mult="first"]
> >     DT[X, mult=1L]    # same
> >
> >     DT[X, mult="last"]
> >     DT[X, mult=.N]    # same
> >
> >     DT[X, .SD[2]]     # 2nd row of each group (inefficient due to .SD, there
> >                       # are other longer alternatives)
> >     DT[X, mult=2L]    # same, but efficient and simple
> >
> >     DT[X, mult="random"]
> >     DT[X, mult=sample(.N,size=1)]    # same, but more general
> >
> >     DT[X, mult=-1L]   # all but the first of each group
> >
> > Matthew
> >
> > On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:
> >> The mult argument is becoming its own little programming language. I
> >> worry that this is going to get complicated in an ad hoc way. What if
> >> someone wants random, but with weighting? Each new value of mult is
> >> really shorthand for an R language construct. Maybe there is a more
> >> general way to express these ideas using existing R constructs? (I'm
> >> not sure how to do this consistently. I'm merely making an
> >> observation.)
> >>
> >> --Steve
> >>
> >> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
> >>
> >>> Thanks for your feedback. @Chris: I guess Matthew's example and
> >>> yours do not really match because he doesn't call sample(dt,...),
> >>> but sample(dt[i, which=TRUE],... His option, though, returns all the
> >>> rows that match between the keys of dt and i and takes a random
> >>> sample of size 1 from that, so I guess it does what I expected.
> >>> Nevertheless, I think an option mult="random" would still be useful.
> >>> Here is why:
> >>>
> >>> I guess my first example was a little bit too simplistic, sorry for
> >>> that! Here is an updated, more realistic example of what I do and
> >>> some hints about my current implementation of mult="random":
> >>>
> >>>     require(data.table)
> >>>     rawData <- data.table(fundID = 1:1e5,
> >>>                           Year = rep(1:10, times=1e4),
> >>>                           key = "Year")
> >>>     #Let's have 10000 runs; in each run we want to draw a fund with a
> >>>     #year that is set dynamically
> >>>     intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> >>>
> >>>     #Best solution I have come up with so far using the current options
> >>>     #in data.table
> >>>     #Is there one that can beat mult="random" and is easy for the user
> >>>     #to implement?
> >>>     foo1 <- function(n, intJoin, rawData) {
> >>>       x <- integer(n)
> >>>       for (r in seq_len(nrow(intJoin))) {
> >>>         x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
> >>>       }
> >>>       return(rawData[x])
> >>>     }
> >>>     system.time(finalData <- foo1(10000, intJoin, rawData))
> >>>     #   user  system elapsed
> >>>     # 43.827   0.000  44.232
> >>>     #Check that it does what it should: match random entities to the
> >>>     #exact year in intJoin
> >>>     cbind(finalData, intJoin)
> >>>     #       fundID Year V1
> >>>     #  [1,]  46556    6  6
> >>>     #  [2,]  77642    2  2
> >>>     #  [3,]  17325    5  5
> >>>     #  [4,]  36617    7  7
> >>>     #  [5,]  90697    7  7
> >>>     #  [6,]   4536    6  6
> >>>     #  [7,]  22273    3  3
> >>>     #  [8,]  46825    5  5
> >>>     #  [9,]  65788    8  8
> >>>     # [10,]  14153    3  3
> >>>
> >>>     #My implementation of mult="random"
> >>>     system.time(finalData <- rawData[intJoin, mult="random"])
> >>>     #  user  system elapsed
> >>>     # 0.324   0.016   0.337
> >>>     #Pretty fast and easy to understand
> >>>     #Check that it does what it should: match random entities to the
> >>>     #exact year in intJoin
> >>>     cbind(finalData, intJoin)
> >>>     #       Year fundID V1
> >>>     #  [1,]    6  39626  6
> >>>     #  [2,]    2  98552  2
> >>>     #  [3,]    5  85425  5
> >>>     #  [4,]    7  24637  7
> >>>     #  [5,]    7  74797  7
> >>>     #  [6,]    6  87626  6
> >>>     #  [7,]    3  88973  3
> >>>     #  [8,]    5  60335  5
> >>>     #  [9,]    8  62298  8
> >>>     # [10,]    3  23283  3
> >>>
> >>> If you want to try it out yourself: Just call
> >>>
> >>>     fixInNamespace("[.data.table", pos="package:data.table")
> >>>
> >>> and change the following lines in the editor (this applies to
> >>> data.table 1.7.7):
> >>>
> >>> OLD LINE:
> >>>     if (!mult %in% c("first", "last", "all"))
> >>>         stop("mult argument can only be 'first','last' or 'all'")
> >>> NEW LINE:
> >>>     if (!mult %in% c("first","last","all", "random"))
> >>>         stop("mult argument can only be 'first','last', 'all', or 'random'")
> >>>
> >>> and
> >>>
> >>> OLD LINES:
> >>>     else {
> >>>         irows = if (mult == "first")
> >>>             idx.start
> >>>         else idx.end
> >>>         lengths = rep(1L, length(irows))
> >>>     }
> >>>
> >>> NEW LINES:
> >>>     } else if (mult=="first") {
> >>>         irows = idx.start
> >>>         lengths=rep(1L,length(irows))
> >>>     } else if (mult=="last") {
> >>>         irows = idx.end
> >>>         lengths=rep(1L,length(irows))
> >>>     } else {
> >>>         irows = mapply(function(x1, x2) {sample(x1:x2, size=1)},
> >>>                        idx.start, idx.end)
> >>>         lengths = rep(1L,length(irows))
> >>>     }
> >>>
> >>> However, I don't know what's going on in the line
> >>>
> >>>     .Call("binarysearch", i, x, as.integer(leftcols - 1),
> >>>           as.integer(rightcols - 1), haskey(i), roll,
> >>>           rolltolast, idx.start, idx.end, PACKAGE = "data.table")
> >>>
> >>> I figured out that idx.start and idx.end are changed by this
> >>> function call and I guess at this point in the function it should
> >>> always be that idx.start and idx.end are of the same length and both
> >>> return only integer values that represent rows of x, but here I'm
> >>> not 100% sure. So maybe additional checks are needed in the else
> >>> clause when the mapply-function is called.
> >>>
> >>> So let me know what you think. I will join the project independently
> >>> of that particular issue and try to help, but I guess I should start
> >>> with simple things. So if there is any help needed on documentation
> >>> checking or stuff like that, just let me know and I'll try my best!
> >>>
> >>> Christoph
> >>>
> >>> On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <[email protected]> wrote:
> >>>     That isn't doing quite what he does. I don't know what you
> >>>     expected
> >>>
> >>>         sample(dt, size=1)
> >>>
> >>>     to do but it seems to essentially do this:
> >>>
> >>>         dt[sample(1:ncol(dt),size=1),]
> >>>
> >>>     It is picking a random column number and then returning that row
> >>>     instead. Try it for yourself:
> >>>
> >>>         dt=data.table(x=1:10,y=1:10,z=1:10)
> >>>         sample(dt, size=1)
> >>>
> >>>     The only rows you will get are 1,1,1 2,2,2 and 3,3,3. Caveat as
> >>>     usual is I'm on 1.7.1 until my crashing bug is fixed so apologies
> >>>     if this works properly in later versions.
> >>>
> >>>     Note that this diverges from what sample(df, size=1) does, which
> >>>     picks a random column and returns that whole column.
> >>>
> >>>     What he really wants is to pick a random row from each subset (I
> >>>     think). None of your examples do that and I can't think of a
> >>>     simpler way than what he suggests.
> >>>
> >>> On 6 January 2012 03:34, Matthew Dowle <[email protected]> wrote:
> >>>> Very keen for direct contributions in that way, happy to help you with
> >>>> svn etc, and you joining the project.
> >>>>
> >>>> In this particular example, how about :
> >>>>
> >>>>     rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> >>>>
> >>>> This solves the inefficiency of the 1st step; i.e.,
> >>>>     intDT <- rawData[J("eu"), mult="all"]
> >>>> which copies a subset of all the columns, whilst retaining flexibility
> >>>> for the user so user can easily sample 2 rows, or any other R method to
> >>>> select a random subset.
> >>>>
> >>>> Because of potential scoping conflicts (say a column was called
> >>>> "rawData" i.e. the same name as the table), to be more robust :
> >>>>
> >>>>     x = sample(rawData[J("eu"), which=TRUE],size=1)
> >>>>     rawData[x]
> >>>>
> >>>> This is slightly different because when i is a single name (x in this
> >>>> case), data.table knows the caller must mean the x in calling scope, not
> >>>> the column called "x" (if any). Is two steps like this ok? I'm
> >>>> guessing it was really the inefficiency that was the motivation?
> >>>>
> >>>> Matthew
> >>>>
> >>>> On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I run a Monte Carlo simulation on a data.table and do that currently
> >>>>> with a loop: on every run, I choose a subset of rows subject to
> >>>>> certain criteria and from those rows I take a random element.
> >>>>> Currently, I do the following: Let's say I have funds from two regions
> >>>>> ("eu" and "us") and I want to choose a random fund from "eu" (could be
> >>>>> "us" in the next run and a different region in the third):
> >>>>>
> >>>>>     library(data.table)
> >>>>>     rawData <- data.table(fundID = letters,
> >>>>>                           compGeo = rep(c("us", "eu"), each=13))
> >>>>>     setkey(rawData, "compGeo")
> >>>>>     intDT <- rawData[J("eu"), mult="all"]
> >>>>>     intDT[sample.int(nrow(intDT), size=1)]
> >>>>>
> >>>>> So my idea is to just give the user the option mult="random", which
> >>>>> does this in one step. What do you think about that feature request?
> >>>>>
> >>>>> With respect to the implementation: I changed a few lines in the
> >>>>> function '[.data.table' and got this to run on my local data.table
> >>>>> version, so I guess I could implement it (as far as I can see, one
> >>>>> just needs to change some R code). However, I haven't done extensive
> >>>>> testing and I'm not an expert on shared projects and subversion (never
> >>>>> did that actually), so I guess I would need some help to start with
> >>>>> and the confirmation I couldn't break anything ;-)
> >>>>>
> >>>>> Christoph
> >>>>>
> >>>>> _______________________________________________
> >>>>> datatable-help mailing list
> >>>>> [email protected]
> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
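On the mapply(function(x1, x2) {sample(x1:x2, size=1)}, ...) line in the proposed mult="random" patch quoted in this thread: one extra check that may be worth having is for groups with exactly one row, because base R's sample(x, size) treats a single number x >= 1 as 1:x. Below is a minimal sketch in base R only. The idx.start and idx.end values are made up, and pick_one is a hypothetical helper for illustration, not part of data.table:

```r
# Hypothetical first/last row numbers of each matching group, standing in
# for the idx.start and idx.end vectors from the patch.
idx.start <- c(1L, 4L, 9L)
idx.end   <- c(3L, 8L, 9L)   # the last group has exactly one row

# Pitfall: sample(9:9, size=1) is sample(9, size=1), a draw from 1:9.
# Drawing an offset with sample.int() stays inside the group.
pick_one <- function(start, end) {
  stopifnot(length(start) == length(end), all(end >= start))
  n <- end - start + 1L                       # rows in each group
  offset <- vapply(n, function(k) sample.int(k, 1L), integer(1))
  start + offset - 1L
}

set.seed(1)
irows <- pick_one(idx.start, idx.end)
```

The stopifnot() line is the sort of sanity check wondered about for the else clause. How unmatched rows are represented in idx.start and idx.end is not covered here, since that depends on what the binarysearch call returns.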
