Re: [datatable-help] What's your opinion on the feature request: add option mult="random"

Matthew Dowle Mon, 09 Jan 2012 15:03:04 -0800

Ok, good idea. Naturally that leads to allowing expressions of column
names in mult, too, I guess. Hm. That can't easily be done at that point
in the code, so perhaps better to leave that to .SD[i]. Maybe at some
point .SD subsetting could be made more efficient internally to
evaluate .SD's i first and then create .SD.  Also, I don't know if
there's scope to add .SDrows, but hopefully mult covers that.
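
For readers following along, here is a minimal sketch of the "work out the rows first, then subset once" idea using ordinary constructs. It assumes a data.table release that provides the .I symbol (not the 1.7.7 version discussed in this thread); the table DT, its columns and the variable name pick are made up for illustration.

```r
library(data.table)

# Hypothetical example data: 3 groups of 4 rows each.
DT <- data.table(grp = rep(c("a", "b", "c"), each = 4), val = 1:12)

set.seed(1)

# Straightforward: build .SD for every group, then take one random row of it.
r1 <- DT[, .SD[sample(.N, 1L)], by = grp]

# Alternative: compute only the chosen row number per group via .I,
# then subset the table once with those row numbers.
pick <- DT[, .I[sample(.N, 1L)], by = grp]$V1
r2 <- DT[pick]
```

Both results contain one row per group; the second form avoids materialising .SD for each group, which is likely where the saving would come from.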


FR added :

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1735&group_id=240&atid=978



On Sun, 2012-01-08 at 14:14 -0800, Steven C. Bagley wrote:
> I like it. It is starting to look the same as the standard vector indexing 
> operations, and it might be useful to think through all of what exists there 
> already. For example, what about allowing logical vectors as well in the 
> standard way, so mult=c(TRUE,FALSE) would select odd numbered rows? 
> 
> --Steve
> 
> On Jan 8, 2012, at 11:25 AM, Matthew Dowle wrote:
> 
> > How about allowing mult to be integer (or an expression that evaluates
> > to integer) :
> > 
> > DT[X, mult="first"]
> > DT[X, mult=1L]   # same
> > 
> > DT[X, mult="last"]
> > DT[X, mult=.N]   # same
> > 
> > > DT[X, .SD[2]]  # 2nd row of each group (inefficient due to .SD; there
> > >                # are other, longer alternatives)
> > DT[X, mult=2L] # same, but efficient and simple
> > 
> > DT[X, mult="random"]
> > DT[X, mult=sample(.N,size=1)]  # same, but more general
> > 
> > DT[X, mult=-1L]   # all but the first of each group
> > 
> > Matthew
> > 
> > On Sat, 2012-01-07 at 19:07 -0800, Steven C. Bagley wrote:
> >> The mult argument is becoming its own little programming language. I
> >> worry that this is going to get complicated in an ad hoc way. What if
> >> someone wants random, but with weighting? Each new value of mult is
> >> really shorthand for an R language construct. Maybe there is a more
> >> general way to express these ideas using existing R constructs? (I'm
> >> not sure how to do this consistently. I'm merely making an
> >> observation.)
> >> 
> >> 
> >> --Steve
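
On the weighting question, one way to express it with existing R constructs is to pass prob to sample.int() inside a grouped j. This is only a sketch, assuming a data.table version that supports .I; the column names grp, id and w are hypothetical, and the weights are chosen so the outcome is deterministic.

```r
library(data.table)

DT <- data.table(grp = rep(c("a", "b"), each = 3),
                 id  = 1:6,
                 w   = c(0, 0, 1,   # group "a": only id 3 has non-zero weight
                         1, 0, 0))  # group "b": only id 4 has non-zero weight

# One weighted draw per group: sample.int() returns a position within the
# group, and .I maps that position back to a row number of DT.
rows <- DT[, .I[sample.int(.N, size = 1L, prob = w)], by = grp]$V1
weighted_pick <- DT[rows]
```

With these weights the draw always lands on ids 3 and 4; with general weights each row is chosen in proportion to w within its group.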
> >> 
> >> 
> >> On Jan 6, 2012, at 5:58 AM, Christoph Jäckel wrote:
> >> 
> >>> Thanks for your feedback. @Chris: I guess Matthew's example and
> >>> yours do not really match because he doesn't call sample(dt, ...),
> >>> but sample(dt[i, which=TRUE], ...). His option, though, returns all
> >>> the rows that match between the keys of dt and i and takes a random
> >>> sample of size 1 from that, so I guess it does what I expected.
> >>> Nevertheless, I think an option mult="random" would still be useful.
> >>> Here is why:
> >>> 
> >>> 
> >>> I guess my first example was a little bit too simplistic, sorry for
> >>> that! Here is an updated, more realistic example of what I do and
> >>> some hints about my current implementation of mult="random":
> >>> 
> >>> 
> >>> require(data.table)
> >>> rawData <- data.table(fundID = 1:1e5,
> >>>                      Year   = rep(1:10, times=1e4),
> >>>                      key    = "Year")
> >>> #Let's have 10000 runs; in each run we want to draw a fund with a
> >>> #year that is set dynamically
> >>> intJoin <- J(sample(1:10, size=10000, replace=TRUE))
> >>> 
> >>> 
> >>> #Best solution I have come up with so far using the current options
> >>> #in data.table.
> >>> #Is there one that can beat mult="random" and is easy for the user
> >>> #to implement?
> >>> foo1 <- function(n, intJoin, rawData) {
> >>>    x <- integer(n)
> >>>    for (r in seq_len(nrow(intJoin))) {
> >>>      x[r] <- sample(rawData[intJoin[r], which=TRUE], size=1)
> >>>    }
> >>>    return(rawData[x])
> >>> }
> >>> system.time(finalData <- foo1(10000, intJoin, rawData))
> >>> #    user  system elapsed 
> >>> #  43.827   0.000  44.232
> >>> #Check that it does what it should: match random entities to the
> >>> #exact year in intJoin
> >>> cbind(finalData, intJoin)
> >>> #       fundID Year V1
> >>> #  [1,]  46556    6  6
> >>> #  [2,]  77642    2  2
> >>> #  [3,]  17325    5  5
> >>> #  [4,]  36617    7  7
> >>> #  [5,]  90697    7  7
> >>> #  [6,]   4536    6  6
> >>> #  [7,]  22273    3  3
> >>> #  [8,]  46825    5  5
> >>> #  [9,]  65788    8  8
> >>> # [10,]  14153    3  3
> >>> 
> >>> 
> >>> #My implementation of mult="random"
> >>> system.time(finalData <- rawData[intJoin, mult="random"])  
> >>> #   user  system elapsed 
> >>> #  0.324   0.016   0.337
> >>> #Pretty fast and easy to understand
> >>> #Check that it does what it should: match random entities to the
> >>> #exact year in intJoin
> >>> cbind(finalData, intJoin)
> >>> #       Year fundID V1
> >>> #  [1,]    6  39626  6
> >>> #  [2,]    2  98552  2
> >>> #  [3,]    5  85425  5
> >>> #  [4,]    7  24637  7
> >>> #  [5,]    7  74797  7
> >>> #  [6,]    6  87626  6
> >>> #  [7,]    3  88973  3
> >>> #  [8,]    5  60335  5
> >>> #  [9,]    8  62298  8
> >>> # [10,]    3  23283  3
> >>> 
> >>> 
> >>> If you want to try it out yourself: Just call
> >>> 
> >>> 
> >>> fixInNamespace("[.data.table", pos="package:data.table")
> >>> 
> >>> 
> >>> and change the following lines in the editor (this applies to
> >>> data.table 1.7.7):
> >>> 
> >>> OLD LINE:     if (!mult %in% c("first", "last", "all"))
> >>>                 stop("mult argument can only be 'first','last' or 'all'")
> >>> NEW LINE:     if (!mult %in% c("first","last","all", "random"))
> >>>                 stop("mult argument can only be 'first','last', 'all', or 'random'")
> >>> 
> >>> 
> >>> and 
> >>> 
> >>> 
> >>> OLD LINES: else {
> >>>                irows = if (mult == "first") 
> >>>                  idx.start
> >>>                else idx.end
> >>>                lengths = rep(1L, length(irows))
> >>>            }
> >>> 
> >>> 
> >>> NEW LINES:  } else if (mult=="first") { 
> >>>              irows = idx.start
> >>>              lengths=rep(1L,length(irows))
> >>>            } else if (mult=="last") {
> >>>              irows = idx.end
> >>>              lengths=rep(1L,length(irows))
> >>>            } else {
> >>>              irows = mapply(function(x1, x2) {sample(x1:x2, size=1)},
> >>>                             idx.start, idx.end)
> >>>              lengths = rep(1L,length(irows))
> >>>            }
> >>> 
> >>> However, I don't know what's going on in the line
> >>> .Call("binarysearch", i, x, as.integer(leftcols - 1),
> >>>       as.integer(rightcols - 1), haskey(i), roll, rolltolast,
> >>>       idx.start, idx.end, PACKAGE = "data.table")
> >>> 
> >>> 
> >>> I figured out that idx.start and idx.end are changed by this
> >>> function call, and I guess that at this point in the function
> >>> idx.start and idx.end should always be of the same length and both
> >>> contain only integer values that represent rows of x, but here I'm
> >>> not 100% sure. So maybe additional checks are needed in the else
> >>> clause where the mapply function is called.
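
One check that may be worth adding concerns sample(x1:x2, size=1) when x1 == x2: base R's sample() treats a single number n >= 1 as the range 1:n, so a key that matches exactly one row could yield any row from 1 up to that row. Below is a sketch of a vectorised alternative in base R. It assumes idx.start and idx.end are equal-length integer vectors with idx.start <= idx.end everywhere (what binarysearch writes for unmatched keys is not verified here); the function name random_in_ranges is made up.

```r
# Draw one uniformly random integer from each closed range
# [idx.start[k], idx.end[k]], without calling sample() per element.
random_in_ranges <- function(idx.start, idx.end) {
  stopifnot(length(idx.start) == length(idx.end),
            all(idx.end >= idx.start))
  width <- idx.end - idx.start + 1L
  # runif() never returns exactly 0 or 1, so the offset lies in 0..(width - 1).
  offset <- as.integer(floor(runif(length(width)) * width))
  idx.start + offset
}

set.seed(123)
irows <- random_in_ranges(c(1L, 5L, 10L), c(3L, 5L, 20L))
```

The second range contains a single row (5 to 5), and the result for it is always 5, which is the case the mapply version can get wrong.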
> >>> 
> >>> 
> >>> So let me know what you think. I will join the project regardless
> >>> of that particular issue and try to help, but I guess I should start
> >>> with simple things. So if any help is needed with checking
> >>> documentation or things like that, just let me know and I'll try my
> >>> best!
> >>> 
> >>> Christoph
> >>> 
> >>> On Fri, Jan 6, 2012 at 1:52 PM, Chris Neff <[email protected]> wrote:
> >>>        That isn't doing quite what he does.  I don't know what you
> >>>        expected
> >>> 
> >>>        sample(dt, size=1)
> >>> 
> >>>        to do, but it seems to essentially do this:
> >>> 
> >>>        dt[sample(1:ncol(dt),size=1),]
> >>> 
> >>>        It picks a random column number and then returns the row with
> >>>        that number instead.
> >>>        Try it for yourself:
> >>> 
> >>>        dt=data.table(x=1:10,y=1:10,z=1:10)
> >>>        sample(dt, size=1)
> >>> 
> >>>        The only rows you will get are 1,1,1 2,2,2 and 3,3,3.  The
> >>>        caveat, as usual, is that I'm on 1.7.1 until my crashing bug is
> >>>        fixed, so apologies if this works properly in later versions.
> >>> 
> >>>        Note that this diverges from what sample(df, size=1) does,
> >>>        which is to pick a random column and return that whole column.
> >>> 
> >>>        What he really wants is to pick a random row from each subset
> >>>        (I think). None of your examples do that and I can't think of a
> >>>        simpler way than what he suggests.
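
The difference between sampling columns and sampling rows can be seen in base R on a plain data.frame. This sketch only illustrates the data.frame case mentioned above; it makes no claim about how any particular data.table version handles sample(dt, size=1). The data and variable names are invented.

```r
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

set.seed(42)

# sample() on a data.frame samples its elements, i.e. its columns:
# the result is a data.frame holding one randomly chosen column.
one_col <- sample(df, size = 1)

# To get a random row instead, sample a row number and subset by it.
one_row <- df[sample.int(nrow(df), 1L), ]
```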
> >>> 
> >>>        On 6 January 2012 03:34, Matthew Dowle
> >>>        <[email protected]> wrote:
> >>>> Very keen for direct contributions in that way, happy to help you
> >>>> with svn etc, and for you to join the project.
> >>>> 
> >>>> In this particular example, how about :
> >>>> 
> >>>>   rawData[sample(rawData[J("eu"), which=TRUE],size=1)]
> >>>> 
> >>>> This solves the inefficiency of the 1st step; i.e.,
> >>>>   intDT <- rawData[J("eu"), mult="all"]
> >>>> which copies a subset of all the columns, whilst retaining
> >>>> flexibility for the user, so the user can easily sample 2 rows, or
> >>>> use any other R method to select a random subset.
> >>>> 
> >>>> Because of potential scoping conflicts (say a column was called
> >>>> "rawData", i.e. the same name as the table), to be more robust :
> >>>> 
> >>>> x = sample(rawData[J("eu"), which=TRUE],size=1)
> >>>> rawData[x]
> >>>> 
> >>>> This is slightly different because when i is a single name (x in
> >>>> this case), data.table knows the caller must mean the x in calling
> >>>> scope, not the column called "x" (if any).  Is two steps like this
> >>>> ok?  I'm guessing it was really the inefficiency that was the
> >>>> motivation?
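
A caveat for the two-step approach: if the join matches exactly one row, sample() receives a single number n and draws from 1:n rather than returning n. The ?sample help page shows a small wrapper for this situation; a sketch follows (the name resample comes from that help page, the row numbers are invented).

```r
# Always samples from the elements of x, even when x has length one.
resample <- function(x, ...) x[sample.int(length(x), ...)]

matched <- 17L                          # pretend the join matched only row 17
safe <- resample(matched, size = 1L)    # always 17
# By contrast, sample(matched, size = 1L) would draw from 1:17.

set.seed(7)
several <- replicate(20, resample(c(3L, 9L), size = 1L))  # only ever 3 or 9
```

In the example from the thread this would read resample(rawData[J("eu"), which=TRUE], size=1).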
> >>>> 
> >>>> Matthew
> >>>> 
> >>>> On Fri, 2012-01-06 at 00:20 +0100, Christoph Jäckel wrote:
> >>>>> Hi all,
> >>>>> 
> >>>>> 
> >>>>> I run a Monte Carlo simulation on a data.table and currently do
> >>>>> that with a loop: on every run, I choose a subset of rows subject
> >>>>> to certain criteria, and from those rows I take a random element.
> >>>>> Currently, I do the following: let's say I have funds from two
> >>>>> regions ("eu" and "us") and I want to choose a random fund from
> >>>>> "eu" (could be "us" in the next run and a different region in the
> >>>>> third):
> >>>>> 
> >>>>> 
> >>>>> library(data.table)
> >>>>> rawData <- data.table(fundID  = letters,
> >>>>>                      compGeo = rep(c("us", "eu"), each=13))
> >>>>> setkey(rawData, "compGeo")
> >>>>> intDT <- rawData[J("eu"), mult="all"]
> >>>>> intDT[sample.int(nrow(intDT), size=1)]
> >>>>> 
> >>>>> 
> >>>>> So my idea is to just give the user the option mult="random",
> >>>>> which does this in one step. What do you think about that feature
> >>>>> request?
> >>>>> 
> >>>>> 
> >>>>> With respect to the implementation: I changed a few lines in the
> >>>>> function '[.data.table' and got this to run on my local data.table
> >>>>> version, so I guess I could implement it (as far as I can see, one
> >>>>> just needs to change some R code). However, I haven't done
> >>>>> extensive testing and I'm not an expert on shared projects and
> >>>>> Subversion (never did that actually), so I guess I would need some
> >>>>> help to start with and the confirmation that I couldn't break
> >>>>> anything ;-)
> >>>>> 
> >>>>> 
> >>>>> Christoph
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> datatable-help mailing list
> >>>>> [email protected]
> >>>>> 
> >>>        
> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>>> 
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> > 
> > 
> 


