Hi Vincent, I linked to the SO post to get an idea of how to use .I. I dint mean to say that it's exactly what you're looking for. which.max returns the first index of the max value (even if there are multiple identical max values). So, it might make sense that the results are not identical.
Looking at what you're trying to do with your code, these are the two ways I'd approach it. I can't really tell which one's faster for your dataset. But it'd be great if you could post your benchmarks on these two methods. Method 1: # .I[.N] will get the running row number for every group's last index ddt[ddt[, .I[.N], by=by]$V1] Method 2: # since you've already keyed your data.table, take adv. of the mult="last" option: ddt[J(unique(cyl)), mult="last"] Arun From: Vincent Carey Vincent Carey Reply: Vincent Carey [email protected] Date: March 10, 2014 at 3:04:31 PM To: Arunkumar Srinivasan [email protected] Subject: Re: [datatable-help] checking an approach to filtering rows in a data.table Thanks Arun, I like your approach, and I had looked at the possibility, although I had not seen the SO posting, which is indeed relevant. The .I solution seemed underperformant relative to expectations, particularly for millions of rows. Here are some timings for 2-300k rows. > litd = disc_allc200k_dt[1:200000,] > microbenchmark(rowsWmaxVinG( litd, "score", "snp" )) Unit: milliseconds expr min lq median uq rowsWmaxVinG(litd, "score", "snp") 86.83909 87.45823 88.16629 89.26693 max neval 440.0069 100 > microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ]) Unit: milliseconds expr min lq median uq litd[litd[, .I[which.max(score)], snp]$V1] 241.3669 252.2612 279.342 602.113 max neval 657.7055 100 > litd = disc_allc200k_dt[1:300000,] > microbenchmark(rowsWmaxVinG( litd, "score", "snp" )) Unit: milliseconds expr min lq median uq rowsWmaxVinG(litd, "score", "snp") 119.6237 120.9789 121.6302 122.7155 max neval 489.1918 100 > microbenchmark(litd[litd[, .I[which.max(score)], snp]$V1 ]) Unit: milliseconds expr min lq median uq litd[litd[, .I[which.max(score)], snp]$V1] 324.7394 347.5972 684.6746 693.456 max neval 1607.186 100 The two approaches do not agree in terms of values returned when there are ties in the score within groups. But otherwise the .N based approach seems to work. I would like to verify that setkeyv accomplishes the sorting necessary for the .N based approach to be valid. > sessionInfo() R Under development (unstable) (2014-02-02 r64913) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices datasets utils tools methods [8] base other attached packages: [1] microbenchmark_1.3-0 data.table_1.9.2 weaver_1.29.1 [4] codetools_0.2-8 digest_0.6.4 BiocInstaller_1.13.3 loaded via a namespace (and not attached): [1] Rcpp_0.11.0 plyr_1.8.1 reshape2_1.2.2 stringr_0.6.2 On Mon, Mar 10, 2014 at 9:08 AM, Arunkumar Srinivasan <[email protected]> wrote: Hi Vincent, Have you checked out the special variable `.I`? Have a look at `?data.table`. This SO post may also be relevant: http://stackoverflow.com/questions/21198937/subset-data-table-using-min-condition/21199009#21199009 Arun From: Vincent Carey Vincent Carey Reply: Vincent Carey [email protected] Date: March 10, 2014 at 4:33:27 AM To: [email protected] [email protected] Subject: [datatable-help] checking an approach to filtering rows in a data.table I have looked around for code on row filtering with data.table, but have not found anything addressing this use case. I want to retrieve the rows satisfying a certain condition within groups, in this case having the maximum value for a specific variable. The following seems to work, but I wonder if there is a more direct approach. rowsWmaxVinG = function(dt, V, by) { # # filter dt to the rows possessing max value of # variable V within groups formed using by # # example: data(mtcars) # ddt = data.table(mtcars) #> rowsWmaxVinG( ddt, by="cyl", V="mpg") # mpg cyl disp hp drat wt qsec vs am gear carb #1: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #2: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #3: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 # setkeyv(dt, c(by, V)) # sort within groups dt[ cumsum(dt[, .N, by=by]$N), ] # take last row from each group } _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
