Hi Arun, Frank & Steve,
Thanks for responding to my post.
I did the 'microbenchmark' and 'debugonce(....)' using the same dataset 'dat2'.
f1 <- function (dataFrame) {
dataFrame[unlist(with(dataFrame, tapply(Time, list(Date), FUN =
function(x) x == max(x)))), ]
}
f2 <- function (dataFrame) {
dataFrame[cumsum(with(dataFrame, tapply(Time, list(Date), FUN =
which.max))), ]
}
isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
f3 <- function(dataFrame) {
dataFrame[ isLastInRun(dataFrame$Date), ]
}
f4<- function(dataFrame){
dataFrame[as.logical(with(dataFrame,ave(Time,Date,FUN=function(x)
x==max(x)))),]
}
f5<- function(dataFrame){
dataFrame[cumsum(rle(dataFrame[,1])$lengths),]
}
library(data.table)
dt1 <- data.table(dat2, key=c('Date', 'Time'))
f6<- function(dataTable){
dataTable[, .SD[.N], by='Date']}
f7<- function(dataTable){
dataTable[dataTable[, .I[.N], by='Date']$V1]
}
f8<- function(dataTable){
dataTable[J(unique(Date)),,mult='last']
}
f9<- function(dataTable){
dataTable[dataTable[, .I[.N], by='Date']$V1]
}
library(microbenchmark)
microbenchmark(f1(dat2),
f2(dat2),
f3(dat2),
f4(dat2),
f5(dat2),
f6(dt1),
f7(dt1),
f8(dt1),
f9(dt1),
times=100)
#Unit: milliseconds
# expr min lq median uq max neval
# f1(dat2) 2046.59313 2318.57397 2414.21020 2533.28214 2842.9609 100
# f2(dat2) 940.97742 1000.56395 1027.53096 1100.67961 1705.4570 100
# f3(dat2) 315.06253 325.02696 341.21953 364.85656 533.9347 100
# f4(dat2) 804.89703 858.14888 899.55182 964.39989 1129.9311 100
# f5(dat2) 149.55682 153.67846 167.23934 176.56643 292.3134 100
# f6(dt1) 46665.61046 48234.78637 48818.88141 49366.46810 51112.7930 100
###############################slowest
# f7(dt1) 71.02789 76.97008 85.09989 97.82982 387.3801 100
# f8(dt1) 77.74961 78.94773 80.00620 89.00892 205.2492 100
# f9(dt1) 71.76817 76.40184 79.89194 100.57348 282.8359 100
#Comparing the fastest among data.table with f5()
system.time(res8<- f8(dt1))
# user system elapsed
# 0.08 0.00 0.08
system.time(res5<- f5(dat2))
# user system elapsed
# 0.156 0.000 0.153
res8New<- as.data.frame(res8)
row.names(res8New)<- row.names(res5)
attr(res8New,"row.names")<- attr(res5,"row.names")
identical(res8New,res5)
#[1] TRUE
#During debugging: the step that took long time to execute is: (Same as Frank
reported)
debugonce(data.table:::`[.data.table`)
dt1[, .SD[.N], by='Date']
debug: ans = .Call(Cdogroups, x, xcols, groups, grpcols, jiscols, grporder,
o__, f__, len__, jsub, SDenv, cols, newnames, verbose)
#I use Linux mint 15.
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.8 microbenchmark_1.3-0 stringr_0.6.2
[4] reshape2_1.2.2
loaded via a namespace (and not attached):
[1] plyr_1.8 tcltk_3.0.1 tools_3.0.1
A.K.
Steve,
Thank you.
arun,
Could you run it with `microbenchmark` instead of system.time (with times = 100
or so) and paste the results here?
Also, maybe you could use debugonce(data.table:::`[.data.table`) and then run
x[, .SD[.N], by='Date']
to go step by step to find out the line that causes the lag, perhaps?
Arun
________________________________
From: Arunkumar Srinivasan <[email protected]>
To: arun <[email protected]>
Sent: Friday, August 16, 2013 2:27 AM
Subject: Re: [datatable-help] Slow execution: Extracting last value in each
group
Sorry, but I'm not sure what your question is here. There seems to be different
timings between you and Steve. You want to get it verified as to which one is
true? On my system, Steve's takes 0.003 seconds.
However, a *faster* version than Steve's solution (on bigger data) would be:
x[x[, .I[.N], by='Date']$V1]
Arun
On Friday, August 16, 2013 at 6:52 AM, arun wrote:
HI,
>This is a follow up from a post in R-help mailing list.
>(http://r.789695.n4.nabble.com/How-to-extract-last-value-in-each-group-td4673787.html).
>
>
>
>
>
>In short, I tried the below using data.table(), but found to be slower than
>some of the other methods. Steve Lianoglou also tried the same and got it
>much faster (system.time()~ 0.070 vs. ~40 ).
>
>
>###data
>
>
>dat1<- structure(list(Date = c("06/01/2010", "06/01/2010", "06/01/2010",
>"06/01/2010", "06/02/2010", "06/02/2010", "06/02/2010", "06/02/2010",
>"06/02/2010", "06/02/2010", "06/02/2010"), Time = c(1358L, 1359L,
>1400L, 1700L, 331L, 332L, 334L, 335L, 336L, 337L, 338L), O = c(136.4,
>136.4, 136.45, 136.55, 136.55, 136.7, 136.75, 136.8, 136.8, 136.75,
>136.8), H = c(136.4, 136.5, 136.55, 136.55, 136.7, 136.7, 136.75,
>136.8, 136.8, 136.8, 136.8), L = c(136.35, 136.35, 136.35, 136.55,
>136.5, 136.65, 136.75, 136.8, 136.8, 136.75, 136.8), C = c(136.35,
>136.5, 136.4, 136.55, 136.7, 136.65, 136.75, 136.8, 136.8, 136.8,
>136.8), U = c(2L, 9L, 8L, 1L, 36L, 3L, 1L, 4L, 8L, 1L, 3L), D = c(12L,
>6L, 7L, 0L, 6L, 1L, 0L, 0L, 0L, 2L, 0L)), .Names = c("Date",
>"Time", "O", "H", "L", "C", "U", "D"), class = "data.frame", row.names = c(NA,
>-11L))
>
>
>
>
>indx<- rep(1:nrow(dat1),1e5)
>dat2<- dat1[indx,]
>dat2[-c(1:11),1]<-format(rep(seq(as.Date("1080-01-01"),by=1,length.out=99999),each=11),"%m/%d/%Y")
> dat2<- dat2[order(dat2[,1],dat2[,2]),]
>row.names(dat2)<-1:nrow(dat2)
>
>
>
>
>
>
>#Some speed comparisons (more in the link):
>system.time(res1<-dat2[c(diff(as.numeric(as.factor(dat2$Date))),1)>0,])
># user system elapsed
> # 0.528 0.012 0.540
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
># user system elapsed
> # 0.156 0.000 0.155
>
>
>
>
>library(data.table)
>system.time({
>dt1 <- data.table(dat2, key=c('Date', 'Time'))
> ans <- dt1[, .SD[.N], by='Date']})
>
>
> # user system elapsed
> #39.860 0.020 39.952 #############slower than many other methods
>ans1<- as.data.frame(ans)
> row.names(ans1)<- row.names(res7)
> attr(ans1,"row.names")<- attr(res7,"row.names")
> identical(ans1,res7)
>#[1] TRUE
>
>
>
>
>
>
>
>
>Steve Lianoglou reply is below:
>############################
>
>
>
>
>Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
>specs to your machine):
>
>
>R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.064 0.009 0.073 ###########################
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.148 0.016 0.165
>
>
>On one of our compute server running who knows what processor on some
>version of linux, but shouldn't really matter as we're talking
>relative time to each other here:
>
>
>R> system.time(ans <- dt1[, .SD[.N], by='Date'])
> user system elapsed
> 0.160 0.012 0.170
>
>
>R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
> user system elapsed
> 0.292 0.004 0.294
>##############################################
>
>
>My sessionInfo#######
>sessionInfo()
>R version 3.0.1 (2013-05-16)
>Platform: x86_64-unknown-linux-gnu (64-bit) (linux mint 15)
>
>
>locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
>[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
>attached base packages:
>[1] stats graphics grDevices utils datasets methods base
>
>
>other attached packages:
>[1] data.table_1.8.8 stringr_0.6.2 reshape2_1.2.2
>
>
>loaded via a namespace (and not attached):
>[1] plyr_1.8 tools_3.0.1
>
>
>CPU ####################
>I use Dell XPS L502X
> * Processor 2nd Gen Core i7 Intel i7-2630QM / 2 GHz ( 2.9 GHz ) ( Quad-Core )
> * Memory 6 GB / 8 GB (max)
> * Hard Drive 640 GB - Serial ATA-300 - 7200 rpm
>
>
>Any help will be appreciated.
>Thanks.
>A.K.
>_______________________________________________
>datatable-help mailing list
>[email protected]
>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help