[R] Fastest way to calculate quantile in large data.table

Camilo Mora Wed, 04 Feb 2015 22:02:19 -0800

Hi everyone,

I have a data.table with 200 columns and few million rows and am trying to 
calculate the .1 and .9 quantiles for each row across all 200 columns.


I have found different ways to do this, all with different performances. The 
examples I used are below. I wonder whether there is a faster way to do this?

Thanks and best,

Camilo


library(data.table)
v <- data.table(x=runif(10000),x2 = runif(10000),  
x3=runif(10000),x4=runif(10000))
v[,Names:=rownames(v)]

#test 1 using .SD but not .SDcols
Sys.time()->StartTEST1
v[,  as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names]
Sys.time()->EndTEST1

#test 2 using .SD and .SDcols
Sys.time()->StartTEST2
v[,  as.list(quantile(.SD,c(.1,.90),na.rm=TRUE)), by=Names,.SDcols=1:4]
Sys.time()->EndTEST2

#test 3 using colnames directly. This is the fastest I found
Sys.time()->StartTEST3
v[,  as.list(quantile(c(x ,       x2,        x3,        x4 
),c(.1,.90),na.rm=TRUE)), by=Names]
Sys.time()->EndTEST3

# melting the database and doing quantile by summary. This is the second 
fastest, which is ironic given that the database has to be melted first
library(reshape2)
Sys.time()->StartTEST4
vs<-melt(v)
vs[,  as.list(quantile(value,c(.1,.90),na.rm=TRUE)), by=Names]
Sys.time()->EndTEST4


EndTEST1-StartTEST1
EndTEST2-StartTEST2
EndTEST3-StartTEST3
EndTEST4-StartTEST4

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Fastest way to calculate quantile in large data.table

Reply via email to