For me with ff - on a 3 GB notebook - 3e6 x 100 works out of the box even without
compression: the doubles consume 2.2 GB on disk, but the R process stays under
100 MB; the rest of the RAM is used by the file-system cache.
If you are on Windows, you can create the ffdf files in a compressed folder.
For the random doubles this reduces the size on disk to 230 MB, which should
even work on a 1 GB notebook.
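A minimal sketch of how I would direct the ffdf files into such a folder (the
path is just an example, and you have to mark the folder as compressed in the
Windows Explorer yourself - ff will not do that for you):

library(ff)
# assumed example path; mark this folder as "compressed" in Windows beforehand
dir.create("D:/ffcompressed", showWarnings=FALSE)
options(fftempdir="D:/ffcompressed")  # new ff/ffdf files are created here by default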
BTW: the most compressed datatype (vmode) that can handle NAs is "logical": it
consumes 2 bit per tri-bool (TRUE/FALSE/NA). The next most compressed is "byte",
covering c(NA, -127:127) and consuming - as its name says - one byte per element
on disk and in the file-system cache.
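For illustration (the object names are just examples; sizes are as documented
for ff's packed vmodes):

library(ff)
lg <- ff(vmode="logical", length=1e6)  # 2 bit per element, stores TRUE/FALSE/NA
bt <- ff(vmode="byte", length=1e6)     # 1 byte per element, covers c(NA, -127:127)
vmode(lg); vmode(bt)                   # query the storage modes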
The code below should give an idea of how to do pairwise statistics on columns,
where each pair easily fits into RAM. In the real world you would not create
the data but import it with read.csv.ffdf (expect reading your csv file to take
longer than reading/writing the ffdf).
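A minimal import sketch (the file name and chunk size are placeholders, not
taken from your data):

library(ff)
# read the csv in chunks so it never has to fit into RAM at once
d <- read.csv.ffdf(file="mydata.csv", header=TRUE, next.rows=1e5)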
Regards
Jens Oehlschlägel
library(ff)
k <- 100   # number of columns
n <- 3e6   # number of rows
# creating an ffdf dataframe of the required size (columns live on disk)
l <- vector("list", k)
for (i in 1:k)
  l[[i]] <- ff(vmode="double", length=n, update=FALSE)
names(l) <- paste("c", 1:k, sep="")
d <- do.call("ffdf", l)
# writing the k=100 columns of n random doubles took ~90 sec
system.time(
  for (i in 1:k){
    cat(i, " ")
    print(system.time(d[, i] <- rnorm(n))["elapsed"])
  }
)["elapsed"]
# result matrix: the lower triangle will hold the pairwise correlations
m <- matrix(as.double(NA), k, k)
# pairwise correlating one column against all others takes ~17.5 sec
# pairwise correlating all combinations takes 15 min
system.time(
  for (i in 2:k){
    cat(i, " ")
    print(system.time({
      x <- d[[i]][]                    # pull column i into RAM once
      for (j in 1:(i-1)){
        m[i, j] <- cor(x, d[[j]][])    # read column j and correlate in RAM
      }
    })["elapsed"])
  }
)["elapsed"]