Hello Andy,

Thanks for your examples. I rewrote everything to use matrices, lapply/sapply, and rbind calls instead of for loops and appends, and it really helped. Reading the files one by one and concatenating them is now even faster than concatenating on disk; that 8 MB table is read in 3.5 seconds.

Tomas

rbind is vectorized so you are using it (way) suboptimally.



Here's an example:



## Create a 500 x 100 data matrix.
x <- matrix(rnorm(5e4), 500, 100)
## Generate 50 filenames.
fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
## Write the data to files 50 times.
for (f in fname) write(t(x), file=f, ncol=ncol(x))

## Read the files into a list of data frames.
system.time(datList <- lapply(fname, read.table, header=FALSE),
            gcFirst=TRUE)
[1] 11.91 0.05 12.33 NA NA


## Specify colClasses to speed up.
system.time(datList <- lapply(fname, read.table,
                              colClasses=rep("numeric", 100)),
            gcFirst=TRUE)
[1] 10.69 0.07 10.79 NA NA


## Stack them together.
system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
[1] 5.34 0.09 5.45 NA NA



## Use matrices instead of data frames.
system.time(datList <- lapply(fname,
    function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
Read 50000 items
...
Read 50000 items
[1] 9.49 0.08 15.06 NA NA


system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
[1] 0.09 0.03 0.12 NA NA


## Clean up the files.
unlink(fname)
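
For contrast with the single do.call("rbind", ...) calls above, here is a small sketch of what "appending in a loop" looks like next to one vectorized call. It is not part of the original timings: the sizes and the names pieces, grown and stacked are made up for illustration, and in-memory matrices stand in for the files.

```r
## Illustrative only: 50 small numeric matrices standing in for the files.
set.seed(1)
pieces <- lapply(1:50, function(i) matrix(rnorm(500), 5, 100))

## Growing the result in a loop: each iteration builds a new, larger
## matrix and copies everything accumulated so far into it.
grown <- NULL
for (p in pieces) grown <- rbind(grown, p)

## One vectorized call: rbind() receives all 50 matrices at once.
stacked <- do.call("rbind", pieces)

## Both approaches produce the same matrix.
identical(grown, stacked)
dim(stacked)
```

The results are the same either way; the difference is only in how much copying happens along the way, which is what grows painful as the number of pieces increases.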



A couple of points:

- Specifying colClasses usually makes read.table() quite a bit faster,
  even though it's only marginally faster here (10.69 vs. 11.91
  seconds). Search the list archive for examples.


- If your data files are all numeric (as in this example), storing
  them in matrices will be much more efficient. Note the difference
  between rbind()ing the 50 data frames and the 50 matrices (5.34
  seconds vs. 0.09!). rbind.data.frame() needs to ensure that the
  resulting data frame has unique row names (a requirement for a
  legit data frame), and that's probably taking a big chunk of the
  time.
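
The row-name requirement mentioned in the second point can be seen on a toy example. This is only a sketch with hypothetical data (d1, d2, both and m are made-up names, not objects from the example above), and it shows the behaviour, not where the time actually goes.

```r
## Hypothetical toy data, not the files from the example above.
d1 <- data.frame(a = c(1, 2), b = c(3, 4))
d2 <- data.frame(a = c(5, 6), b = c(7, 8))

## rbind() on data frames dispatches to rbind.data.frame(), which has
## to return a data frame whose row names are all distinct.
both <- rbind(d1, d2)
rownames(both)
anyDuplicated(rownames(both))   # 0 means no duplicated row names

## Matrices carry no such requirement: row names may simply be absent.
m <- rbind(as.matrix(d1), as.matrix(d2))
is.null(rownames(m))
dim(m)
```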


Andy





______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
