I have some code that can potentially produce a huge number of large-ish R data frames, each with a different number of rows. All the data frames together will be far too big to keep in R's memory, but we'll assume a single one is manageable; it's just that when there are a million of them the machine might start to burn up.
However, I might, for example, want to compute some averages over the elements of the data frames, or sample ten of them at random and do some plots. What I need is rapid random access to data stored in external files. Here are some ideas I've had:

* Store all the data in an HDF5 file. Problem: the current HDF package for R reads the whole file in at once.

* Store the data in some other custom binary format with an index for rapid access to the N-th element. Problems: feels like reinventing HDF, cross-platform issues, etc.

* Store the data as a number of .RData files in a directory, so that to get the N-th element I just attach(paste("foo/A-", n, ".RData", sep="")), give or take a parameter or two (see the sketch in the P.S. below).

* Use a database. Seems a bit heavyweight, but maybe RSQLite would work and keep it local.

What I'm currently doing is keeping the interface OO enough that I can, in theory, implement all of the above. At the moment I have an implementation that keeps them all in R's memory as a list of data frames, which is fine for small test cases, but things are going to get big shortly.

Any other ideas or hints are welcome.

thanks
Barry
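
P.S. In case it makes the .RData option clearer, here is an untested sketch of what I have in mind. The function names, the file layout and the "A-%06d" naming convention are all just placeholders:

  ## One data frame per file under 'dir'; the integer n is the index.
  dfStore <- function(dir) {
    if (!file.exists(dir)) dir.create(dir, recursive = TRUE)
    list(dir = dir)
  }

  ## write the n-th data frame
  putDF <- function(store, n, df) {
    save(df, file = file.path(store$dir, sprintf("A-%06d.RData", n)))
  }

  ## read the n-th data frame back; load() into a scratch environment
  ## rather than attach(), so nothing lands on the search path
  getDF <- function(store, n) {
    e <- new.env()
    load(file.path(store$dir, sprintf("A-%06d.RData", n)), envir = e)
    get("df", envir = e)   # save() stored the object under the name 'df'
  }

  ## how many frames are in the store
  nDF <- function(store) length(dir(store$dir, pattern = "^A-.*\\.RData$"))

With that, sampling ten frames for plotting is just something like
for (i in sample(nDF(s), 10)) plot(getDF(s, i)), and averages can be
accumulated one frame at a time without ever holding more than one in memory.

The RSQLite variant would look roughly like the following (again untested;
"frames.sqlite" and the "df_%06d" table naming are placeholders, and n and df
are as above). One caveat I'm aware of is that factor columns come back as
plain character:

  library(RSQLite)
  con <- dbConnect(SQLite(), dbname = "frames.sqlite")
  dbWriteTable(con, sprintf("df_%06d", n), df)    # write the n-th frame
  x <- dbReadTable(con, sprintf("df_%06d", n))    # read it back
  dbDisconnect(con)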