On Thu, 9 Aug 2007, Charles C. Berry wrote:

> On Thu, 9 Aug 2007, Michael Cassin wrote:
>
>> I really appreciate the advice and this database solution will be useful to
>> me for other problems, but in this case I need to address the specific
>> problem of scan and read.* using so much memory.
>>
>> Is this expected behaviour?

Yes, and documented in the 'R Internals' manual.  That is basic reading for
people wishing to comment on efficiency issues in R.

>> Can the memory usage be explained, and can it be
>> made more efficient?  For what it's worth, I'd be glad to try to help if the
>> code for scan is considered to be worth reviewing.
>
> Mike,
>
> This does not seem to be an issue with scan() per se.
>
> Notice the difference in size of big2, big3, and bigThree here:
>
>> big2 <- rep(letters,length=1e6)
>> object.size(big2)/1e6
> [1] 4.000856
>> big3 <- paste(big2,big2,sep='')
>> object.size(big3)/1e6
> [1] 36.00002

On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
Character strings are R objects, but in some functions such as rep (and
scan for up to 10,000 distinct strings) the objects can be shared.  More
string objects will be shared in 2.6.0 (but factors are designed to be
efficient at storing character vectors with few values).

On a 64-bit computer the overhead is usually double.  So I would expect
just over 56 bytes/string for distinct short strings (and that is what
big3 gives).

But 56Mb is really not very much (tiny on a 64-bit computer), and 1
million items is a lot.

[...]

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
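
For anyone wanting to check the per-string figures discussed in this message on their own machine, a minimal sketch along the following lines may help. It repeats the quoted big2/big3 comparison and adds a factor for contrast. The names bigf and sizes_mb are introduced here purely for illustration. No expected output is shown, since the numbers depend on the R version and on whether the build is 32- or 64-bit. In particular, versions that share more string objects (as the message says 2.6.0 will) may report a much smaller figure for big3.

```r
# Sketch only: reported sizes vary with R version and 32- vs 64-bit builds.
big2 <- rep(letters, length.out = 1e6)   # 1e6 elements, only 26 distinct strings
big3 <- paste(big2, big2, sep = "")      # pasted pairs, as in the quoted example
bigf <- factor(big2)                     # integer codes plus 26 levels (illustrative name)

# Total size of each object in millions of bytes
sizes_mb <- c(
  big2 = as.numeric(object.size(big2)),
  big3 = as.numeric(object.size(big3)),
  bigf = as.numeric(object.size(bigf))
) / 1e6
print(sizes_mb)

# Approximate bytes per element, to compare with the per-string
# overheads described in the message
print(sizes_mb * 1e6 / length(big2))
```

Dividing by the vector length gives a rough bytes-per-element figure. For big2 this should be close to the pointer size, because the 26 underlying string objects are shared across the whole vector.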