Thanks for all the comments. The artificial dataset is as representative of my 440MB file as I could design.

I did my best to reduce the complexity of my problem to minimal reproducible code, as suggested in the posting guidelines. Having searched the archives, I was happy to find that the topic had been covered, where Prof Ripley suggested that the I/O manuals gave some advice; however, I was unable to get anywhere with that advice. I spent six hours preparing my post to R-help. Sorry not to have read the 'R Internals' manual.

I just wanted to know whether I could use scan() more efficiently. My hurdle turns out to have nothing to do with calling scan() efficiently, and I suspect the same is true for the originator of this memory experiment thread. It is the overhead of storing short strings, as Charles identified and Brian explained. I appreciate the investigation and clarification you have both made. 56 bytes of overhead for a 2-character string seems extreme to me, but I'm not complaining. I really like R, and since it is free, I accept that it is what it is.
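For what it's worth, here is the back-of-envelope check that convinced me, along the lines of Charles's example. This is a minimal sketch only: the element count n is illustrative, and exact sizes vary with R version, platform, and word size.

    # One million *distinct* short strings: each string is a separate R
    # object, so the per-object overhead is paid once per element.
    n <- 1e6
    s <- as.character(seq_len(n))
    object.size(s)/n    # on the order of 60+ bytes per string on 64-bit

    # The same million elements with only 26 distinct values, stored as a
    # factor: one integer code per element plus a small table of levels.
    f <- factor(rep(letters, length.out = n))
    object.size(f)/n    # about 4 bytes per element

A factor recovers nearly all of that overhead when values repeat, but for a column of mostly-distinct short strings the per-string cost stands, and that is what bites with arbitrary CSV files.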
In my case, pre-processing is not an option; this is not a one-off problem with a particular file. In my application, R is run in batch mode as part of a tool chain for arbitrary CSV files. Having found cases where memory usage was as high as 20x the file size, and allowing for a copy of the loaded dataset, I'll just need to document that files as small as 1/40th of system memory may consume it all. That rules out some important datasets (US Census, UK Office for National Statistics files, etc.) for 2GB servers.

Regards, Mike
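P.S. In case it helps anyone else with the same constraint, below is the mitigation I am experimenting with. read_csv_compact is just a hypothetical helper name, and the approach assumes the file can be read at all: it re-encodes character columns as factors after the read, so it trims the steady-state footprint when values repeat, but does nothing about the transient peak during the read itself.

    # Hypothetical helper: read a CSV, then re-encode character columns as
    # factors (an integer code per element plus a levels table). This only
    # saves memory when a column's values repeat.
    read_csv_compact <- function(file) {
      d <- read.csv(file, stringsAsFactors = FALSE)
      chr <- sapply(d, is.character)
      d[chr] <- lapply(d[chr], factor)
      d
    }
    # Example (hypothetical file name):
    # d <- read_csv_compact("arbitrary.csv")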
On 8/9/07, Prof Brian Ripley <[EMAIL PROTECTED]> wrote:
> On Thu, 9 Aug 2007, Charles C. Berry wrote:
>
> > On Thu, 9 Aug 2007, Michael Cassin wrote:
> >
> >> I really appreciate the advice and this database solution will be useful to
> >> me for other problems, but in this case I need to address the specific
> >> problem of scan and read.* using so much memory.
> >>
> >> Is this expected behaviour?
>
> Yes, and documented in the 'R Internals' manual. That is basic reading
> for people wishing to comment on efficiency issues in R.
>
> >> Can the memory usage be explained, and can it be
> >> made more efficient? For what it's worth, I'd be glad to try to help
> >> if the code for scan is considered to be worth reviewing.
> >
> > Mike,
> >
> > This does not seem to be an issue with scan() per se.
> >
> > Notice the difference in size of big2, big3, and bigThree here:
> >
> >> big2 <- rep(letters,length=1e6)
> >> object.size(big2)/1e6
> > [1] 4.000856
> >> big3 <- paste(big2,big2,sep='')
> >> object.size(big3)/1e6
> > [1] 36.00002
>
> On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
> Character strings are R objects, but in some functions such as rep (and
> scan for up to 10,000 distinct strings) the objects can be shared. More
> string objects will be shared in 2.6.0 (but factors are designed to be
> efficient at storing character vectors with few values).
>
> On a 64-bit computer the overhead is usually double. So I would expect
> just over 56 bytes/string for distinct short strings (and that is what
> big3 gives).
>
> But 56Mb is really not very much (tiny on a 64-bit computer), and 1
> million items is a lot.
>
> [...]
>
> --
> Brian D. Ripley, [EMAIL PROTECTED]
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.