I don't understand why one would run a 64-bit version of R on a 2GB server, especially if one were worried about object size. You can run 32-bit versions of R on x86_64 Linux (see the R-admin manual for a comprehensive discussion), and most other 64-bit OSes default to 32-bit executables.
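A quick sanity check of which build is actually in use (a two-line sketch; the architecture string reported will vary by platform):

    R.version$arch            # e.g. "i686" for a 32-bit build, "x86_64" for a 64-bit one
    .Machine$sizeof.pointer   # 4 bytes in a 32-bit build, 8 in a 64-bit build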
Since most OSes limit 32-bit executables to around 3GB of address space, a case for
64-bit executables starts to emerge at 4GB of RAM, but there is not much of one at 2GB.
When I provided the infrastructure for this, my intention was that Linux binary
distributions on x86_64 would provide both 32-bit and 64-bit executables, but that has
not happened. It would be possible to install ix86 builds on x86_64 if -m32 were part
of the ix86 compiler specification and the dependency checks noticed that they needed
32-bit libraries. (I've had trouble with the latter on FC5: an X11 update removed all
my 32-bit X11 RPMs.)

On Fri, 10 Aug 2007, Michael Cassin wrote:

> Thanks for all the comments,
>
> The artificial dataset is as representative of my 440MB file as I could
> design.
>
> I did my best to reduce the complexity of my problem to minimal
> reproducible code, as suggested in the posting guidelines. Having
> searched the archives, I was happy to find that the topic had been
> covered, where Prof Ripley suggested that the I/O manuals gave some
> advice. However, I was unable to get anywhere with the I/O manuals'
> advice.
>
> I spent 6 hours preparing my post to R-help. Sorry not to have read
> the 'R Internals' manual. I just wanted to know if I could use scan()
> more efficiently.
>
> My hurdle seems to have nothing to do with calling scan() efficiently. I
> suspect the same is true for the originator of this memory experiment
> thread. It is the overhead of storing short strings, as Charles
> identified and Brian explained. I appreciate the investigation and
> clarification you both have made.
>
> 56B overhead for a 2-character string seems extreme to me, but I'm not
> complaining. I really like R, and since it is free, I accept that it is
> what it is.

Well, there are only about 50000 2-char strings in an 8-bit locale, so this does
seem a case for using factors (as has been pointed out several times; a short
sketch is appended below). And BTW, it is not 56B overhead, but 56B total for up
to 7 chars.

> In my case pre-processing is not an option; it is not a one-off
> problem with a particular file. In my application, R is run in batch
> mode as part of a tool chain for arbitrary csv files. Having found
> cases where memory usage was as high as 20x the file size, and allowing
> for a copy of the loaded dataset, I'll just need to document that
> files as small as 1/40th of system memory may consume it all. That
> rules out some important datasets (US Census, UK Office of National
> Statistics files, etc.) for 2GB servers.
>
> Regards, Mike
>
> On 8/9/07, Prof Brian Ripley <[EMAIL PROTECTED]> wrote:
>> On Thu, 9 Aug 2007, Charles C. Berry wrote:
>>
>>> On Thu, 9 Aug 2007, Michael Cassin wrote:
>>>
>>>> I really appreciate the advice, and this database solution will be
>>>> useful to me for other problems, but in this case I need to address
>>>> the specific problem of scan and read.* using so much memory.
>>>>
>>>> Is this expected behaviour?
>>
>> Yes, and documented in the 'R Internals' manual. That is basic reading
>> for people wishing to comment on efficiency issues in R.
>>
>>>> Can the memory usage be explained, and can it be made more efficient?
>>>> For what it's worth, I'd be glad to try to help if the code for scan
>>>> is considered to be worth reviewing.
>>>
>>> Mike,
>>>
>>> This does not seem to be an issue with scan() per se.
>>>
>>> Notice the difference in size of big2 and big3 here:
>>>
>>>> big2 <- rep(letters,length=1e6)
>>>> object.size(big2)/1e6
>>> [1] 4.000856
>>>> big3 <- paste(big2,big2,sep='')
>>>> object.size(big3)/1e6
>>> [1] 36.00002
>>
>> On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
>> Character strings are R objects, but in some functions, such as rep (and
>> scan for up to 10,000 distinct strings), the objects can be shared. More
>> string objects will be shared in 2.6.0 (but factors are designed to be
>> efficient at storing character vectors with few distinct values).
>>
>> On a 64-bit computer the overhead is usually double, so I would expect
>> just over 56 bytes/string for distinct short strings (and that is what
>> big3 gives).
>>
>> But 56MB is really not very much (tiny on a 64-bit computer), and 1
>> million items is a lot.
>>
>> [...]
>>
>> --
>> Brian D. Ripley, [EMAIL PROTECTED]
>> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,  Tel: +44 1865 272861 (self)
>> 1 South Parks Road,         +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK     Fax: +44 1865 272595
>

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,  Tel: +44 1865 272861 (self)
1 South Parks Road,         +44 1865 272866 (PA)
Oxford OX1 3TG, UK     Fax: +44 1865 272595
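P.S. A small sketch of the factor point, reusing the big2/big3 construction from
Charles's example above (exact sizes depend on the R version and on whether the
build is 32- or 64-bit; as noted above, 2.6.0 will share more strings):

    big2 <- rep(letters, length = 1e6)
    big3 <- paste(big2, big2, sep = "")  # a million 2-char strings, but only 26 distinct values
    object.size(big3)/1e6                # tens of MB here, as each element gets its own string object
    f3 <- factor(big3)                   # the 26 distinct values are stored just once, as levels
    object.size(f3)/1e6                  # about 4 MB: one integer code per element plus the levels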