Try it as a factor:

> big2 <- rep(letters,length=1e6)
> object.size(big2)/1e6
[1] 4.000856
> object.size(as.factor(big2))/1e6
[1] 4.001184
> big3 <- paste(big2,big2,sep='')
> object.size(big3)/1e6
[1] 36.00002
> object.size(as.factor(big3))/1e6
[1] 4.001184
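The same idea applies when the strings come from a file; a minimal sketch, assuming a hypothetical one-column file and a column with few distinct values:

## Hypothetical one-column file of repetitive strings.
x <- scan("column.txt", what = "")
x <- as.factor(x)   # stored as integer codes plus the unique levels
object.size(x)      # ~4MB rather than ~36MB for data like big3 above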
On 8/9/07, Charles C. Berry <[EMAIL PROTECTED]> wrote:
> On Thu, 9 Aug 2007, Michael Cassin wrote:
>
> > I really appreciate the advice, and this database solution will be
> > useful to me for other problems, but in this case I need to address
> > the specific problem of scan and read.* using so much memory.
> >
> > Is this expected behaviour? Can the memory usage be explained, and can
> > it be made more efficient? For what it's worth, I'd be glad to help if
> > the code for scan is considered worth reviewing.
>
> Mike,
>
> This does not seem to be an issue with scan() per se.
>
> Notice the difference in size of big2, big3, and bigThree here:
>
> > big2 <- rep(letters,length=1e6)
> > object.size(big2)/1e6
> [1] 4.000856
> > big3 <- paste(big2,big2,sep='')
> > object.size(big3)/1e6
> [1] 36.00002
>
> > cat(big2, file='lotsaletters.txt', sep='\n')
> > bigTwo <- scan('lotsaletters.txt', what='')
> Read 1000000 items
> > object.size(bigTwo)/1e6
> [1] 4.000856
> > cat(big3, file='moreletters.txt', sep='\n')
> > bigThree <- scan('moreletters.txt', what='')
> Read 1000000 items
> > object.size(bigThree)/1e6
> [1] 4.000856
> > all.equal(big3, bigThree)
> [1] TRUE
>
> Chuck
>
> p.s.
> > version
>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          5.1
> year           2007
> month          06
> day            27
> svn rev        42083
> language       R
> version.string R version 2.5.1 (2007-06-27)
>
> > Regards, Mike
> >
> > On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> >> Just one other thing.
> >>
> >> The command in my prior post reads the data into an in-memory
> >> database. If you find that is a problem, you can read it into a
> >> disk-based database by adding the dbname argument to the sqldf call,
> >> naming the database. The database need not exist; it will be created
> >> by sqldf and then deleted when it's through:
> >>
> >> DF <- sqldf("select * from f", dbname = tempfile(),
> >>             file.format = list(header = TRUE, row.names = FALSE))
> >>
> >> On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> >>> Another thing you could try would be reading it into a database and
> >>> then from there into R.
> >>>
> >>> The devel version of sqldf has this capability. That is, it will use
> >>> RSQLite to read the file directly into the database without going
> >>> through R at all, and then read it from there into R, so it's a
> >>> completely different process. The RSQLite software has no capability
> >>> of dealing with quotes (they will be regarded as ordinary
> >>> characters), but a single gsub can remove them afterwards. This
> >>> won't work if there are commas within the quotes, but in that case
> >>> you could read each row as a single record and then split it
> >>> yourself in R (see the sketch below).
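A minimal sketch of that clean-up, assuming the table has already arrived in a hypothetical data frame DF and that rows are simply quoted; the object and file names are illustrative, not part of sqldf:

## Strip leftover quote characters from every column after the import.
DF[] <- lapply(DF, function(col) gsub('"', '', col, fixed = TRUE))

## If fields contain embedded commas, read each row as a single record
## and split it yourself; splitting on the quote-comma-quote separator
## handles simple fully quoted rows.
rows   <- readLines("big.csv")
fields <- strsplit(rows[-1], '","', fixed = TRUE)  # drop the header row
fields <- lapply(fields, function(f) gsub('"', '', f, fixed = TRUE))
DF2    <- as.data.frame(do.call(rbind, fields),
                        stringsAsFactors = FALSE)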
> >>> Try this:
> >>>
> >>> library(sqldf)
> >>> # next statement grabs the devel version software that does this
> >>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> >>>
> >>> gc()
> >>> f <- file("big.csv")
> >>> DF <- sqldf("select * from f",
> >>>             file.format = list(header = TRUE, row.names = FALSE))
> >>> gc()
> >>>
> >>> For more info see the man page from the devel version and the home page:
> >>>
> >>> http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
> >>> http://code.google.com/p/sqldf/
> >>>
> >>> On 8/9/07, Michael Cassin <[EMAIL PROTECTED]> wrote:
> >>>> Thanks for looking, but my file has quotes. It's also 400MB, and I
> >>>> don't mind waiting, but I don't have 6x the memory to read it in.
> >>>>
> >>>> On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> >>>>> If we add quote = FALSE to the write.csv statement, it's twice as
> >>>>> fast to read it in.
> >>>>>
> >>>>> On 8/9/07, Michael Cassin <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've been having similar experiences and haven't been able to
> >>>>>> substantially improve the efficiency using the guidance in the
> >>>>>> I/O Manual.
> >>>>>>
> >>>>>> Could anyone advise on how to improve the following scan()? It is
> >>>>>> not based on my real file; please assume that I do need to read
> >>>>>> in characters, and can't do any pre-processing of the file, etc.
> >>>>>>
> >>>>>> ## Create sample file
> >>>>>> write.csv(matrix(as.character(1:1e6), ncol=10, byrow=TRUE),
> >>>>>>           "big.csv", row.names=FALSE)
> >>>>>> q()
> >>>>>>
> >>>>>> ## New session
> >>>>>> # R
> >>>>>> system("ls -l big.csv")
> >>>>>> system("free -m")
> >>>>>> big1 <- matrix(scan("big.csv", sep=",", what=character(0),
> >>>>>>                     skip=1, n=1e6), ncol=10, byrow=TRUE)
> >>>>>> system("free -m")
> >>>>>>
> >>>>>> The file is approximately 9MB, but approximately 50-60MB is used
> >>>>>> to read it in.
> >>>>>>
> >>>>>> object.size(big1) is 56MB, or 56 bytes per string, which seems
> >>>>>> excessive.
> >>>>>>
> >>>>>> Regards, Mike
> >>>>>>
> >>>>>> Configuration info:
> >>>>>> > sessionInfo()
> >>>>>> R version 2.5.1 (2007-06-27)
> >>>>>> x86_64-redhat-linux-gnu
> >>>>>> locale:
> >>>>>> C
> >>>>>> attached base packages:
> >>>>>> [1] "stats"    "graphics" "grDevices" "utils"   "datasets" "methods"
> >>>>>> [7] "base"
> >>>>>>
> >>>>>> # uname -a
> >>>>>> Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37
> >>>>>> MSD 2007 x86_64 x86_64 x86_64 GNU/Linux
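For the read.csv route, a sketch of the kind of tuning the Data Import/Export Manual recommends, applied to the sample file above; the settings are illustrative and the saving is not guaranteed:

## Declare column types and row count up front so read.csv can skip
## type guessing and pre-allocate; disable comment scanning too.
big <- read.csv("big.csv",
                colClasses   = rep("character", 10),  # all 10 columns
                nrows        = 1e5,                   # 1e6 values / 10 cols
                comment.char = "")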
> >>>>>> ====== Quoted Text ======
> >>>>>> From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
> >>>>>> Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)
> >>>>>>
> >>>>>> The R Data Import/Export Manual points out several ways in which
> >>>>>> you can use read.csv more efficiently.
> >>>>>>
> >>>>>> On Tue, 26 Jun 2007, ivo welch wrote:
> >>>>>>
> >>>>>>> dear R experts:
> >>>>>>>
> >>>>>>> I am of course no R expert, but use it regularly. I thought I
> >>>>>>> would share some experimentation with memory use. I run a linux
> >>>>>>> machine with about 4GB of memory, and R 2.5.0.
> >>>>>>>
> >>>>>>> Upon startup, gc() reports
> >>>>>>>
> >>>>>>>          used (Mb) gc trigger (Mb) max used (Mb)
> >>>>>>> Ncells 268755 14.4     407500 21.8   350000 18.7
> >>>>>>> Vcells 139137  1.1     786432  6.0   444750  3.4
> >>>>>>>
> >>>>>>> This is my baseline. linux 'top' reports 48MB as baseline. This
> >>>>>>> includes some of my own routines that are always loaded. Good.
> >>>>>>>
> >>>>>>> Next, I created an s.csv file with 22 variables and 500,000
> >>>>>>> observations, taking up an uncompressed disk space of 115MB. The
> >>>>>>> resulting object.size() after a read.csv() is 84,002,712 bytes
> >>>>>>> (80MB).
> >>>>>>>
> >>>>>>> > s= read.csv("s.csv");
> >>>>>>> > object.size(s);
> >>>>>>> [1] 84002712
> >>>>>>>
> >>>>>>> Here is where things get more interesting. After the read.csv()
> >>>>>>> is finished, gc() reports
> >>>>>>>
> >>>>>>>            used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>> Ncells   270505 14.5    8349948 446.0 11268682 601.9
> >>>>>>> Vcells 10639515 81.2   34345544 262.1 42834692 326.9
> >>>>>>>
> >>>>>>> I was a bit surprised by this---R had 928MB of memory in use at
> >>>>>>> the peak. More interestingly, this is also similar to what linux
> >>>>>>> 'top' reports as memory use of the R process (919MB, probably
> >>>>>>> 1024 vs. 1000 B/MB), even after the read.csv() is finished and
> >>>>>>> gc() has been run. Nothing seems to have been released back to
> >>>>>>> the OS.
> >>>>>>>
> >>>>>>> Now,
> >>>>>>>
> >>>>>>> > rm(s)
> >>>>>>> > gc()
> >>>>>>>          used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>> Ncells 270541 14.5    6679958 356.8 11268755 601.9
> >>>>>>> Vcells 139481  1.1   27476536 209.7 42807620 326.6
> >>>>>>>
> >>>>>>> linux 'top' now reports 650MB of memory use (though R itself
> >>>>>>> uses only 15.6Mb). My guess is that it keeps the trigger memory
> >>>>>>> of 567MB plus the base 48MB.
> >>>>>>>
> >>>>>>> There are two interesting observations for me here. First, to
> >>>>>>> read a .csv file, I need at least 10-15 times as much memory as
> >>>>>>> the file that I want to read---a lot more than the factor of 3-4
> >>>>>>> that I had expected. The moral is that IF R can read a .csv
> >>>>>>> file, one need not worry too much about running into memory
> >>>>>>> constraints later on. {R developers---reducing read.csv's memory
> >>>>>>> requirement a little would be nice. Of course, you have more
> >>>>>>> than enough on your plate already.}
> >>>>>>>
> >>>>>>> Second, memory is not fully returned to the OS. This is not
> >>>>>>> necessarily a bad thing, but good to know.
> >>>>>>>
> >>>>>>> Hope this helps...
> >>>>>>>
> >>>>>>> Sincerely,
> >>>>>>>
> >>>>>>> /iaw
> >>>>>>
> >>>>>> --
> >>>>>> Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
> >>>>>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>>>>> University of Oxford,             Tel: +44 1865 272861 (self)
> >>>>>> 1 South Parks Road,                    +44 1865 272866 (PA)
> >>>>>> Oxford OX1 3TG, UK                Fax: +44 1865 272595
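A minimal way to reproduce this kind of peak-memory measurement; the file name follows the example above, and the figures will differ by platform:

gc(reset = TRUE)         # zero the "max used" high-water marks
s <- read.csv("s.csv")   # the 115MB sample file described above
gc()                     # "max used" now shows the true peak of the read
object.size(s)           # size of the surviving object alone
rm(s); gc()              # frees memory inside R; the OS footprint may not shrink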
> >>>>>> > >>>>> > >>>> > >>>> > >>> > >> > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive Medicine > E mailto:[EMAIL PROTECTED] UC San Diego > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901 > > > ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.