One other idea: don't use byrow = TRUE. Matrices are stored in column-major order, so filling by column may be more efficient; you can always transpose the result later. I haven't tested it to see if it helps.
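A minimal sketch of this idea (untested, as the post says; the file name and dimensions are taken from the scan() example later in this thread):

```r
# Let matrix() fill in R's native column-major order (the default),
# then transpose once at the end if the row layout is actually needed.
x <- scan("big.csv", sep = ",", what = character(0), skip = 1, n = 1e6)
big1 <- t(matrix(x, nrow = 10))  # same layout as ncol = 10, byrow = TRUE
```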
On 8/9/07, Michael Cassin <[EMAIL PROTECTED]> wrote:
> I really appreciate the advice, and this database solution will be useful
> to me for other problems, but in this case I need to address the specific
> problem of scan and read.* using so much memory.
>
> Is this expected behaviour? Can the memory usage be explained, and can it
> be made more efficient? For what it's worth, I'd be glad to try to help if
> the code for scan is considered worth reviewing.
>
> Regards, Mike
>
> On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> > Just one other thing.
> >
> > The command in my prior post reads the data into an in-memory database.
> > If you find that is a problem then you can read it into a disk-based
> > database by adding the dbname argument to the sqldf call, naming the
> > database. The database need not exist; it will be created by sqldf and
> > then deleted when it's through:
> >
> > DF <- sqldf("select * from f", dbname = tempfile(),
> >             file.format = list(header = TRUE, row.names = FALSE))
> >
> > On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> > > Another thing you could try would be reading it into a database and
> > > then from there into R.
> > >
> > > The devel version of sqldf has this capability. That is, it will use
> > > RSQLite to read the file directly into the database without going
> > > through R at all, and then read it from there into R, so it is a
> > > completely different process. The RSQLite software has no capability
> > > of dealing with quotes (they will be regarded as ordinary characters),
> > > but a single gsub can remove them afterwards. This won't work if there
> > > are commas within the quotes, but in that case you could read each row
> > > as a single record and then split it yourself in R.
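The "read each row as a single record and split it yourself" fallback mentioned above might look like this (a sketch, not code from the original posts; it assumes every row has the same number of fields):

```r
# Read whole lines, strip the quote characters with a single gsub,
# then split each line on commas and bind the rows into a matrix.
lines  <- readLines("big.csv")[-1]             # drop the header line
lines  <- gsub('"', "", lines, fixed = TRUE)   # remove the quotes
fields <- strsplit(lines, ",", fixed = TRUE)   # one vector per row
big1   <- do.call(rbind, fields)               # character matrix
```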
> > > Try this:
> > >
> > > library(sqldf)
> > > # next statement grabs the devel version software that does this
> > > source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> > >
> > > gc()
> > > f <- file("big.csv")
> > > DF <- sqldf("select * from f",
> > >             file.format = list(header = TRUE, row.names = FALSE))
> > > gc()
> > >
> > > For more info see the man page from the devel version and the home
> > > page:
> > >
> > > http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
> > > http://code.google.com/p/sqldf/
> > >
> > > On 8/9/07, Michael Cassin <[EMAIL PROTECTED]> wrote:
> > > > Thanks for looking, but my file has quotes. It's also 400MB, and I
> > > > don't mind waiting, but I don't have 6x the memory to read it in.
> > > >
> > > > On 8/9/07, Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> > > > > If we add quote = FALSE to the write.csv statement, it is twice
> > > > > as fast reading it in.
> > > > >
> > > > > On 8/9/07, Michael Cassin <[EMAIL PROTECTED]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've been having similar experiences and haven't been able to
> > > > > > substantially improve the efficiency using the guidance in the
> > > > > > I/O Manual.
> > > > > >
> > > > > > Could anyone advise on how to improve the following scan()? It
> > > > > > is not based on my real file; please assume that I do need to
> > > > > > read in characters, and can't do any pre-processing of the
> > > > > > file, etc.
> > > > > >
> > > > > > ## Create Sample File
> > > > > > write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
> > > > > >           "big.csv", row.names = FALSE)
> > > > > > q()
> > > > > >
> > > > > > ## New Session
> > > > > > # R
> > > > > > system("ls -l big.csv")
> > > > > > system("free -m")
> > > > > > big1 <- matrix(scan("big.csv", sep = ",",
> > > > > >                     what = character(0), skip = 1, n = 1e6),
> > > > > >                ncol = 10, byrow = TRUE)
> > > > > > system("free -m")
> > > > > >
> > > > > > The file is approximately 9MB, but approximately 50-60MB is
> > > > > > used to read it in.
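One way to pin down figures like the 50-60MB above from within R, rather than via free -m, is gc(reset = TRUE), which zeroes the "max used" columns so they report the peak reached during the following call alone (a sketch; it assumes big.csv was created as in the example above):

```r
gc(reset = TRUE)        # zero the "max used" counters
big1 <- matrix(scan("big.csv", sep = ",", what = character(0),
                    skip = 1, n = 1e6),
               ncol = 10, byrow = TRUE)
gc()                    # "max used" now shows the peak during the scan
object.size(big1)       # bytes retained by the result itself
```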
> > > > > > object.size(big1) is 56MB, or 56 bytes per string, which seems
> > > > > > excessive.
> > > > > >
> > > > > > Regards, Mike
> > > > > >
> > > > > > Configuration info:
> > > > > > > sessionInfo()
> > > > > > R version 2.5.1 (2007-06-27)
> > > > > > x86_64-redhat-linux-gnu
> > > > > > locale: C
> > > > > > attached base packages:
> > > > > > [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
> > > > > > [7] "base"
> > > > > >
> > > > > > # uname -a
> > > > > > Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37
> > > > > > MSD 2007 x86_64 x86_64 x86_64 GNU/Linux
> > > > > >
> > > > > > ====== Quoted Text ======
> > > > > > From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
> > > > > > Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)
> > > > > >
> > > > > > The R Data Import/Export Manual points out several ways in
> > > > > > which you can use read.csv more efficiently.
> > > > > >
> > > > > > On Tue, 26 Jun 2007, ivo welch wrote:
> > > > > > > dear R experts:
> > > > > > >
> > > > > > > I am of course no R expert, but I use it regularly. I thought
> > > > > > > I would share some experimentation with memory use. I run a
> > > > > > > linux machine with about 4GB of memory, and R 2.5.0.
> > > > > > >
> > > > > > > upon startup, gc() reports
> > > > > > >
> > > > > > >          used (Mb) gc trigger (Mb) max used (Mb)
> > > > > > > Ncells 268755 14.4     407500 21.8   350000 18.7
> > > > > > > Vcells 139137  1.1     786432  6.0   444750  3.4
> > > > > > >
> > > > > > > This is my baseline. linux 'top' reports 48MB as baseline.
> > > > > > > This includes some of my own routines that are always loaded.
> > > > > > > Good.
> > > > > > >
> > > > > > > Next, I created an s.csv file with 22 variables and 500,000
> > > > > > > observations, taking up an uncompressed disk space of 115MB.
> > > > > > > The resulting object.size() after a read.csv() is 84,002,712
> > > > > > > bytes (80MB).
> > > > > > >
> > > > > > > > s <- read.csv("s.csv")
> > > > > > > > object.size(s)
> > > > > > > [1] 84002712
> > > > > > >
> > > > > > > here is where things get more interesting. after the
> > > > > > > read.csv() is finished, gc() reports
> > > > > > >
> > > > > > >             used (Mb) gc trigger  (Mb) max used  (Mb)
> > > > > > > Ncells    270505 14.5    8349948 446.0 11268682 601.9
> > > > > > > Vcells  10639515 81.2   34345544 262.1 42834692 326.9
> > > > > > >
> > > > > > > I was a bit surprised by this: R had 928MB of memory in use
> > > > > > > at its peak. More interestingly, this is also similar to what
> > > > > > > linux 'top' reports as memory use of the R process (919MB,
> > > > > > > probably 1024 vs. 1000 B/MB), even after the read.csv() is
> > > > > > > finished and gc() has been run. Nothing seems to have been
> > > > > > > released back to the OS.
> > > > > > >
> > > > > > > Now,
> > > > > > > > rm(s)
> > > > > > > > gc()
> > > > > > >            used (Mb) gc trigger  (Mb) max used  (Mb)
> > > > > > > Ncells  270541 14.5    6679958 356.8 11268755 601.9
> > > > > > > Vcells  139481  1.1   27476536 209.7 42807620 326.6
> > > > > > >
> > > > > > > linux 'top' now reports 650MB of memory use (though R itself
> > > > > > > uses only 15.6Mb). My guess is that it keeps the trigger
> > > > > > > memory of 567MB plus the base 48MB.
> > > > > > >
> > > > > > > There are two interesting observations for me here: first,
> > > > > > > to read a .csv file, I need at least 10-15 times as much
> > > > > > > memory as the file that I want to read, a lot more than the
> > > > > > > factor of 3-4 that I had expected. The moral is that IF R
> > > > > > > can read a .csv file, one need not worry too much about
> > > > > > > running into memory constraints later on.
> > > > > > > {R Developers: reducing read.csv's memory requirement a
> > > > > > > little would be nice. Of course, you have more than enough
> > > > > > > on your plate already.}
> > > > > > >
> > > > > > > Second, memory is not returned fully to the OS. This is not
> > > > > > > necessarily a bad thing, but good to know.
> > > > > > >
> > > > > > > Hope this helps...
> > > > > > >
> > > > > > > Sincerely,
> > > > > > >
> > > > > > > /iaw
> > > > > > >
> > > > > > > ______________________________________________
> > > > > > > R-help_at_stat.math.ethz.ch mailing list
> > > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > > PLEASE do read the posting guide
> > > > > > > http://www.R-project.org/posting-guide.html
> > > > > > > and provide commented, minimal, self-contained, reproducible
> > > > > > > code.
> > > > > >
> > > > > > --
> > > > > > Brian D. Ripley, ripley_at_stats.ox.ac.uk
> > > > > > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> > > > > > University of Oxford, Tel: +44 1865 272861 (self)
> > > > > > 1 South Parks Road, +44 1865 272866 (PA)
> > > > > > Oxford OX1 3TG, UK, Fax: +44 1865 272595
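For reference, the read.csv tuning that the Import/Export Manual points to can be sketched as follows (a hypothetical illustration: the 22 columns and 500,000 rows come from the post above, but treating every column as numeric is an assumption, not something the posts state):

```r
# Declaring the column classes and the row count up front lets
# read.csv allocate once and skip per-column type guessing.
s <- read.csv("s.csv",
              colClasses = rep("numeric", 22),  # assumed column types
              nrows = 500000)                   # known row count
```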