On Wed, 10 Jan 2007, Iris Kolder wrote:

> Hi,
>
> I listened to all your advice and ran my data on a computer with a
> 64-bit processor, but I still get the same error saying it "cannot
> allocate a vector of that size" (1240 Kb). I don't want to cut my data
> into smaller pieces because we are looking at interactions. So are
> there any other options for me to try, or should I wait for the
> development of more advanced computers!
Did you use a 64-bit build of R on that machine? If the message is the
same, I strongly suspect not: 64-bit builds are not the default on most
OSes.

> Thanks,
>
> Iris
>
> ----- Forwarded Message ----
> From: Iris Kolder <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 21, 2006 2:07:08 PM
> Subject: Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]
>
> Thank you all for your help!
>
> So with all your suggestions we will try to run it on a computer with a
> 64-bit processor. But I've been told that the new R versions all work
> on a 32-bit processor. I read in other posts that only the old R
> versions could handle larger data sets and ran under 64-bit processors,
> and also that the new R version is being adapted for 64-bit processors
> again. Does anyone know whether there is a version available that we
> could use?
>
> Iris Kolder
>
> ----- Original Message ----
> From: "Liaw, Andy" <[EMAIL PROTECTED]>
> To: Martin Morgan <[EMAIL PROTECTED]>; Iris Kolder <[EMAIL PROTECTED]>
> Cc: [email protected]; N.C. Onland-moret <[EMAIL PROTECTED]>
> Sent: Monday, December 18, 2006 7:48:23 PM
> Subject: RE: [R] Memory problem on a linux cluster using a large data set [Broadcast]
>
> In addition to my off-list reply to Iris (pointing her to an old post
> of mine that detailed the memory requirement of RF in R), she might
> consider the following:
>
> - Use a larger nodesize
> - Use sampsize to control the size of the bootstrap samples
>
> Both of these have the effect of reducing the sizes of the trees grown.
> For a data set that large, it may not matter to grow smaller trees.
>
> Still, with data of that size, I'd say 64-bit is the better solution.
>
> Cheers,
> Andy
>
> From: Martin Morgan
>>
>> Iris --
>>
>> I hope the following helps; I think you have too much data for a
>> 32-bit machine.
>>
>> Martin
>>
>> Iris Kolder <[EMAIL PROTECTED]> writes:
>>
>>> Hello,
>>>
>>> I have a large data set of 320,000 rows and 1000 columns. All the
>>> data has the values 0, 1, 2.
>>
>> It seems like a single copy of this data set will be at least a
>> couple of gigabytes; I think you'll have access to only 4 GB on a
>> 32-bit machine (see section 8 of the R Installation and
>> Administration guide), and R will probably end up, even in the best
>> of situations, making at least a couple of copies of your data.
>> Probably you'll need a 64-bit machine, or figure out algorithms that
>> work on chunks of data.
>>
>>> on a linux cluster with R version 2.1.0, which operates on a 32
>>
>> This is quite old, and in general newer R versions have become more
>> sensitive to big-data issues and to tracking down unnecessary memory
>> copying.
>>
>>> "cannot allocate vector size 1240 kb". I've searched through
>>
>> Use traceback() or options(error=recover) to figure out where this is
>> actually occurring.
>>
>>> SNP <- read.table("file.txt", header=FALSE, sep="")   # read in data file
>>
>> This makes a data.frame, and data frames have several aspects (e.g.,
>> automatic creation of row names on sub-setting) that can be
>> problematic in terms of memory use. Probably better to use a matrix,
>> for which:
>>
>>   'read.table' is not the right tool for reading large matrices,
>>   especially those with many columns: it is designed to read _data
>>   frames_ which may have columns of very different classes. Use
>>   'scan' instead.
>>
>> (from the help page for read.table).
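For concreteness, a minimal sketch of what the scan() route could look
like, assuming the whitespace-separated file and the 320,000 x 1000
layout of 0/1/2 codes described in this thread; the file name and the
column count are placeholders taken from the script quoted further down:

    ## Read the codes straight into a matrix, avoiding the data.frame
    ## overhead of read.table().
    x   <- scan("file.txt", what = double(0))    # one long numeric vector
    SNP <- matrix(x, ncol = 1000, byrow = TRUE)  # reshape to 320,000 x 1000
    rm(x); gc()                                  # release the intermediate vector
    SNP[SNP == 9] <- NA                          # 9 codes a missing value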
>> I'm not sure of the details of the algorithms you'll invoke, but it
>> might be a false economy to try to get scan to read in 'small'
>> versions (e.g., integer rather than numeric) of the data -- the
>> algorithms might insist on numeric data, and then make a copy during
>> coercion from your small version to numeric.
>>
>>> SNP$total.NAs = rowSums(is.na(SNP))   # calculate the number of NAs per row and add a column with the total NAs
>>
>> This adds a column to the data.frame or matrix, probably causing at
>> least one copy of the entire data. Create a separate vector instead,
>> even though this unties the coordination between columns that a data
>> frame provides.
>>
>>> SNP = t(as.matrix(SNP))               # transpose rows and columns
>>
>> This will also probably trigger a copy.
>>
>>> snp.na<-SNP
>>
>> R might be clever enough to figure out that this simple assignment
>> does not trigger a copy. But it probably means that any subsequent
>> modification of snp.na or SNP *will* trigger a copy, so avoid the
>> assignment if possible.
>>
>>> snp.roughfix<-na.roughfix(snp.na)
>>> fSNP<-factor(snp.roughfix[, 1])       # assigns factor to case/control status
>>>
>>> snp.narf<- randomForest(snp.roughfix[,-1], fSNP,
>>>     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
>>>     keep.forest=FALSE, do.trace=100)
>>
>> Now you're entirely in the hands of randomForest. If memory problems
>> occur here, perhaps you'll have gained enough experience to point the
>> package maintainer to the problem and suggest a possible solution.
>>
>>> set it should be able to cope with that amount. Perhaps someone has
>>> tried this before in R or is Fortran a better choice? I added my R
>>
>> If you mean a pure Fortran solution, including coding the random
>> forest algorithm, then of course you have complete control over
>> memory management. You'd still likely be limited to addressing 4 GB
>> of memory.
>>
>>> I wrote a script to remove all the rows with more than 46 missing
>>> values. This works perfectly on a smaller data set, but the problem
>>> arises when I try to run it on the larger data set: I get an error
>>> "cannot allocate vector size 1240 kb". I've searched through
>>> previous posts and found out that it might be because I'm running it
>>> on a linux cluster with R version 2.1.0, which operates on a 32-bit
>>> processor. But I could not find a solution for this problem. The
>>> cluster is a really fast one and should be able to cope with these
>>> large amounts of data; its configuration is Speed: 3.4 GHz, memory:
>>> 4 GByte. Is there a way to change the settings or the processor
>>> under R? I want to run the randomForest function on my large data
>>> set; it should be able to cope with that amount. Perhaps someone has
>>> tried this before in R, or is Fortran a better choice? I added my R
>>> script down below.
>>>
>>> Best regards,
>>>
>>> Iris Kolder
>>>
>>> SNP <- read.table("file.txt", header=FALSE, sep="")   # read in data file
>>> SNP[SNP==9]<-NA                        # change missing values from a 9 to an NA
>>> SNP$total.NAs = rowSums(is.na(SNP))    # calculate the number of NAs per row and add a column with the total NAs
>>> SNP = SNP[ SNP$total.NAs < 46, ]       # create a subset with no more than 5% (46) NAs
>>> SNP$total.NAs=NULL                     # remove the added column with the sum of NAs
>>> SNP = t(as.matrix(SNP))                # transpose rows and columns
>>> set.seed(1)
>>> snp.na<-SNP
>>> snp.roughfix<-na.roughfix(snp.na)
>>> fSNP<-factor(snp.roughfix[, 1])        # assigns factor to case/control status
>>>
>>> snp.narf<- randomForest(snp.roughfix[,-1], fSNP,
>>>     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
>>>     keep.forest=FALSE, do.trace=100)
>>>
>>> print(snp.narf)
>>
>> --
>> Martin T. Morgan
>> Bioconductor / Computational Biology
>> http://bioconductor.org
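Pulling the thread's suggestions together, here is a rough sketch of how
the middle of that script might look with the NA counts kept in a
separate vector (rather than added as a column) and with Andy's nodesize
and sampsize arguments limiting the size of the trees. It assumes SNP is
already a numeric matrix (e.g., read with scan() as sketched earlier);
the sampsize and nodesize values are placeholders to experiment with,
not recommendations from this thread.

    library(randomForest)

    n.na <- rowSums(is.na(SNP))       # NA count per row, kept outside SNP
    SNP  <- SNP[n.na < 46, ]          # keep rows with fewer than 46 NAs
    SNP  <- t(SNP)                    # transpose, as in the original script

    set.seed(1)
    snp.roughfix <- na.roughfix(SNP)          # crude imputation of remaining NAs
    fSNP <- factor(snp.roughfix[, 1])         # case/control status

    snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
                             ntree = 500, mtry = 10,
                             sampsize = 500,  # smaller bootstrap samples (placeholder)
                             nodesize = 20,   # larger terminal nodes (placeholder)
                             importance = TRUE, keep.forest = FALSE,
                             do.trace = 100)
    print(snp.narf)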
--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
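As a quick check on the point at the top of this thread -- whether the
running R session is itself a 64-bit build, rather than merely running
on 64-bit hardware -- something like the following should work in a
reasonably recent R:

    R.version$arch               # e.g. "x86_64" on a 64-bit build, "i386"/"i686" on 32-bit
    8 * .Machine$sizeof.pointer  # 64 for a 64-bit build of R, 32 for a 32-bit build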
