I set maxnodes and nodesize to reasonable levels and everything is working great now. Thanks for the guidance.
v/r,
John

On Thu, Sep 8, 2011 at 8:58 AM, Liaw, Andy <[email protected]> wrote:
> It looks like you are building a regression model. With such a large number
> of rows, you should try to limit the size of the trees by setting nodesize to
> something larger than the default (5). The issue, I suspect, is that the
> largest possible tree has about 2*n/nodesize nodes (n being the number of
> rows), and each node takes a row in a matrix to store. Multiply that by the
> number of trees you are trying to build, and you see how the memory can be
> gobbled up quickly. Boosted trees don't usually run into this problem because
> one usually boosts very small trees (usually no more than 10 terminal nodes
> per tree).
>
> Best,
> Andy
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of John Foreman
>> Sent: Wednesday, September 07, 2011 2:46 PM
>> To: [email protected]
>> Subject: [R] randomForest memory footprint
>>
>> Hello, I am attempting to train a random forest model using the
>> randomForest package on 500,000 rows and 8 columns (7 predictors, 1
>> response). The data set is the first block of data from the UCI
>> Machine Learning Repo dataset "Record Linkage Comparison Patterns"
>> with the slight modification that I dropped two columns with lots of
>> NAs and I used knn imputation to fill in other gaps.
>>
>> When I load in my dataset, R uses no more than 100 megs of RAM. I'm
>> running a 64-bit R with ~4 gigs of RAM available. When I execute the
>> randomForest() function, however, I get memory complaints. Example:
>>
>> > summary(mydata1.clean[,3:10])
>>   cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd
>>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
>>  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000
>>  Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000
>>  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
>>  3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000
>>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
>>      cmp_bm           cmp_by          cmp_plz         is_match
>>  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
>>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
>>  Median :0.0000   Median :0.0000   Median :0.00000
>>  Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
>>  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
>>  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
>> > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9], y = mydata1.clean[,10], ntree = 100)
>> Error: cannot allocate vector of size 877.2 Mb
>> In addition: Warning messages:
>> 1: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 2: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 3: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 4: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>>
>> Other techniques such as boosted trees handle the data size just fine.
>> Are there any parameters I can adjust such that I can use a value of
>> 100 or more for ntree?
>>
>> Thanks,
>> John
>>
>> ______________________________________________
>> [email protected] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
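
For anyone reading this thread later, here is a rough sketch of the arithmetic behind Andy's explanation and of the kind of call that resolved the problem. The helpers approx_nodes() and approx_table_mb() are hypothetical and not part of the randomForest package. The bound of roughly 2 * n / nodesize nodes per tree and the 8 bytes per entry are simplifying assumptions, and the package's exact sizing may differ. The nodesize and maxnodes values are illustrative only; the thread does not say which values John settled on.

```r
# Hypothetical helper: rough upper bound on the number of nodes in one tree.
# Assumes about 2 * n / nodesize nodes when trees are grown to full size.
approx_nodes <- function(n_rows, nodesize, maxnodes = NULL) {
  bound <- 2 * floor(n_rows / nodesize) + 1
  if (!is.null(maxnodes)) {
    # maxnodes caps terminal nodes; a binary tree with k leaves has 2k - 1 nodes
    bound <- min(bound, 2 * maxnodes - 1)
  }
  bound
}

# Hypothetical helper: size in MB of ONE numeric (8-byte) matrix with one row
# per node and one column per tree. A fitted forest stores several of these.
approx_table_mb <- function(n_rows, nodesize, ntree, maxnodes = NULL) {
  approx_nodes(n_rows, nodesize, maxnodes) * ntree * 8 / 1024^2
}

n <- 572820 + 2093  # row count taken from the is_match column of summary()

default_mb <- approx_table_mb(n, nodesize = 5, ntree = 100)
limited_mb <- approx_table_mb(n, nodesize = 500, ntree = 100, maxnodes = 64)

cat(sprintf("nodesize = 5 (default): ~%.1f MB per node matrix\n", default_mb))
cat(sprintf("nodesize = 500, maxnodes = 64: ~%.3f MB per node matrix\n", limited_mb))

# The corresponding model fit, using the object names from the original post.
# Runs only if the data and the randomForest package are actually available.
if (exists("mydata1.clean") && requireNamespace("randomForest", quietly = TRUE)) {
  mydata1.rf.model2 <- randomForest::randomForest(
    x = mydata1.clean[, 3:9],
    y = mydata1.clean[, 10],
    ntree = 100,
    nodesize = 500,  # illustrative value, larger than the default
    maxnodes = 64    # illustrative cap on terminal nodes per tree
  )
}
```

Both nodesize and maxnodes are documented arguments of randomForest(). Raising nodesize or capping maxnodes shrinks every per-node matrix in proportion, which is why the memory use drops so sharply.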

