I set maxnodes and nodesize to reasonable levels and everything is
working great now. Thanks for the guidance.
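
A minimal sketch of such a call, on synthetic data standing in for
mydata1.clean. The nodesize, maxnodes and ntree values are illustrative
assumptions (the settings actually used are not given in this thread), and
approx_max_nodes is a hypothetical helper giving only a rough node-count
bound; it is not part of the randomForest package.

```r
library(randomForest)

# Hypothetical helper: rough upper bound on the total number of nodes in one
# tree. A tree cannot have more than about n / nodesize terminal nodes, and a
# binary tree has about twice as many nodes as it has terminal nodes.
approx_max_nodes <- function(n, nodesize) 2 * floor(n / nodesize) + 1

# With roughly the row count from this thread and the regression default
# nodesize of 5, each tree may need room for a very large number of nodes.
approx_max_nodes(574913, 5)

# Synthetic stand-in data: 7 numeric predictors and a 0/1 numeric response,
# so randomForest() fits a regression forest.
set.seed(1)
n <- 5000
x <- as.data.frame(matrix(runif(n * 7), ncol = 7))
y <- as.numeric(x[[1]] + x[[2]] > 1)

rf <- randomForest(x = x, y = y, ntree = 100,
                   nodesize = 50,   # assumed value: minimum terminal node size
                   maxnodes = 64)   # assumed value: cap on terminal nodes per tree

# Number of terminal nodes in the largest tree that was grown
max(treesize(rf))
```

Raising nodesize and capping maxnodes both shrink the per-tree storage, so
the same ntree fits in far less memory; the trade-off is coarser trees, so
the values are worth tuning against out-of-bag error.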

v/r,
John

On Thu, Sep 8, 2011 at 8:58 AM, Liaw, Andy <[email protected]> wrote:
> It looks like you are building a regression model.  With such a large number
> of rows, you should try to limit the size of the trees by setting nodesize to
> something larger than the default (5).  The issue, I suspect, is that the
> largest possible tree has about 2*n/nodesize nodes (n being the number of
> rows), and each node takes a row in a matrix to store.  Multiply that by the
> number of trees you are trying to build, and you see how the memory can be
> gobbled up quickly.  Boosted trees don't usually run into this problem
> because one usually boosts very small trees (usually no more than 10
> terminal nodes per tree).
>
> Best,
> Andy
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of John Foreman
>> Sent: Wednesday, September 07, 2011 2:46 PM
>> To: [email protected]
>> Subject: [R] randomForest memory footprint
>>
>> Hello, I am attempting to train a random forest model using the
>> randomForest package on 500,000 rows and 8 columns (7 predictors, 1
>> response). The data set is the first block of data from the UCI
>> Machine Learning Repo dataset "Record Linkage Comparison Patterns"
>> with the slight modification that I dropped two columns with lots of
>> NA's and I used knn imputation to fill in other gaps.
>>
>> When I load in my dataset, R uses no more than 100 megs of RAM. I'm
>> running a 64-bit R with ~4 gigs of RAM available. When I execute the
>> randomForest() function, however, I get memory complaints. Example:
>>
>> > summary(mydata1.clean[,3:10])
>>   cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd
>>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
>>  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000
>>  Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000
>>  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
>>  3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000
>>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
>>      cmp_bm           cmp_by          cmp_plz         is_match
>>  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
>>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
>>  Median :0.0000   Median :0.0000   Median :0.00000
>>  Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
>>  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
>>  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
>> > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9],
>> +     y = mydata1.clean[,10], ntree = 100)
>> Error: cannot allocate vector of size 877.2 Mb
>> In addition: Warning messages:
>> 1: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 2: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 3: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>> 4: In dim(data) <- dim :
>>   Reached total allocation of 3992Mb: see help(memory.size)
>>
>> Other techniques such as boosted trees handle the data size just fine.
>> Are there any parameters I can adjust such that I can use a value of
>> 100 or more for ntree?
>>
>> Thanks,
>> John
>>
>> ______________________________________________
>> [email protected] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

