It looks like you are building a regression model.  With such a large number of
rows, you should try to limit the size of the trees by setting nodesize to
something larger than the default (5).  The issue, I suspect, is that the
largest possible tree has about 2*n/nodesize nodes (n being the number of
rows), and each node takes a row in a matrix to store.  Multiply that by the
number of trees you are trying to build, and you can see how the memory gets
gobbled up quickly.  Boosted trees don't usually run into this problem because
one usually boosts very small trees (usually no more than 10 terminal nodes
per tree).
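
To make the arithmetic concrete, here is a rough sketch.  The helper
max_nodes_per_tree() is hypothetical and only approximates the bound; it is
not the package's exact internal bookkeeping.  The second half uses small
synthetic data (not the data from the original post) and runs only if the
randomForest package is installed; it compares average tree sizes for two
nodesize settings using treesize().

```r
# Rough upper bound on the number of nodes in one tree, assuming the
# largest tree has about 2 * n / nodesize nodes (an approximation).
max_nodes_per_tree <- function(n, nodesize) 2 * floor(n / nodesize) + 1

n     <- 500000   # number of rows, as in the original post
ntree <- 100      # number of trees requested

for (ns in c(5, 50, 500)) {
  per_tree <- max_nodes_per_tree(n, ns)
  cat(sprintf("nodesize = %3d: up to %6d nodes per tree, %8d across %d trees\n",
              ns, per_tree, per_tree * ntree, ntree))
}

# Small synthetic demonstration (made-up data, for illustration only).
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  x <- data.frame(matrix(runif(2000 * 7), ncol = 7))
  y <- x[[1]] + rnorm(2000, sd = 0.1)
  small <- randomForest::randomForest(x, y, ntree = 20, nodesize = 5)
  large <- randomForest::randomForest(x, y, ntree = 20, nodesize = 100)
  # treesize() returns the number of terminal nodes in each tree.
  print(c(nodesize_5   = mean(randomForest::treesize(small)),
          nodesize_100 = mean(randomForest::treesize(large))))
}
```

A larger nodesize should give noticeably smaller trees, so it is worth trying
a few values on a subsample of the data before fitting on all of the rows.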

Best,
Andy 

> -----Original Message-----
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
> Sent: Wednesday, September 07, 2011 2:46 PM
> To: r-help@r-project.org
> Subject: [R] randomForest memory footprint
> 
> Hello, I am attempting to train a random forest model using the
> randomForest package on 500,000 rows and 8 columns (7 predictors, 1
> response). The data set is the first block of data from the UCI
> Machine Learning Repo dataset "Record Linkage Comparison Patterns"
> with the slight modification that I dropped two columns with lots of
> NA's and I used knn imputation to fill in other gaps.
> 
> When I load in my dataset, R uses no more than 100 megs of RAM. I'm
> running a 64-bit R with ~4 gigs of RAM available. When I execute the
> randomForest() function, however, I get memory complaints. Example:
> 
> > summary(mydata1.clean[,3:10])
>   cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd           cmp_bm           cmp_by          cmp_plz          is_match
>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
>  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
>  Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000
>  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247   Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
>  3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
> > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9], y = mydata1.clean[,10], ntree = 100)
> Error: cannot allocate vector of size 877.2 Mb
> In addition: Warning messages:
> 1: In dim(data) <- dim :
>   Reached total allocation of 3992Mb: see help(memory.size)
> 2: In dim(data) <- dim :
>   Reached total allocation of 3992Mb: see help(memory.size)
> 3: In dim(data) <- dim :
>   Reached total allocation of 3992Mb: see help(memory.size)
> 4: In dim(data) <- dim :
>   Reached total allocation of 3992Mb: see help(memory.size)
> 
> Other techniques such as boosted trees handle the data size just fine.
> Are there any parameters I can adjust such that I can use a value of
> 100 or more for ntree?
> 
> Thanks,
> John
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 