Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Mike Williamson Thu, 01 Jul 2010 16:46:08 -0700

Jim, Andy,

    Thanks for your suggestions!


    I found some time today to experiment with it, and a "homemade" script
for filling in NA values turned out to be much quicker.  For those who are
interested, instead of using:

          dataSet <- na.roughfix(dataSet)



    I used:

                    origCols <- names(dataSet)
                    ## Fix numeric values: NA -> median of the column
                    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
                        if(!is.numeric(x)) { x } else {
                            ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
                                             row.names=row.names(dataSet) )
                    ## Fix factors: NA -> most frequent level
                    ## (which.max(table(x)) is the index of that level)
                    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
                        if(!is.factor(x)) { x } else {
                            factor(levels(x)[ifelse(!is.na(x), as.integer(x),
                                                    which.max(table(x)))],
                                   levels=levels(x)) } } ),
                                             row.names=row.names(dataSet) )
                    ## as.data.frame() may alter column names; restore them
                    names(dataSet) <- origCols



    In one case study that I ran, the na.roughfix() algo took 296 seconds
whereas the homemade one above took 16 seconds.
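
    For anyone who wants to reuse the idea, here is a minimal sketch of the
same approach wrapped in a function.  The name fillNaSimple and the toy data
are made up for illustration; this is not part of the randomForest package,
and I have not checked it against na.roughfix() on every edge case (columns
that are entirely NA are simply left alone).

```r
## Hypothetical helper (illustrative name, not from randomForest):
## fill NAs column by column in a data frame.
fillNaSimple <- function(df) {
    stopifnot(is.data.frame(df))
    for (j in seq_along(df)) {
        x <- df[[j]]
        miss <- is.na(x)
        ## skip columns with nothing to fill, or nothing to fill from
        if (!any(miss) || all(miss)) next
        if (is.numeric(x)) {
            ## numeric column: NA -> median of the observed values
            x[miss] <- median(x, na.rm = TRUE)
        } else if (is.factor(x)) {
            ## factor column: NA -> most frequent level (first one on ties)
            x[miss] <- names(which.max(table(x)))
        }
        df[[j]] <- x
    }
    df
}

## Tiny made-up example
toy <- data.frame(a = c(1, NA, 3, 10),
                  b = factor(c("u", "v", NA, "v")))
fixed <- fillNaSimple(toy)
```

    Timing it the same way as before would look like
system.time(dataSet <- fillNaSimple(dataSet)).  Assigning into df[[j]]
keeps the column names and row names untouched, so there is no need to save
and restore them.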

                                      Regards,
                                            Mike



"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_l...@merck.com> wrote:

>  You need to isolate the problem further, or give more detail about your
> data.  This is what I get:
>
> R> nr <- 2134
> R> nc <- 14037
> R> x <- matrix(runif(nr*nc), nr, nc)
> R> n.na <- round(nr*nc/10)
> R> x[sample(nr*nc, n.na)] <- NA
> R> system.time(x.fixed <- na.roughfix(x))
>    user  system elapsed
>    8.44    0.39    8.85
> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
> ram.
>
> Andy
>
>  ------------------------------
> *From:* Mike Williamson [mailto:this.is....@gmail.com]
> *Sent:* Thursday, July 01, 2010 12:48 PM
> *To:* Liaw, Andy
> *Cc:* r-help
> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
> so slow??
>
> Andy,
>
>     You're right, I didn't supply any code, because my call was very simple
> and it was the call itself in question.  However, here is the associated
> code I am using:
>
>
>         naFixTime <- system.time( {
>             ## TRUE: there are no NA's in the response... cleared via earlier steps
>             if (fltrResponse) {
>                 message(paste(iAm,": Missing values will now be imputed...\n", sep=""))
>                 try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
>                                          dataSet[,response]) )
>             } else {  ## In this case, there is no "response" column in the data set
>                 message(paste(iAm,": Missing values will now be filled in with median",
>                               " values or most frequent levels", sep=""))
>                 try( dataSet <- na.roughfix(dataSet) )
>             }
>         } )
>
>
>
>     As you can see, the "na.roughfix" call is made as simply as possible:
> I supply the entire dataSet (only parameters, no responses).  I am not doing
> the prediction here (that is done later, and the prediction itself is not
> taking very long).
>     Here are some calculation times that I experienced:
>
> # rows       # cols       time to run na.roughfix
> =======     =======     ====================
>   2046          2833             ~ 2 minutes
>   2066          5626             ~ 6 minutes
>   2134         14037             ~ 30 minutes
>
>     These numbers are on a Windows server using the 64-bit version of 'R'.
>
>                                           Regards,
>                                                    Mike
>
>
> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:
>
>> You have not shown any code on exactly how you use na.roughfix(), so I
>> can only guess.
>>
>> If you are doing something like:
>>
>>  randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>>
>> I would not be surprised that it's taking very long on large datasets.
>> Most likely it's caused by the formula interface, not na.roughfix()
>> itself.
>>
>> If that is your case, try doing the imputation beforehand and run
>> randomForest() afterward; e.g.,
>>
>> myroughfixed <- na.roughfix(mybigdata)
>> randomForest(myroughfixed[list.of.predictor.columns],
>> myroughfixed[[myresponse]],...)
>>
>> HTH,
>> Andy
>>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>> On Behalf Of Mike Williamson
>> Sent: Wednesday, June 30, 2010 7:53 PM
>> To: r-help
>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
>> slow??
>>
>> Hi all,
>>
>>    I am using the package "random forest" for random forest predictions.
>> I like the package.  However, I have fairly large data sets, and it can
>> often take *hours* just to go through the "na.roughfix" call, which simply
>> goes through and cleans up any NA values to either the median (numerical
>> data) or the most frequent occurrence (factors).
>>    I am going to start doing some comparisons between na.roughfix() and
>> some apply() functions which, it seems, are able to do the same job more
>> quickly.  But I hesitate to duplicate a function that is already in the
>> package, since I presume the na.roughfix should be as quick as possible
>> and it should also be well "tailored" to the requirements of random forest.
>>
>>    Has anyone else seen that this is really slow?  (I haven't noticed
>> rfImpute to be nearly as slow, but I cannot say for sure:  my "predict"
>> data sets are MUCH larger than my model data sets, so cleaning the
>> prediction data set simply takes much longer.)
>>    If so, any ideas how to speed this up?
>>
>>                              Thanks!
>>                                   Mike
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
