Hadley,
Thanks! Yes... as.data.frame() is quite slow. (And it forces the
column names to become "acceptable" names, which is a hassle to fix all the
time.) I just hadn't thought of something as clever as what you wrote
below.
I'll try out this suggestion. :)
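For anyone following along, here is a minimal sketch of why the structure() approach also avoids the column-name problem I mentioned. The toy data frame and the helper name fill_na are made up for illustration, and the helper is only a simplified stand-in for the roughfix() function quoted below, not the randomForest implementation.

```r
# Toy data (hypothetical): one numeric column and one factor column whose
# name is not a syntactically valid R name.
df <- data.frame(a = c(1, NA, 3))
df[["my col"]] <- factor(c("x", NA, "x"))

# Simplified stand-in for the per-column fix: median for numeric columns,
# most frequent level for factors.
fill_na <- function(x) {
  miss <- is.na(x)
  if (!any(miss)) return(x)
  if (is.numeric(x)) {
    x[miss] <- median(x[!miss])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[miss] <- names(freq)[which.max(freq)]
  }
  x
}

cols <- lapply(df, fill_na)

# Rebuild the data frame by setting attributes directly; names are untouched.
via_structure <- structure(cols, class = "data.frame",
                           row.names = seq_len(nrow(df)))

# The default as.data.frame() call runs the names through make.names().
via_as_df <- as.data.frame(cols)

names(via_structure)
names(via_as_df)
```

The first names() call should show the original names, the second the altered ones, which is the hassle I was describing.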
Mike
"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
-- xkcd
--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en
On Thu, Jul 1, 2010 at 5:07 PM, Hadley Wickham <[email protected]> wrote:
> Here's another version that's a bit easier to read:
>
> na.roughfix2 <- function (object, ...) {
>   res <- lapply(object, roughfix)
>   structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
> }
>
> roughfix <- function(x) {
>   missing <- is.na(x)
>   if (!any(missing)) return(x)
>
>   if (is.numeric(x)) {
>     x[missing] <- median.default(x[!missing])
>   } else if (is.factor(x)) {
>     freq <- table(x)
>     x[missing] <- names(freq)[which.max(freq)]
>   } else {
>     stop("na.roughfix only works for numeric or factor")
>   }
>   x
> }
>
> I'm cheating a bit because as.data.frame is so slow.
>
> Hadley
>
> On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <[email protected]> wrote:
> > Jim, Andy,
> >
> > Thanks for your suggestions!
> >
> > I found some time today to futz around with it, and I found a "home
> > made" script to fill in NA values to be much quicker. For those who are
> > interested, instead of using:
> >
> > dataSet <- na.roughfix(dataSet)
> >
> >
> >
> > I used:
> >
> > origCols <- names(dataSet)
> > ## Fix numeric values...
> > dataSet <- as.data.frame(
> >     lapply(dataSet, FUN = function(x) {
> >         if (!is.numeric(x)) { x } else {
> >             ifelse(is.na(x), median(x, na.rm = TRUE), x) }
> >     }),
> >     row.names = row.names(dataSet))
> > ## Fix factors...
> > dataSet <- as.data.frame(
> >     lapply(dataSet, FUN = function(x) {
> >         if (!is.factor(x)) { x } else {
> >             ## which.max(table(x)) is the index of the most frequent level
> >             levels(x)[ifelse(!is.na(x), as.integer(x),
> >                              which.max(table(x)))] }
> >     }),
> >     row.names = row.names(dataSet))
> > names(dataSet) <- origCols
> >
> >
> >
> > In one case study that I ran, the na.roughfix() algo took 296 seconds
> > whereas the homemade one above took 16 seconds.
> >
> > Regards,
> > Mike
> >
> >
> >
> >
> > On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <[email protected]> wrote:
> >
> >> You need to isolate the problem further, or give more detail about your
> >> data. This is what I get:
> >>
> >> R> nr <- 2134
> >> R> nc <- 14037
> >> R> x <- matrix(runif(nr*nc), nr, nc)
> >> R> n.na <- round(nr*nc/10)
> >> R> x[sample(nr*nc, n.na)] <- NA
> >> R> system.time(x.fixed <- na.roughfix(x))
> >>    user  system elapsed
> >>    8.44    0.39    8.85
> >>
> >> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with
> >> 2GB RAM.
> >>
> >> Andy
> >>
> >> ------------------------------
> >> *From:* Mike Williamson [mailto:[email protected]]
> >> *Sent:* Thursday, July 01, 2010 12:48 PM
> >> *To:* Liaw, Andy
> >> *Cc:* r-help
> >> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
> >> so slow??
> >>
> >> Andy,
> >>
> >> You're right, I didn't supply any code, because my call was very
> >> simple and it was the call itself in question. However, here is the
> >> associated code I am using:
> >>
> >>
> >> naFixTime <- system.time( {
> >>     if (fltrResponse) {
> >>         ## TRUE: there are no NA's in the response...
> >>         ## cleared via earlier steps
> >>         message(paste(iAm, ": Missing values will now be imputed...\n",
> >>                       sep=""))
> >>         try( dataSet <- rfImpute(
> >>                  dataSet[,!is.element(names(dataSet), response)],
> >>                  dataSet[,response]) )
> >>     } else {
> >>         ## In this case, there is no "response" column in the data set
> >>         message(paste(iAm,
> >>                       ": Missing values will now be filled in with median",
> >>                       " values or most frequent levels", sep=""))
> >>         try( dataSet <- na.roughfix(dataSet) )
> >>     }
> >> } )
> >>
> >>
> >>
> >> As you can see, the "na.roughfix" call is made as simply as possible:
> >> I supply the entire dataSet (only parameters, no responses). I am not
> >> doing the prediction here (that is done later, and the prediction itself
> >> is not taking very long).
> >> Here are some calculation times that I experienced:
> >>
> >> # rows    # cols    time to run na.roughfix
> >> ======    ======    =======================
> >>   2046      2833    ~ 2 minutes
> >>   2066      5626    ~ 6 minutes
> >>   2134     14037    ~ 30 minutes
> >>
> >> These numbers are on a Windows server using the 64-bit version of 'R'.
> >>
> >> Regards,
> >> Mike
> >>
> >>
> >>
> >> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <[email protected]> wrote:
> >>
> >>> You have not shown any code on exactly how you use na.roughfix(), so I
> >>> can only guess.
> >>>
> >>> If you are doing something like:
> >>>
> >>> randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
> >>>
> >>> I would not be surprised that it's taking very long on large datasets.
> >>> Most likely it's caused by the formula interface, not na.roughfix()
> >>> itself.
> >>>
> >>> If that is your case, try doing the imputation beforehand and run
> >>> randomForest() afterward; e.g.,
> >>>
> >>> myroughfixed <- na.roughfix(mybigdata)
> >>> randomForest(myroughfixed[list.of.predictor.columns],
> >>> myroughfixed[[myresponse]],...)
> >>>
> >>> HTH,
> >>> Andy
> >>>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:[email protected]]
> >>> On Behalf Of Mike Williamson
> >>> Sent: Wednesday, June 30, 2010 7:53 PM
> >>> To: r-help
> >>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
> >>> slow??
> >>>
> >>> Hi all,
> >>>
> >>> I am using the package "random forest" for random forest predictions. I
> >>> like the package. However, I have fairly large data sets, and it can
> >>> often take *hours* just to go through the "na.roughfix" call, which
> >>> simply goes through and cleans up any NA values to either the median
> >>> (numerical data) or the most frequent occurrence (factors).
> >>> I am going to start doing some comparisons between na.roughfix() and
> >>> some apply() functions which, it seems, are able to do the same job more
> >>> quickly. But I hesitate to duplicate a function that is already in the
> >>> package, since I presume the na.roughfix should be as quick as possible
> >>> and it should also be well "tailored" to the requirements of random
> >>> forest.
> >>>
> >>> Has anyone else seen that this is really slow? (I haven't noticed
> >>> rfImpute to be nearly as slow, but I cannot say for sure: my "predict"
> >>> data sets are MUCH larger than my model data sets, so cleaning the
> >>> prediction data set simply takes much longer.)
> >>> If so, any ideas how to speed this up?
> >>>
> >>> Thanks!
> >>> Mike
> >>>
> >>>
> >>>
> >>> ______________________________________________
> >>> [email protected] mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>> Notice: This e-mail message, together with any attachments, contains
> >>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> >>> New Jersey, USA 08889), and/or its affiliates (direct contact
> >>> information for affiliates is available at
> >>> http://www.merck.com/contact/contacts.html) that may be confidential,
> >>> proprietary copyrighted and/or legally privileged. It is intended
> >>> solely for the use of the individual or entity named on this message.
> >>> If you are
> >>> not the intended recipient, and have received this message in error,
> >>> please notify us immediately by reply e-mail and then delete it from
> >>> your system.
> >>>
> >>>
> >
> >
>
>
>
> --
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/
>