Here's another version that's a bit easier to read:

na.roughfix2 <- function (object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

I'm cheating a bit because as.data.frame is so slow.

Hadley

On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is....@gmail.com> wrote:
> Jim, Andy,
>
>    Thanks for your suggestions!
>
>    I found some time today to futz around with it, and I found a "home
> made" script to fill in NA values to be much quicker.  For those who are
> interested, instead of using:
>
>      dataSet <- na.roughfix(dataSet)
>
> I used:
>
>      origCols <- names(dataSet)
>      ## Fix numeric values...
>      dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>          if(!is.numeric(x)) { x } else {
>              ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
>          row.names=row.names(dataSet) )
>      ## Fix factors...
>      ## which.max(table(x)) is the index of the most frequent level
>      dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>          if(!is.factor(x)) { x } else {
>              levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] } } ),
>          row.names=row.names(dataSet) )
>      names(dataSet) <- origCols
>
>    In one case study that I ran, the na.roughfix() algo took 296 seconds
> whereas the homemade one above took 16 seconds.
>
>                                Regards,
>                                       Mike
>
> "Telescopes and bathyscaphes and sonar probes of Scottish lakes,
> Tacoma Narrows bridge collapse explained with abstract phase-space maps,
> Some x-ray slides, a music score, Minard's Napoleonic war:
> The most exciting frontier is charting what's already here."
>  -- xkcd
>
> --
> Help protect Wikipedia. Donate now:
> http://wikimediafoundation.org/wiki/Support_Wikipedia/en
>
>
> On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_l...@merck.com> wrote:
>
>> You need to isolate the problem further, or give more detail about your
>> data.
>> This is what I get:
>>
>> R> nr <- 2134
>> R> nc <- 14037
>> R> x <- matrix(runif(nr*nc), nr, nc)
>> R> n.na <- round(nr*nc/10)
>> R> x[sample(nr*nc, n.na)] <- NA
>> R> system.time(x.fixed <- na.roughfix(x))
>>    user  system elapsed
>>    8.44    0.39    8.85
>>
>> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
>> ram.
>>
>> Andy
>>
>> ------------------------------
>> *From:* Mike Williamson [mailto:this.is....@gmail.com]
>> *Sent:* Thursday, July 01, 2010 12:48 PM
>> *To:* Liaw, Andy
>> *Cc:* r-help
>> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
>> so slow??
>>
>> Andy,
>>
>>     You're right, I didn't supply any code, because my call was very simple
>> and it was the call itself in question.  However, here is the associated
>> code I am using:
>>
>>     naFixTime <- system.time( {
>>         if (fltrResponse) {  ## TRUE: there are no NA's in the
>>                              ## response... cleared via earlier steps
>>             message(paste(iAm, ": Missing values will now be imputed...\n",
>>                           sep=""))
>>             try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet),
>>                                                           response)],
>>                                      dataSet[,response]) )
>>         } else {  ## In this case, there is no "response" column in the
>>                   ## data set
>>             message(paste(iAm, ": Missing values will now be filled in with median",
>>                           " values or most frequent levels", sep=""))
>>             try( dataSet <- na.roughfix(dataSet) )
>>         }
>>     } )
>>
>>     As you can see, the "na.roughfix" call is made as simply as possible:
>> I supply the entire dataSet (only parameters, no responses).  I am not doing
>> the prediction here (that is done later, and the prediction itself is not
>> taking very long).
>>     Here are some calculation times that I experienced:
>>
>>     # rows    # cols    time to run na.roughfix
>>     =======   =======   =======================
>>     2046      2833      ~ 2 minutes
>>     2066      5626      ~ 6 minutes
>>     2134      14037     ~ 30 minutes
>>
>>     These numbers are on a Windows server using the 64-bit version of 'R'.
>>
>>                           Regards,
>>                                Mike
>>
>>
>> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:
>>
>>> You have not shown any code on exactly how you use na.roughfix(), so I
>>> can only guess.
>>>
>>> If you are doing something like:
>>>
>>>   randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>>>
>>> I would not be surprised that it's taking very long on large datasets.
>>> Most likely it's caused by the formula interface, not na.roughfix()
>>> itself.
>>>
>>> If that is your case, try doing the imputation beforehand and run
>>> randomForest() afterward; e.g.,
>>>
>>>   myroughfixed <- na.roughfix(mybigdata)
>>>   randomForest(myroughfixed[list.of.predictor.columns],
>>>                myroughfixed[[myresponse]],...)
>>>
>>> HTH,
>>> Andy
>>>
>>> -----Original Message-----
>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>>> On Behalf Of Mike Williamson
>>> Sent: Wednesday, June 30, 2010 7:53 PM
>>> To: r-help
>>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
>>> slow??
>>>
>>> Hi all,
>>>
>>>    I am using the package "random forest" for random forest
>>> predictions.  I like the package.  However, I have fairly large data
>>> sets, and it can often take *hours* just to go through the
>>> "na.roughfix" call, which simply goes through and cleans up any NA
>>> values to either the median (numerical data) or the most frequent
>>> occurrence (factors).
>>>    I am going to start doing some comparisons between na.roughfix() and
>>> some apply() functions which, it seems, are able to do the same job more
>>> quickly.
>>> But I hesitate to duplicate a function that is already in the
>>> package, since I presume the na.roughfix should be as quick as possible
>>> and it should also be well "tailored" to the requirements of random
>>> forest.
>>>
>>>    Has anyone else seen that this is really slow?  (I haven't noticed
>>> rfImpute to be nearly as slow, but I cannot say for sure:  my "predict"
>>> data sets are MUCH larger than my model data sets, so cleaning the
>>> prediction data set simply takes much longer.)
>>>    If so, any ideas how to speed this up?
>>>
>>>                              Thanks!
>>>                                   Mike
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> Notice:  This e-mail message, together with any attachments, contains
>>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
>>> New Jersey, USA 08889), and/or its affiliates Direct contact information
>>> for affiliates is available at
>>> http://www.merck.com/contact/contacts.html) that may be confidential,
>>> proprietary copyrighted and/or legally privileged. It is intended solely
>>> for the use of the individual or entity named on this message. If you are
>>> not the intended recipient, and have received this message in error,
>>> please notify us immediately by reply e-mail and then delete it from
>>> your system.
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
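
A rough way to see the as.data.frame() overhead mentioned at the top of this thread is to build the same wide data frame both ways and time each. This is only a sketch in base R: the row and column counts and the names cols, df_slow, and df_fast are made up for illustration, and timings will vary by machine and R version.

```r
# Illustrative wide data: many numeric columns, few rows.
# The sizes are arbitrary assumptions, not taken from the thread.
set.seed(1)
nr <- 200
nc <- 2000
cols <- replicate(nc, runif(nr), simplify = FALSE)
names(cols) <- paste0("V", seq_len(nc))

# Route 1: the general-purpose constructor, which checks and converts
# every column.
t_slow <- system.time(df_slow <- as.data.frame(cols))

# Route 2: the shortcut used in na.roughfix2(), which only attaches the
# class and row names to the existing list.
t_fast <- system.time(
  df_fast <- structure(cols, class = "data.frame",
                       row.names = seq_len(nr))
)

# For plain numeric columns of equal length the two results should match.
stopifnot(isTRUE(all.equal(df_slow, df_fast)))

# Show the two timings side by side.
print(rbind(as.data.frame = t_slow, structure = t_fast))
```

The structure() route skips the per-column checks, so it is only safe when every list element is already a plain vector of the same length, which is the case after lapply(object, roughfix).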