[R] failure with merge
I am merging two data frames:

tuneAcc <- structure(list(select = c(FALSE, TRUE),
    method = structure(c(1L, 1L), .Label = "GCV.Cp", class = "factor"),
    RMSE = c(29.2102056093962, 28.9743318817886),
    Rsquared = c(0.0322612161559773, 0.0281713457306074),
    RMSESD = c(0.981573768028697, 0.791307778398384),
    RsquaredSD = c(0.0388188469162352, 0.0322578925071113)),
    .Names = c("select", "method", "RMSE", "Rsquared", "RMSESD", "RsquaredSD"),
    class = "data.frame", row.names = 1:2)

finalTune <- structure(list(select = TRUE,
    method = structure(1L, .Label = "GCV.Cp", class = "factor"),
    Selected = "*"),
    .Names = c("select", "method", "Selected"),
    row.names = 2L, class = "data.frame")

using

merge(x = tuneAcc, y = finalTune, all.x = TRUE)

The error is

"Error in match.arg(method) : 'arg' must be NULL or a character vector"

This is R version 3.3.1 (2016-06-21), Platform: x86_64-apple-darwin13.4.0 (64-bit), Running under: OS X 10.11.5 (El Capitan).

These do not stop execution:

merge(x = tuneAcc, y = finalTune)
merge(x = tuneAcc, y = finalTune, all.x = TRUE, sort = FALSE)

The latter produces (what I consider to be) incorrect results.

Walking through the code, the original call with just `all.x = TRUE` fails when sorting at the line:

res <- res[if (all.x || all.y)
               do.call("order", x[, seq_len(l.b), drop = FALSE])
           else sort.list(bx[m$xi]), , drop = FALSE]

Specifically, on the `do.call` bit. For these data:

Browse[3]> x
  select method     RMSE   Rsquared    RMSESD RsquaredSD
2   TRUE GCV.Cp 28.97433 0.02817135 0.7913078 0.03225789
1  FALSE GCV.Cp 29.21021 0.03226122 0.9815738 0.03881885

Browse[3]> x[, seq_len(l.b), drop = FALSE]
  select method
2   TRUE GCV.Cp
1  FALSE GCV.Cp

and this line executes:

Browse[3]> order(x[, seq_len(l.b), drop = FALSE])
[1] 1 2 3 4

although nrow(x) = 2, so this is an issue.

Calling it this way stops execution:

Browse[3]> do.call("order", x[, seq_len(l.b), drop = FALSE])
Error in match.arg(method) : 'arg' must be NULL or a character vector

Thanks,

Max

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
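One possible reading of the failure above, offered as a sketch rather than a confirmed diagnosis: `do.call("order", ...)` passes each data frame column as a named argument, so a key column that happens to be called `method` gets matched to `order()`'s own `method` argument and ends up in `match.arg()`. The toy data frame `x` below is a hypothetical stand-in for the merge key columns, and stripping the names with `unname()` is only an assumed workaround for illustration, not a statement about how `merge()` is or should be implemented.

```r
# Hypothetical stand-in for the key columns that merge() hands to order()
x <- data.frame(select = c(TRUE, FALSE),
                method = factor(c("GCV.Cp", "GCV.Cp")))

# Columns become named arguments, so `method` collides with order(method = )
named_call <- try(do.call("order", x), silent = TRUE)
failed <- inherits(named_call, "try-error")
print(failed)

# Dropping the names makes every column a positional sort key instead
positional <- do.call("order", unname(as.list(x)))
print(positional)
```

If this reading is right, renaming the `method` column before merging (and renaming it back afterwards) would be another way to sidestep the clash.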
Re: [R] Installing Caret
The problem is not with `caret`. Your output says:

> installation of package ‘minqa’ had non-zero exit status

`caret` has a dependency that has a dependency on `minqa`. The same is true for `RcppEigen` and the others.

What code did you use to do the install? What OS and version of R etc?

On Thu, Jun 16, 2016 at 4:49 AM, TJUN KIAT TEO wrote:

> I am trying to install the package but am I keep getting this error
> messages
>
> installation of package ‘minqa’ had non-zero exit status
>
> 2: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘RcppEigen’ had non-zero exit status
>
> 3: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘SparseM’ had non-zero exit status
>
> 4: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘lme4’ had non-zero exit status
>
> 5: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘quantreg’ had non-zero exit status
>
> 6: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘pbkrtest’ had non-zero exit status
>
> 7: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘car’ had non-zero exit status
>
> 8: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘caret’ had non-zero exit status
>
> Anyone has any idea what wrong?
>
> Tjun Kiat
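A minimal sketch of how one might narrow this down, assuming the dependency names from the warnings quoted above. The helper `not_installed()` is hypothetical (it is not part of any package); it only reports which of those packages cannot be loaded, so that the first failing one can be installed on its own and its full build log read.

```r
# Dependencies named in the warnings quoted above
deps <- c("minqa", "RcppEigen", "SparseM", "lme4", "quantreg", "pbkrtest", "car")

# Hypothetical helper: return the packages whose namespaces cannot be loaded
not_installed <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

still_missing <- not_installed(deps)
print(still_missing)

# Installing the first missing package alone shows its complete error output:
# install.packages(still_missing[1])

# OS and R version details that the reply asks for
print(sessionInfo())
```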
Re: [R] Problem while predicting in regression trees
I've brought this up numerous times... you shouldn't use `predict.rpart` (or whatever modeling function) from the `finalModel` object. That object has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy variables. `rpart` does not require that and the `finalModel` object has no idea that that happened.

Using `predict.train` works just fine so why not use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.583 -1217.3
                3                 3         6       3
-886.6667 -408.375 -375.7 -240.307692307692
        5        1      4                 5
-201.612903225806 -19.6071428571429 30.80833 43.9
307266 9
151.5 209.647058823529
628

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #
> tree_pred.sse
>
> ...
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> *muhammad2.bi...@live.uwe.ac.uk* <olugbenga2.akin...@live.uwe.ac.uk>
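The point about `predict.train` versus the `finalModel` object can be sketched as follows. This is an illustration under assumptions, not the poster's data: the built-in `iris` data stands in for `pfi`, with `Species` playing the role of the `sector` factor, and the names `fit`, `pred_train` and `pred_final` are made up for the example.

```r
library(caret)

set.seed(1)
# iris has a factor predictor (Species), standing in for `sector`
fit <- train(Sepal.Length ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# predict() on the train object repeats the dummy-variable conversion
# before handing the data to rpart
pred_train <- predict(fit, newdata = iris)
print(head(pred_train))

# The underlying rpart object was fit to dummy columns, not to `Species`,
# so the same call on finalModel may fail; try() keeps the script running
pred_final <- try(predict(fit$finalModel, newdata = iris), silent = TRUE)
print(class(pred_final))
```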
Re: [R] Problem while predicting in regression trees
It is extremely difficult to tell what the issue might be without a reproducible example.

The only thing that I can suggest is to use the non-formula interface to `train` so that you can avoid creating dummy variables.

On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <muhammad2.bi...@live.uwe.ac.uk> wrote:

> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets, however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        9
> 2        Hospitals      101
> 3          Housing       32
> 4           Others       99
> 5 Public Buildings       39
> 6          Schools      148
> 7      Social Care       10
> 8   Transportation       27
> 9            Waste       26
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        5
> 2        Hospitals       47
> 3          Housing       11
> 4           Others       44
> 5 Public Buildings       18
> 6          Schools       69
> 7      Social Care        9
> 8   Transportation        8
> 9            Waste       12
>
> Any thing else to try?
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
> From: Bert Gunter
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal wrote:
> > Hi All,
> >
> > I have the following script, that raises error at the last command. I am
> > new to R and require some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> > #Structure of the trainPFI data frame
> > > str(trainPFI)
> > ***
> > 'data.frame': 491 obs. of 16 variables:
> > $ project_id : int 1 2 3 6 7 9 10 12 13 14 ...
> > $ project_lat: num 51.4 51.5 52.2 51.9 52.5 ...
> > $ project_lon: num -0.642 -1.85 0.08 -0.401 -1.888 ...
> > $ sector : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> > $ contract_type : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> > $ project_duration : int 1826 3652 121 730 730 790 522 819 998 372 ...
> > $ project_delay : int -323 0 -60 0 0 0 -91 0 0 7 ...
> > $ capital_value : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> > $ project_delay_pct : num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> > $ delay_type : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> > project_duration + sector + contract_type + capital_value, data = trainPFI,
> > method="rpart", trControl=tr.control, tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp     RMSE      Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.551
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.551
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > Alright, all the aforementioned commands worked fine.
> >
> > Except the subsequent command raises error, when the developed model is
> > used to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> >
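The suggestion in the reply above, using the non-formula interface to `train`, can be sketched like this. It is an assumption-laden illustration: `iris` replaces the poster's data, `predictors` and `fit_xy` are made-up names, and the resampling settings are arbitrary.

```r
library(caret)

set.seed(1)
# Keep the factor column (Species) as a factor; no dummy variables are created
predictors <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]

fit_xy <- train(x = predictors, y = iris$Sepal.Length, method = "rpart",
                trControl = trainControl(method = "cv", number = 5))

# New data only needs the same columns as `predictors`
pred_xy <- predict(fit_xy, newdata = predictors)
print(head(pred_xy))
```

With this interface the tree can split on the factor directly, which is closer to what `rpart` would do when called on its own.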
Re: [R] Mixture Discriminant Analysis and Penalized LDA
There is a function called `smda` in the sparseLDA package that implements the model described in

Clemmensen, L., Hastie, T., Witten, D. and Ersbøll, B. Sparse discriminant analysis, Technometrics, 53(4): 406-413, 2011

Max

On Sun, Jan 24, 2016 at 10:45 PM, TJUN KIAT TEO wrote:

> Hi
>
> I noticed we have MDA and Mclust for Mixture Discriminant Analysis and
> Penalized LDA. Do we have a R packages for Penalized MDA?
>
> Tjun Kiat
Re: [R] Caret - Recursive Feature Elimination Error
Providing a reproducible example and the results of `sessionInfo` will help get your question answered.

Also, what is the point of using glmnet with RFE? It already does feature selection.

On Wed, Dec 23, 2015 at 1:48 AM, Manish MAHESHWARI wrote:

> Hi,
>
> I am trying to use caret, for feature selection on glmnet. I get a strange
> error like below - "arguments imply differing number of rows: 2, 3".
>
> > x <- data.matrix(train[,features])
> > y <- train$quoteconversion_flag
>
> > str(x)
> num [1:260753, 1:297] NA NA NA NA NA NA NA NA NA NA ...
> - attr(*, "dimnames")=List of 2
> ..$ : NULL
> ..$ : chr [1:297] "original_quote_date" "field6" "field7" "field8" ...
> > str(y)
> Factor w/ 2 levels "X0","X1": 1 1 1 1 1 1 1 1 1 1 ...
>
> > RFE <- rfe(x,y,sizes = seq(50,300,by=10),
> +metric = "ROC",maximize=TRUE,rfeControl = MyRFEcontrol,
> +method='glmnet',
> +tuneGrid = expand.grid(.alpha=0,.lambda=c(0.01,0.02)),
> +trControl = MyTrainControl)
> +(rfe) fit Resample01 size: 297
> +(rfe) fit Resample02 size: 297
> +(rfe) fit Resample03 size: 297
> +(rfe) fit Resample04 size: 297
> +(rfe) fit Resample05 size: 297
> +(rfe) fit Resample06 size: 297
> +(rfe) fit Resample07 size: 297
> +(rfe) fit Resample08 size: 297
> +(rfe) fit Resample09 size: 297
> +(rfe) fit Resample10 size: 297
> +(rfe) fit Resample11 size: 297
> +(rfe) fit Resample12 size: 297
> +(rfe) fit Resample13 size: 297
> +(rfe) fit Resample14 size: 297
> +(rfe) fit Resample15 size: 297
> +(rfe) fit Resample16 size: 297
> +(rfe) fit Resample17 size: 297
> +(rfe) fit Resample18 size: 297
> +(rfe) fit Resample19 size: 297
> +(rfe) fit Resample20 size: 297
> +(rfe) fit Resample21 size: 297
> +(rfe) fit Resample22 size: 297
> +(rfe) fit Resample23 size: 297
> +(rfe) fit Resample24 size: 297
> +(rfe) fit Resample25 size: 297
> Error in { :
> task 1 failed - "task 1 failed - "arguments imply differing number of
> rows: 2, 3""
> In addition: There were 50 or more warnings (use warnings() to see the
> first 50)
>
> Any idea what does this mean?
>
> Thanks,
> Manish
Re: [R] Error in 'Contrasts<-' while using GBM.
Providing a reproducible example and the results of `sessionInfo` will help get your question answered.

My only guess is that one or more of your predictors are factors and that the in-sample data (used to build the model during resampling) have different levels than the holdout samples.

Max

On Sat, Nov 28, 2015 at 10:04 PM, Karteek Pradyumna Bulusu <kartikpradyumn...@gmail.com> wrote:

> Hey,
>
> I was trying to implement Stochastic Gradient Boosting in R. Following is
> my code in rstudio:
>
> library(caret);
> library(gbm);
> library(plyr);
> library(survival);
> library(splines);
> library(mlbench);
> set.seed(35);
> stack = read.csv("E:/Semester 3/BDA/PROJECT/Sample_SO.csv", head
> =TRUE,sep=",");
> dim(stack); #displaying dimensions of the dataset
>
> #SPLITTING TRAINING AND TESTING SET
> totraining <- createDataPartition(stack$ID, p = .6, list = FALSE);
> training <- stack[ totraining,]
> test <- stack[-totraining,]
>
> #PARAMETER SETTING
> t_control <- trainControl(method = "cv", number = 10);
>
> # GLM
> start <- proc.time();
>
> glm = train(ID ~ ., data = training,
> method = "gbm",
> metric = "ROC",
> trControl = t_control,
> verbose = FALSE)
>
> When I am compiling last line, I am getting following error:
>
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
> contrasts can be applied only to factors with 2 or more levels
>
> Can anyone tell me where I am going wrong and How to rectify it. It’ll be
> greatful.
>
> Thank you. Looking forward to it.
>
> Regards,
> Karteek Pradyumna Bulusu.
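One way to check the guess in the reply above is to look for categorical predictors that have fewer than two observed values in a given subset of the data. The sketch below is only an illustration: `single_level_cols()` is a hypothetical helper, and `toy` is made-up data standing in for a training subset.

```r
# Hypothetical helper: names of factor or character columns that have
# fewer than two distinct non-missing values in `df`
single_level_cols <- function(df) {
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col),
                   logical(1))
  n_lev <- vapply(df[is_cat],
                  function(col) length(unique(col[!is.na(col)])),
                  integer(1))
  names(n_lev)[n_lev < 2]
}

# Made-up data: `site` has a single observed level
toy <- data.frame(ID = c("a", "b", "a", "b"),
                  site = factor(c("x", "x", "x", "x")),
                  score = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)

flagged <- single_level_cols(toy)
print(flagged)
```

Running the same helper on the full training set and on a few resampled subsets would show whether a column degenerates only after splitting.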
Re: [R] Ensure distribution of classes is the same as prior distribution in Cross Validation
Right now, using `method = "cv"` or `method = "repeatedcv"` does stratified sampling. Depending on what you mean by "ensure" and the nature of your outcome (categorical?), it probably already does.

On Mon, Nov 23, 2015 at 7:04 PM, TJUN KIAT TEO wrote:

> In the caret train control function, is it possible to ensure Ensure
> distribution of classes is the same as prior distribution in the folds of
> cross validation? I know it can be done using create folds but was
> wondering if it is possible using train control?
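A quick way to see the stratification for a categorical outcome is to build folds with `createFolds` (the function mentioned in the question) and compare the class proportions in each fold with the overall ones. This is a sketch on made-up data; the outcome `y` with a 90/10 split and the name `fold_props` are assumptions for illustration, and it checks `createFolds` directly rather than the internals of `train`.

```r
library(caret)

set.seed(1)
# Made-up outcome with a 10% minority class
y <- factor(rep(c("common", "rare"), times = c(90, 10)))

folds <- createFolds(y, k = 5)

# Share of "rare" in each held-out fold, to compare with the overall 0.10
fold_props <- vapply(folds, function(idx) mean(y[idx] == "rare"), numeric(1))
print(fold_props)
print(mean(y == "rare"))
```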
Re: [R] Caret Internal Data Representation
Providing a reproducible example and the results of `sessionInfo` will help get your question answered. For example, did you use the formula or non-formula interface to `train`, and so on.

On Thu, Nov 5, 2015 at 1:10 PM, Bert Gunter wrote:

> I am not familiar with caret/Cubist, but assuming they follow the
> usual R procedures that encode categorical factors for conditional
> fitting, you need to do some homework on your own by reading up on the
> use of contrasts in regression.
>
> See ?factor and ?contrasts (and other linked Help as necessary) to see
> what are R's usual procedures, but you will undoubtedly need to
> consult outside statistical references -- the help files will point
> you to some -- to fully understand what's going on. It is not trivial.
>
> Cheers,
> Bert
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> -- Clifford Stoll
>
> On Thu, Nov 5, 2015 at 9:38 AM, Lorenzo Isella wrote:
> > Dear All,
> > I have a data set which contains both categorical and numerical
> > variables which I analyze using Cubist+the caret framework.
> > Now, from the generated rules, it is clear that cubist does something
> > to the categorical variables and probably uses some dummy coding for
> > them.
> > However, I cannot right now access the data the way it is transformed
> > by cubist.
> > If caret (or the package) need to do some dummy coding of the factors,
> > how can I access the newly encoded data set?
> > I suppose this applies to plenty of other packages.
> > Any suggestion is welcome.
> > Cheers
> >
> > Lorenzo
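For the usual R encoding of factors that the quoted reply refers to, `model.matrix` shows the dummy-coded columns directly. The sketch below uses a small made-up data frame; whether caret hands exactly these columns to Cubist depends on which interface to `train` was used, so treat it as an approximation of the formula-interface behaviour rather than a view into the fitted object.

```r
# Made-up data with one factor (g) and one numeric predictor (x)
df <- data.frame(y = c(1, 2, 3, 4),
                 g = factor(c("a", "b", "c", "a")),
                 x = c(0.5, 1.5, 2.5, 3.5))

# Default treatment contrasts: level "a" is the baseline,
# so g is expanded into indicator columns for "b" and "c"
mm <- model.matrix(y ~ ., data = df)
print(colnames(mm))
print(mm)
```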
Re: [R] Imbalanced random forest
This might help:

http://bit.ly/1MUP0Lj

On Wed, Jul 29, 2015 at 11:00 AM, jpara3 j.para.fernan...@hotmail.com wrote:

> ¿How can i set up a study with random forest where the response is highly
> imbalanced?
Re: [R] what constitutes a 'complete sentence'?
On Tue, Jul 7, 2015 at 8:19 AM, John Fox j...@mcmaster.ca wrote:

Dear Peter,

You're correct that these examples aren't verb phrases (though the second one contains a verb phrase). I don't want to make the discussion even more pedantic (moving it in this direction was my fault), but Paragraph isn't quite right, unless explained, because conventionally a paragraph consists of sentences. How about something like this?

One can use several complete sentences or punctuated telegraphic phrases, but only one paragraph (that is, block of continuous text with no intervening blank lines). The description should end with a full stop (period).

Before we start crafting better definitions of the rule, it seems important to understand what issue we are trying to solve. I don't see any place where this has been communicated.

As I said previously, I usually give them the benefit of the doubt. However, this requirement is poorly implemented and we need to know more. For example, does CRAN need to parse the text and the code failed because there was no period? It seems plausible that someone could have worded that requirement in the current form, but it is poorly written (which is unusual).

If the goal is to improve the quality of the description text, then that is a more difficult issue to define, and good luck coding your way into a lucid and effective set of rules. It also seems a bit over the top to me and a poor choice of where everyone should be spending their time.

What are we trying to fix?

It would likely be helpful to add some examples of good and bad descriptions, and to explain how the check actually works.

Best,
John

On Tue, 7 Jul 2015 12:20:38 +0200 peter dalgaard pda...@gmail.com wrote:

...except that there is not necessarily a verb either. What we're looking for is something like advertisement style as in

UGLY MUGS 7.95. An invaluable addition to your display cabinet. Comes in an assortment of warts and wrinkles, crafted by professional artist Foo Yung.

However, I'm drawing blanks when searching for an established term for it.

Could we perhaps sidestep the issue by requesting a single descriptive paragraph, with punctuation or thereabouts?

I'm still puzzled about what threw Federico's example in the first place. The actual code is

if(strict && !is.na(val <- db["Description"])
   && !grepl("[.!?]['\")]?$", trimws(val)))
    out$bad_Description <- TRUE

and I can do this

> strict <- TRUE
> db <- tools:::.read_description("/tmp/dd")
> if(strict && !is.na(val <- db["Description"])
+    && !grepl("[.!?]['\")]?$", trimws(val)))
+     out$bad_Description <- TRUE
> out
Error: object 'out' not found

I.e., the complaint should _not_ be triggered. I suppose that something like a non-breakable space at the end could confuse trimws(), but beyond that I'm out of ideas.

On 07 Jul 2015, at 03:28 , John Fox j...@mcmaster.ca wrote:

Dear Peter,

I think that the grammatical term you're looking for is verb phrase.

Best,
John

On Tue, 7 Jul 2015 00:12:25 +0200 peter dalgaard pda...@gmail.com wrote:

On 06 Jul 2015, at 23:19 , Duncan Murdoch murdoch.dun...@gmail.com wrote:

On 06/07/2015 5:09 PM, Rolf Turner wrote:

On 07/07/15 07:10, William Dunlap wrote:

[Rolf Turner wrote.]

The CRAN guidelines should be rewritten so that they say what they *mean*. If a complete sentence is not actually required --- and it seems abundantly clear that it is not --- then guidelines should not say so. Rather they should say, clearly and comprehensibly, what actually *is* required.

This may be true, but also think of the user when you write the description. If you are scanning a long list of descriptions looking for a package to use, seeing a description that starts with 'A package for' just slows you down. Seeing a description that includes 'designed to' leaves you wondering if the implementation is woefully incomplete. You want to go beyond what CRAN can test for.

All very true and sound and wise, but what has this got to do with complete sentences? The package checker issues a message saying that it wants a complete sentence when this has nothing to do with what it *really* wants.

That's false. If you haven't given a complete sentence, you might still pass, but if you have, you will pass. That's not nothing to do with what it really wants, it's just an imperfect test that fails to detect violations of the guidelines.

As we've seen, it sometimes also makes mistakes in the other direction. I'd say those are more serious.

Duncan Murdoch

Ackchewly I don't think what we want is what we say that we want. A quick check suggests that many/most packages use headline speech, as in Provides functions for analysis of foo, with special emphasis on bar., which seems perfectly ok. As others have
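To make the discussion of the check more concrete, the condition quoted in the thread can be tried on a few strings. The pattern below is my reading of the (partly garbled) `grepl` call quoted above, so treat it as an assumption rather than the exact code in the tools package; `ends_like_sentence()` is a hypothetical wrapper. The last case illustrates the suspicion voiced in the thread that a trailing non-breaking space is not removed by `trimws()`.

```r
# Hypothetical wrapper around the pattern quoted in the thread:
# the text must end in . ! or ?, optionally followed by ' " or )
ends_like_sentence <- function(val) {
  grepl("[.!?]['\")]?$", trimws(val))
}

with_period <- ends_like_sentence("Provides functions for analysis of foo.")
no_period   <- ends_like_sentence("Provides functions for analysis of foo")

# Same text followed by a non-breaking space (U+00A0), which the default
# trimws() whitespace set (space, tab, CR, LF) does not strip
with_nbsp <- ends_like_sentence("Provides functions for analysis of foo.\u00a0")

print(c(with_period = with_period, no_period = no_period, with_nbsp = with_nbsp))
```

Note that the pattern says nothing about verbs or sentence structure, which matches the point made in the thread that the test is only a rough proxy for a "complete sentence".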
Re: [R] Caret and custom summary function
The version of caret just put on CRAN has a function called mnLogLoss that does this.

Max

On Mon, May 11, 2015 at 11:17 AM, Lorenzo Isella lorenzo.ise...@gmail.com wrote:

> Dear All,
> I am trying to implement my own metric (a log loss metric) for a
> binary classification problem in Caret.
> I must be making some mistake, because I cannot get anything sensible
> out of it.
> I paste below a numerical example which should run in more or less one
> minute on any laptop.
> When I run it, I finally have an output of the kind
>
> Aggregating results
> Something is wrong; all the LogLoss metric values are missing:
>     LogLoss
>  Min.   : NA
>  1st Qu.: NA
>  Median : NA
>  Mean   :NaN
>  3rd Qu.: NA
>  Max.   : NA
>  NA's   :40
> Error in train.default(x, y, weights = w, ...) : Stopping
> In addition: Warning message:
> In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
> There were missing values in resampled performance measures.
>
> Any suggestion is appreciated.
> Many thanks
>
> Lorenzo
>
> library(caret)
> library(C50)
>
> LogLoss <- function (data, lev = NULL, model = NULL)
> {
>   probs <- pmax(pmin(as.numeric(data$T), 1 - 1e-15), 1e-15)
>   logPreds <- log(probs)
>   log1Preds <- log(1 - probs)
>   real <- (as.numeric(data$obs) - 1)
>   out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
>   names(out) <- c("LogLoss")
>   out
> }
>
> train <- matrix(ncol=5,nrow=200,NA)
> train <- as.data.frame(train)
> names(train) <- c("donation", "x1","x2","x3","x4")
>
> set.seed(134)
> sel <- sample(nrow(train), 0.5*nrow(train))
>
> train$donation[sel] <- "yes"
> train$donation[-sel] <- "no"
>
> train$x1 <- seq(nrow(train))
> train$x2 <- rnorm(nrow(train))
> train$x3 <- 1/train$x1
> train$x4 <- sample(nrow(train))
>
> train$donation <- as.factor(train$donation)
>
> c50Grid <- expand.grid(trials = 1:10,
>   model = c("tree", "rules"), winnow = c(TRUE, FALSE))
>
> tc <- trainControl(method = "repeatedCV", summaryFunction=LogLoss,
>   number = 10, repeats = 10, verboseIter=TRUE, classProbs=TRUE)
>
> model <- train(donation~., data=train, method="C5.0", trControl=tc,
>   metric="LogLoss", maximize=FALSE, tuneGrid=c50Grid)
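For readers who want to see what a log loss summary function could look like alongside the packaged `mnLogLoss`, here is a minimal sketch. It assumes what caret passes to a summary function when `classProbs = TRUE`: a data frame with an `obs` column and one probability column per class level, named after the levels (so for levels "no" and "yes" there would be no column called `T`, which might explain the missing values in the question). The name `logLossSummary` and the toy data are made up for illustration.

```r
# Sketch of a two-class log loss summary function for trainControl().
# `lev` holds the outcome levels; probability columns are assumed to be
# named after those levels.
logLossSummary <- function(data, lev = NULL, model = NULL) {
  # Probability of the first class level, clipped away from 0 and 1
  probs <- pmax(pmin(data[[lev[1]]], 1 - 1e-15), 1e-15)
  # 1 when the observed class is the first level, 0 otherwise
  is_first <- as.numeric(data$obs == lev[1])
  c(LogLoss = -mean(is_first * log(probs) + (1 - is_first) * log(1 - probs)))
}

# Toy check: uninformative 50/50 predictions should give log(2)
toy <- data.frame(obs = factor(c("no", "yes"), levels = c("no", "yes")),
                  no  = c(0.5, 0.5),
                  yes = c(0.5, 0.5))
toy_loss <- logLossSummary(toy, lev = levels(toy$obs))
print(toy_loss)
```

In a call to `trainControl` this would be supplied as `summaryFunction = logLossSummary` together with `classProbs = TRUE`, and `train` would then use `metric = "LogLoss"` with `maximize = FALSE`.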
Re: [R] Repeated failures to install caret package (of Max Kuhn)
I thought that this might be relevant:

https://stackoverflow.com/questions/28985759/cant-install-the-caret-package-in-r-in-my-linux-machine

but it seems that you installed nloptr. I would also suggest doing the install in base R and trying a different mirror. I would avoid installing via RStudio unless you have just started a new R session.

On Sat, Apr 4, 2015 at 11:11 AM, John Kane jrkrid...@inbox.com wrote:

> Try installing from somewhere outside of RStudio or reboot and retry in
> RStudio. I find that if RStudio is open for a long time I occasionally get
> some weird (buggy?) results but I cannot reproduce to send in an bug report.
>
> Load R and from the command line or Windows RGui try installing.
>
> As a test I just installed it successully with the command
> install.packages("caret") executed in R (using gedit with its R-plug-in)
> and running Ubuntu 14.04
>
> For future reference: Reproducibility
> https://github.com/hadley/devtools/wiki/Reproducibility
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>
> John Kane
> Kingston ON Canada
>
> -Original Message-
> From: wyl...@ischool.utexas.edu
> Sent: Fri, 03 Apr 2015 16:07:57 -0500
> To: r-help@r-project.org
> Subject: [R] Repeated failures to install caret package (of Max Kuhn)
>
> For an edx course, MIT's The Analtics Edge, I need to install the
> caret package that was originated and is maintained by Dr. Max Kuhn of
> Pfizer. So far, every effort I've made to try to
> install.packages("caret") has failed. (I'm using R v. 3.1.3 and RStudio
> v. 0.98.1103 in LinuxMint 17.1)
>
> Here are some of the things I've tried unsuccessfully:
>
> install.packages("caret", repos=c("http://rstudio.org/_packages", "http://cran.rstudio.com"))
> install.packages("caret", dependencies=TRUE)
> install.packages("caret", repos=c("http://rstudio.org/_packages", "http://cran.rstudio.com"), dependencies=TRUE)
> install.packages("caret", dependencies = c("Depends", "Suggests"))
> install.packages("caret", repos="http://cran.rstudio.com/")
>
> I've changed my CRAN mirror from UCLA to Revolution Analytics in
> Dallas, and tried the above installs again, unsuccessfully.
>
> I've succeeded in individually installing a number of packages on
> which caret appears to be dependent. Specifically, I've been able to
> install nloptr, minqa, Rcpp, reshape2, stringr, and scales. But I've had
> no success with trying to do individual installs of BradleyTerry2, car,
> lme4, quantreg, and RcppEigen.
>
> Any suggestions will be very gratefully received (and tried out quickly).
>
> Thanks in advance.
>
> Ron Wyllys
Re: [R] #library(CHAID) - Cross validation for chaid
You can create your own:

http://topepo.github.io/caret/custom_models.html

I put a prototype together. Source this file:

https://github.com/topepo/caret/blob/master/models/files/chaid.R

then try this:

    library(caret)
    library(CHAID)

    ### fit tree to subsample
    set.seed(290875)
    USvoteS <- USvote[sample(1:nrow(USvote), 1000),]

    ## You probably don't want to use `train.formula` as
    ## it will convert the factors to dummy variables
    mod <- train(x = USvoteS[,-1],
                 y = USvoteS$vote3,
                 method = modelInfo,
                 trControl = trainControl(method = "cv"))

Max

On Mon, Jan 5, 2015 at 7:11 AM, Rodica Coderie via R-help r-help@r-project.org wrote:

Hello,

Is there an option of cross validation for CHAID decision tree? An example of CHAID is below:

    library(CHAID)
    example("chaid", package = "CHAID")

How can I use a 10 fold cross-validation for CHAID? I've read that the caret package can cross-validate many types of models, but the CHAID model is not in caret's built-in library.

    library(caret)
    model <- train(vote3 ~ ., data = USvoteS, method = 'CHAID',
                   tuneLength = 10,
                   trControl = trainControl(method = 'cv', number = 10,
                                            classProbs = TRUE,
                                            summaryFunction = twoClassSummary))

Thanks,
Rodica
Re: [R] Help with caret, please
What you are asking is a bad idea on multiple levels. You will grossly over-estimate the area under the ROC curve. Consider the 1-NN model: you will have perfect predictions every time.

To do this, you will need to run train again and modify the index and indexOut objects:

    library(caret)

    set.seed(1)
    dat <- twoClassSim(200)

    set.seed(2)
    folds <- createFolds(dat$Class, returnTrain = TRUE)

    Control <- trainControl(method = "cv",
                            summaryFunction = twoClassSummary,
                            classProbs = TRUE,
                            index = folds,
                            indexOut = folds)

    tGrid <- data.frame(k = 1:100)

    set.seed(3)
    a_bad_idea <- train(Class ~ ., data = dat,
                        method = "knn",
                        tuneGrid = tGrid,
                        trControl = Control,
                        metric = "ROC")

Max

On Sat, Oct 11, 2014 at 7:58 PM, Iván Vallés Pérez ivanvallespe...@gmail.com wrote:

Hello,

I am using the caret package to train a K-Nearest Neighbors algorithm. For this, I am running this code:

    Control <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE)
    tGrid <- data.frame(k = 1:100)
    trainingInfo <- train(Formula, data = trainData, method = "knn", tuneGrid = tGrid,
                          trControl = Control, metric = "ROC")

As you can see, I am interested in obtaining the AUC of the ROC curve. This code works well but returns the testing error (which the algorithm uses for tuning the k parameter of the model) as the mean of the error over the cross-validation folds. In addition to the testing error, I would like to get the training error (the mean across folds of the error obtained on the training data). How can I do it?

Thank you
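The 1-NN remark in this thread can be checked without caret. The sketch below is a base-R illustration on simulated data (the variable names are made up): each training row's nearest neighbour is itself, so the apparent accuracy is perfect even though the labels are pure noise.

```r
# Base-R sketch: 1-nearest-neighbour predictions on the training rows themselves.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 2), ncol = 2)                            # two simulated predictors
y <- factor(sample(c("Class1", "Class2"), n, replace = TRUE))  # labels unrelated to x

d <- as.matrix(dist(x))             # pairwise Euclidean distances, zero on the diagonal
nearest <- apply(d, 1, which.min)   # nearest neighbour of each row (the row itself)
resub_pred <- y[nearest]
resub_acc <- mean(resub_pred == y)  # apparent (training-set) accuracy
```

Any performance measure computed this way says nothing about new data, which is why the training-set ROC is described above as a gross over-estimate.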
Re: [R] Training a model using glm
You have not shown all of your code and it is difficult to diagnose the issue. I assume that you are using the data from:

    library(AppliedPredictiveModeling)
    data(AlzheimerDisease)

If so, there is example code to analyze these data in that package. See ?scriptLocation.

We have no idea how you got to the `training` object (package versions would be nice too).

I suspect that Dennis is correct. Try using more normal syntax without the $ indexing in the formula. I wouldn't say it is (absolutely) wrong but it doesn't look right either.

Max

On Wed, Sep 17, 2014 at 2:04 PM, Mohan Radhakrishnan radhakrishnan.mo...@gmail.com wrote:

Hi Dennis,

Why is there that warning? I think my syntax is right. Isn't it? So can the warning be ignored?

Thanks,
Mohan

On Wed, Sep 17, 2014 at 9:48 PM, Dennis Murphy djmu...@gmail.com wrote:

No reproducible example (i.e., no data) supplied, but the following should work in general, so I'm presuming this maps to the caret package as well. Thoroughly untested.

    library(caret)   # something you failed to mention
    ...
    modelFit <- train(diagnosis ~ ., data = training1)   # presumably a logistic regression
    confusionMatrix(test1$diagnosis, predict(modelFit, newdata = test1, type = "response"))

For GLMs, there are several types of possible predictions. The default is 'link', which associates with the linear predictor. caret may have a different syntax so you should check its help pages re the supported predict methods.

Hint: If a function takes a data = argument, you don't need to specify the variables as components of the data frame - the variable names are sufficient. You should also do some reading to understand why the model formula I used is correct if you're modeling one variable as response and all others in the data frame as covariates.

Dennis

On Tue, Sep 16, 2014 at 11:15 PM, Mohan Radhakrishnan radhakrishnan.mo...@gmail.com wrote:

I answered this question, which was part of the online course, correctly by executing some commands and guessing. But I didn't get the gist of this approach even though my R code works.

I have a training and test dataset.

    > nrow(training)
    [1] 251
    > nrow(testing)
    [1] 82
    > head(training1)
       diagnosis    IL_11    IL_13    IL_16   IL_17E IL_1alpha      IL_3     IL_4
    6   Impaired 6.103215 1.282549 2.671032 3.637051 -8.180721 -3.863233 1.208960
    10  Impaired 4.593226 1.269463 3.476091 3.637051 -7.369791 -4.017384 1.808289
    11  Impaired 6.919778 1.274133 2.154845 4.749337 -7.849364 -4.509860 1.568616
    12  Impaired 3.218759 1.286356 3.593860 3.867347 -8.047190 -3.575551 1.916923
    13  Impaired 4.102821 1.274133 2.876338 5.731246 -7.849364 -4.509860 1.808289
    16  Impaired 4.360856 1.278484 2.776394 5.170380 -7.662778 -4.017384 1.547563
             IL_5       IL_6 IL_6_Receptor     IL_7     IL_8
    6  -0.4004776  0.1856864   -0.51727788 2.776394 1.708270
    10  0.1823216 -1.5342758    0.09668586 2.154845 1.701858
    11  0.1823216 -1.0965412    0.35404039 2.924466 1.719944
    12  0.3364722 -0.3987186    0.09668586 2.924466 1.675557
    13  0.000      0.4223589   -0.53219115 1.564217 1.691393
    16  0.2623643  0.4223589    0.18739989 1.269636 1.705116

The testing dataset is similar, with 13 columns. Only the number of rows differs.

    training1 <- training[, grepl("^IL|^diagnosis", names(training))]
    test1 <- testing[, grepl("^IL|^diagnosis", names(testing))]

    modelFit <- train(training1$diagnosis ~ training1$IL_11 + training1$IL_13 +
                        training1$IL_16 + training1$IL_17E + training1$IL_1alpha +
                        training1$IL_3 + training1$IL_4 + training1$IL_5 +
                        training1$IL_6 + training1$IL_6_Receptor + training1$IL_7 +
                        training1$IL_8,
                      method = "glm", data = training1)

    confusionMatrix(test1$diagnosis, predict(modelFit, test1))

I get this error when I run the above command to get the confusion matrix.

    'newdata' had 82 rows but variables found have 251 rows

I thought this was simple. I train a model using the training dataset and predict using the test dataset and get the accuracy. Am I missing the obvious here?

Thanks,
Mohan
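The row-count message in this thread can be reproduced in base R with plain glm(). The sketch below uses simulated stand-ins for the two data sets (the column names x and y are hypothetical, not the Alzheimer's data) and contrasts a formula written with $ indexing against one that uses bare column names.

```r
# Simulated stand-ins for the training and test sets
set.seed(42)
training <- data.frame(x = rnorm(251))
training$y <- factor(ifelse(training$x + rnorm(251) > 0, "Impaired", "Control"))
testing <- data.frame(x = rnorm(82))

# Formula with $ indexing: the terms refer to the `training` object itself
fit_dollar <- glm(training$y ~ training$x, family = binomial, data = training)
# Formula with bare names: the terms are looked up in whatever data is supplied
fit_names <- glm(y ~ x, family = binomial, data = training)

# predict() finds no variable called `training$x` in `testing`, so it evaluates
# the term against the training rows and warns about the row-count mismatch
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = testing, type = "response"))
pred_names <- predict(fit_names, newdata = testing, type = "response")

length(pred_dollar)  # same as nrow(training)
length(pred_names)   # same as nrow(testing)
```

With the $ form the "test" predictions are really predictions for the training rows, so a confusion matrix against the test labels cannot line up. Writing the formula as diagnosis ~ . with data = training1, as suggested above, avoids this.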
Re: [R] Use of library(X) in the code of library X.
That is legacy code but there was a good reason back then.

caret is written to use parallel processing via the foreach package. There were some cases where the worker processes did not load the required packages (even when I used foreach's .packages argument) so I would do it explicitly. I don't recall which parallel backend had the issue.

The more important lesson is that if you want to understand some R code written by others you'll learn more bad habits than good ones if you examine my packages...

Max

On Fri, Jun 6, 2014 at 2:42 PM, Duncan Murdoch murdoch.dun...@gmail.com wrote:

On 06/06/2014 10:26 AM, Bart Kastermans wrote:

> To improve my R skills I try to understand some R code written by others.
> Mostly I am looking at the code of packages I use. Today I looked at the
> code for the caret package
> http://cran.r-project.org/src/contrib/caret_6.0-30.tar.gz
> in particular at the file R/adaptive.R
>
> This file starts with:
>
>     adaptiveWorkflow <- function(x, y, wts, info, method, ppOpts, ctrl, lev,
>                                  metric, maximize, testing = FALSE, ...) {
>       library(caret)
>       loadNamespace("caret")
>
> From ?library and googling I can't figure out what this code would do.
> Why would you call library(caret) in the caret package?

I don't know that package, and since adaptiveWorkflow is not documented at the user level, I can't tell exactly what the author had in mind. However, code like that could be present for debugging purposes (and is unintentionally present in the CRAN copy), or could be intentional. The library(caret) call has the effect of ensuring that the package is on the search list. (It might have been loaded invisibly by another package.) This is generally considered to be bad form nowadays; packages should function properly without being on the search list.

I can't think of a situation where loadNamespace() would do anything --- it would have been called by library().

Duncan Murdoch
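The distinction between a package being loaded and being on the search list can be seen with any installed package. This is a small base-R sketch using `tools`, chosen only because it ships with R. It does not reproduce the foreach worker issue described above.

```r
# Loading a namespace makes its code available via `::` but does not put the
# package on the search list; library() attaches it (loading first if needed).
loadNamespace("tools")
loaded_only <- isNamespaceLoaded("tools")     # TRUE once the namespace is loaded

library(tools)
attached <- "package:tools" %in% search()     # TRUE once the package is attached
```

A package whose code relies on unqualified calls into another package needs that package attached (or imported), which is one reason an explicit library() call inside worker code can make a difference.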
Re: [R] cforest sampling methods
You might look at the 'bag' function in the caret package. It will not do the subsampling of variables at each split but you can bag a tree and down-sample the data at each iteration. The help page has an example of bagging ctree (although you might want to play with the tree depth a little).

Max

On Wed, Mar 19, 2014 at 3:32 PM, Maggie Makar maggieyma...@gmail.com wrote:

Hi all,

I've been using the randomForest package and I'm trying to make the switch over to party. My problem is that I have an extremely unbalanced outcome (only 1% of the data has a positive outcome), which makes resampling methods necessary.

randomForest has a very useful argument, sampsize, which allows me to use a balanced subsample to build each tree in my forest. Let's say the number of positive cases is 100; my forest would look something like this:

    rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE, sampsize = c(100, 100))

so I use 100 cases and 100 controls to build each individual tree. Can I do the same for cforests? I know I can always upsample but I'd rather not.

I've tried playing around with the weights argument but I'm either not getting it right or it's just the wrong thing to use.

Any advice on how to adapt cforests to datasets with imbalanced outcomes is greatly appreciated...

Thanks!
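The down-sampling step itself is independent of the tree package. The sketch below is base R only: `balanced_bag` is a hypothetical helper (not part of caret, party, or randomForest) that draws the row indices for one bagging iteration with the same number of rows per class, roughly what sampsize = c(100, 100) asks for in the question.

```r
# Hypothetical helper: row indices for one bag, with `n_per_class` rows
# sampled (with replacement) from each class of `y`.
balanced_bag <- function(y, n_per_class) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_per_class, replace = TRUE)]
  }), use.names = FALSE)
}

set.seed(10)
y <- factor(c(rep("pos", 100), rep("neg", 9900)))  # about 1% positives
bag <- balanced_bag(y, n_per_class = 100)
table(y[bag])                                      # 100 rows of each class
```

Each bag's indices could then be used to subset the data before fitting a single tree, with predictions averaged over bags.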
Re: [R] how is the model resample performance calculated by caret?
On Fri, Feb 28, 2014 at 1:13 AM, zhenjiang zech xu zhenjiang...@gmail.com wrote:

> Dear all,
>
> I did a 5-repeat of 10-fold cross validation using partial least square
> regression model provided by caret package. Can anyone tell me how are the
> values in plsTune$resample calculated? Is that predicted on each hold-out
> set using the model which is trained on the rest data with the optimized
> parameter tuned from previous cross validation?

Yes, those values are the performance estimates across each hold-out using the final model. There is an option in trainControl() that will have it return the resamples from all models too.

> So in the following example, firstly, 5-repeat of 10-fold cross validation
> gives 2 for ncomp as the best, and then using ncomp of 2 and the training
> data to build a model and then predict the hold-out data with the model to
> give a RMSE and RSQUARE - is what I am thinking true?

It is.

Max

> plsTune
>
> 524 samples
> 615 predictors
>
> Pre-processing: centered, scaled
> Resampling: Cross-Validation (10 fold, repeated 5 times)
>
> Summary of sample sizes: 472, 472, 471, 471, 471, 471, ...
>
> Resampling results across tuning parameters:
>
>   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
>    1     16.8  0.434      1.47    0.0616
>    2     14.3  0.612      2.21    0.0768
>    3     13.5  0.704      6.33    0.145
>    4     14.6  0.706      9.29    0.163
>    5     15.2  0.703     10.9     0.172
>    6     16.5  0.69      13.4     0.181
>    7     18.4  0.672     17.8     0.194
>    8     20    0.651     20.4     0.199
>    9     20.9  0.634     20.9     0.199
>   10     22.1  0.613     22.1     0.197
>   11     23.3  0.599     23.8     0.198
>   12     24    0.588     24.7     0.198
>   13     24.9  0.572     25.2     0.197
>   14     25.8  0.557     26.2     0.194
>   15     26.2  0.544     25.8     0.191
>   16     26.6  0.532     25.5     0.187
>
> RMSE was used to select the optimal model using the one SE rule.
> The final value used for the model was ncomp = 2.
>
> plsTune$resample
>
>    ncomp     RMSE  Rsquared    Resample
> 1      2 13.61569 0.6349700 Fold06.Rep4
> 2      2 16.02091 0.5808985 Fold05.Rep1
> 3      2 12.59985 0.6008357 Fold03.Rep5
> 4      2 13.20069 0.6296245 Fold02.Rep3
> 5      2 12.43419 0.6560434 Fold04.Rep2
> 6      2 15.36510 0.5954177 Fold04.Rep5
> 7      2 12.70028 0.6894489 Fold03.Rep2
> 8      2 13.34882 0.6468300 Fold09.Rep3
> 9      2 14.80217 0.5575010 Fold08.Rep3
> 10     2 19.03705 0.4907630 Fold05.Rep4
> 11     2 14.26704 0.6579390 Fold10.Rep2
> 12     2 13.79060 0.5806663 Fold05.Rep3
> 13     2 14.83641 0.5918039 Fold05.Rep2
> 14     2 12.48721 0.7011439 Fold01.Rep3
> 15     2 14.98765 0.5866102 Fold07.Rep4
> 16     2 10.88100 0.7597167 Fold06.Rep1
> 17     2 13.60705 0.6321377 Fold08.Rep5
> 18     2 13.42618 0.6136031 Fold08.Rep4
> 19     2 13.26066 0.6784586 Fold07.Rep1
> 20     2 13.20623 0.6812341 Fold03.Rep3
> 21     2 18.54275 0.4404729 Fold08.Rep2
> 22     2 11.80312 0.7177681 Fold05.Rep5
> 23     2 18.56271 0.4661072 Fold03.Rep1
> 24     2 13.54879 0.5850439 Fold10.Rep3
> 25     2 14.10859 0.5994811 Fold06.Rep5
> 26     2 13.68329 0.6701091 Fold01.Rep5
> 27     2 16.12123 0.5401200 Fold10.Rep1
> 28     2 12.92250 0.6917220 Fold06.Rep3
> 29     2 12.94366 0.6400066 Fold06.Rep2
> 30     2 12.39889 0.6790578 Fold01.Rep2
> 31     2 13.48499 0.6759649 Fold01.Rep1
> 32     2 12.52938 0.6728476 Fold03.Rep4
> 33     2 16.43352 0.5795160 Fold09.Rep5
> 34     2 12.53991 0.6550694 Fold09.Rep4
> 35     2 12.78708 0.6304606 Fold08.Rep1
> 36     2 13.97559 0.6655688 Fold04.Rep3
> 37     2 15.31642 0.5124997 Fold09.Rep2
> 38     2 15.24194 0.5324943 Fold09.Rep1
> 39     2 12.90107 0.6318960 Fold04.Rep1
> 40     2 13.59574 0.6277869 Fold01.Rep4
> 41     2 19.73633 0.4154821 Fold07.Rep5
> 42     2 12.03759 0.6537381 Fold02.Rep5
> 43     2 15.47139 0.5597097 Fold02.Rep4
> 44     2 22.55060 0.3816672 Fold07.Rep3
> 45     2 14.57875 0.6269560 Fold07.Rep2
> 46     2 13.02385 0.6395148 Fold02.Rep2
> 47     2 13.81020 0.6116137 Fold02.Rep1
> 48     2 13.46100 0.6200828 Fold04.Rep4
> 49     2 13.95487 0.6709253 Fold10.Rep5
> 50     2 12.65981 0.6606435 Fold10.Rep4
>
> Best,
> Zhenjiang
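As a rough illustration of how per-hold-out values relate to the summary table, the sketch below averages the first five rows of plsTune$resample as printed in the question (the full object has 50 rows). It assumes, as the exchange above implies, that each row of the summary is the mean and standard deviation of the hold-out values for that ncomp. The object names here are made up.

```r
# First five rows of plsTune$resample as shown in the post
resamples <- data.frame(
  ncomp    = 2,
  RMSE     = c(13.61569, 16.02091, 12.59985, 13.20069, 12.43419),
  Rsquared = c(0.6349700, 0.5808985, 0.6008357, 0.6296245, 0.6560434),
  Resample = c("Fold06.Rep4", "Fold05.Rep1", "Fold03.Rep5", "Fold02.Rep3", "Fold04.Rep2")
)

# Mean and spread of the hold-out RMSE for each tuning value
mean_rmse <- tapply(resamples$RMSE, resamples$ncomp, mean)
sd_rmse   <- tapply(resamples$RMSE, resamples$ncomp, sd)
```

The option mentioned in the reply is presumably `returnResamp = "all"` in trainControl(), in which case the same aggregation would produce one row per candidate ncomp rather than only the final one.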
Re: [R] boxcox alternative
Michael,

On Mon, Feb 24, 2014 at 5:51 AM, Michael Haenlein haenl...@escpeurope.eu wrote:

> Dear all,
>
> I am working with a set of variables that are very non-normally
> distributed. To improve the performance of my model, I'm currently applying
> a boxcox transformation to them. While this improves things, the
> performance is still not great.

Are these predictors that you are transforming?

> So my question: Are there any alternatives to boxcox in R? I would need a
> model that estimates the best transformation automatically without input
> from the user since my approach should be flexible enough to deal with any
> kind of distribution. boxcox allows me to do this by picking the lambda
> that leads to the best fit but I wonder whether there are other options out
> there.

If they are predictors, caret has a function called 'preProcess' that might interest you. See:

http://caret.r-forge.r-project.org/preprocess.html#trans

Max
Re: [R] Predictor Importance in Random Forests and bootstrap
I think that the fundamental problem is that you are using the default value of ntree (500). You should always use at least 1500 and more if n or p are large.

Also, this link will give you more up-to-date information on that package and feature selection:

http://caret.r-forge.r-project.org/featureSelection.html

Max

On Tue, Jan 28, 2014 at 5:32 PM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:

Here is a great response I got from SO:

There is an important difference between the two importance measures: MeanDecreaseAccuracy is calculated using out of bag (OOB) data, MeanDecreaseGini is not. For each tree MeanDecreaseAccuracy is calculated on observations not used to form that particular tree. In contrast, MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are. It is calculated using the same data used to fit trees.

When you bootstrap data, you are creating multiple copies of the same observations. Therefore the same observation can be split into two copies, one to form a tree, and one treated as OOB and used to calculate accuracy measures. Therefore, data that randomForest thinks is OOB for MeanDecreaseAccuracy is not necessarily truly OOB in your bootstrap sample, making the estimate of MeanDecreaseAccuracy overly optimistic in the bootstrap iterations. Gini index is immune to this, because it is not relying on evaluating importance on observations different from those used to fit the data.

I suspect what you are trying to do is use the bootstrap to generate inference (p-values/confidence intervals) indicating which variables are important in the sense that they are actually predictive of your outcome. The bootstrap is not appropriate in this context, because Random Forests expects that OOB data is truly OOB and this is important for building the forest in the first place. In general, bootstrap is not universally applicable, and is only useful in cases where it can be shown that the parameter you're estimating has nice asymptotic properties and is not sensitive to ties in the data. A procedure like Random Forest which relies on the availability of OOB data is necessarily sensitive to ties.

You may want to look at the caret package in R, which uses random forest (or one of a set of many other algorithms) inside a cross-validation loop to determine which variables are consistently important. See:

http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf

On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:

Thank you, Bert. I'll definitely ask there. In the meantime I just wanted to ensure that my R code (my function for bootstrap and the bootstrap run) is correct and my abnormal bootstrap results are not a function of my erroneous code.

Thank you!

On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter gunter.ber...@gene.com wrote:

I **think** this kind of methodological issue might be better at SO (stats.stackexchange.com). It's not really about R programming, which is the main focus of this list. And yes, I know they do intersect. Nevertheless...

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

Data is not information. Information is not knowledge. And knowledge is certainly not wisdom.
H. Gilbert Welch

On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:

Hello!

Below, I:

1. Create a data set with a bunch of factors. All of them are predictors and 'y' is the dependent variable.
2. I run a classification Random Forests run with predictor importance. I look at 2 measures of importance - MeanDecreaseAccuracy and MeanDecreaseGini
3. I run 2 bootstrap runs for 2 Random Forests measures of importance mentioned above.

Question: Could anyone please explain why I am getting such a huge positive bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri

    # Creating a data set:

    N <- 1000
    myset1 <- c(1,2,3,4,5)
    probs1a <- c(.05,.10,.15,.40,.30)
    probs1b <- c(.05,.15,.10,.30,.40)
    probs1c <- c(.05,.05,.10,.15,.65)
    myset2 <- c(1,2,3,4,5,6,7)
    probs2a <- c(.02,.03,.10,.15,.20,.30,.20)
    probs2b <- c(.02,.03,.10,.15,.20,.20,.30)
    probs2c <- c(.02,.03,.10,.10,.10,.25,.40)
    myset.y <- c(1,2)
    probs.y <- c(.65,.30)

    set.seed(1)
    y <- as.factor(sample(myset.y, N, replace = TRUE, probs.y))
    set.seed(2)
    a <- as.factor(sample(myset1, N, replace = TRUE, probs1a))
    set.seed(3)
    b <- as.factor(sample(myset1, N, replace = TRUE, probs1b))
    set.seed(4)
    c <- as.factor(sample(myset1, N, replace = TRUE, probs1c))
    set.seed(5)
    d <- as.factor(sample(myset2, N, replace = TRUE, probs2a))
    set.seed(6)
    e <- as.factor(sample(myset2, N, replace = TRUE, probs2b))
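The point in the quoted answer about duplicated observations can be illustrated without fitting a forest. This base-R sketch (all names are made up) draws an outer bootstrap sample, then mimics the inner resample a forest would draw for one tree, and counts how many nominally out-of-bag rows are copies of observations that are also in-bag.

```r
set.seed(123)
n <- 1000
boot_ids <- sample.int(n, n, replace = TRUE)    # outer bootstrap: original row ids, with repeats

# Inner resample, as a forest would draw for one tree from the bootstrapped data
in_bag_pos <- unique(sample.int(n, n, replace = TRUE))
oob_pos    <- setdiff(seq_len(n), in_bag_pos)

# An "OOB" row is contaminated if the same original observation is also in-bag
leaked <- boot_ids[oob_pos] %in% boot_ids[in_bag_pos]
mean(leaked)                                    # share of nominally OOB rows seen during fitting
```

Whenever that share is well above zero, accuracy measured on the "OOB" rows is partly training-set accuracy, which is consistent with the optimistic MeanDecreaseAccuracy described in the thread.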
Re: [R] R crashes with memory errors on a 256GB machine (and system shows only 60GB usage)
Describing the problem would help a lot more. For example, if you were using some of the parallel processing options in R, this can make extra copies of objects and drive memory usage up very quickly.

Max

On Thu, Jan 2, 2014 at 3:35 PM, Ben Bolker bbol...@gmail.com wrote:

Xebar Saram zeltakc at gmail.com writes:

> Hi All,
>
> I have a terrible issue I can't seem to debug which is halting my work
> completely. I have R 3.02 installed on a linux machine (arch linux-latest)
> which I built specifically for running high memory use models. The system
> is a 16 core, 256 GB RAM machine. It worked well at the start but in
> recent days I keep getting errors and crashes regarding memory use, such
> as "cannot create vector size of XXX", "not enough memory" etc.
>
> When looking at top (linux system monitor) I see I barely scrape 60 GB
> of RAM (out of 256GB).
>
> I really don't know how to debug this and my whole work is halted due to
> this, so any help would be greatly appreciated.

I'm very sympathetic, but it will be almost impossible to debug this sort of a problem remotely, without a reproducible example. The only guess that I can make, if you *really* are running *exactly* the same code as you previously ran successfully, is that you might have some very large objects hidden away in a saved workspace in a .RData file that's being loaded automatically ...

I would check whether gc(), memory.profile(), etc. give sensible results in a clean R session (R --vanilla).

Ben Bolker
Re: [R] Variable importance - ANN
If you are using the nnet package, the caret package has a variable importance method based on Gevrey, M., Dimopoulos, I., Lek, S. (2003). Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3), 249-264. It is based on the estimated weights.

Max

On Wed, Dec 4, 2013 at 6:41 AM, Giulia Di Lauro giulia.dila...@gmail.com wrote:

Hi everybody,

I created a neural network for a regression analysis with package ANN, but now I need to know the significance of each predictor variable in explaining the dependent variable. I thought of analyzing the weights, but I don't know how to do it.

Thanks in advance,
Giulia Di Lauro.
Re: [R] Inconsistent results between caret+kernlab versions
Andrew,

> What I still don't quite understand is which accuracy values from train() I
> should trust: those using classProbs=T or classProbs=F?

It depends on whether you need the class probabilities and class predictions to match (which they would if classProbs = TRUE).

Another option is to use a model where this discrepancy does not exist.

> train often crashes with 'memory map' errors!)?

I've never seen that. You should describe it more.

Max
Re: [R] Inconsistent results between caret+kernlab versions
Or not!

The issue is with kernlab.

Background: SVM models do not naturally produce class probabilities. A secondary model (via Platt) is fit to the raw model output and a logistic function is used to translate the raw SVM output to probability-like numbers (i.e. sum to one, between 0 and 1). In ksvm(), you need to use the option prob.model = TRUE to get that second model.

I discovered some time ago that there can be a discrepancy between the predicted classes that naturally come from the SVM model and those derived by using the class associated with the largest class probability. This is most likely due to natural error in the secondary probability model and should not be unexpected.

That is the case for your data. If you use the same tuning parameters as those suggested by train() and go straight to ksvm():

    > newSVM <- ksvm(x = as.matrix(df[,-1]),
    +                y = df[,1],
    +                kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
    +                C = svm.m1$bestTune$.C,
    +                prob.model = TRUE)
    >
    > predict(newSVM, df[43,-1])
    [1] O32078
    10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
    > predict(newSVM, df[43,-1], type = "probabilities")
             O27479     O31403    O32057    O32059     O32060    O32078
    [1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
             O32089     O32663     O32668     O32676
    [1,] 0.04890477 0.05210836 0.09838892 0.07284396

Note that, based on the probability model, the class with the largest probability is O32057 (p = 0.24) while the basic SVM model predicts O32078 (p = 0.16).

Somebody (maybe me) saw this discrepancy and that led me to follow this rule: if prob.model = TRUE, use the class with the maximum probability; otherwise use the class prediction from ksvm(). Therefore:

    > predict(svm.m1, df[43,-1])
    [1] O32057
    10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676

That change occurred between the two caret versions that you tested with.

(On a side note, this can also occur with ksvm() and rpart() if cost-sensitive training is used because the class designation takes into account the costs but the class probability predictions do not. I alerted both package maintainers to the issue some time ago.)

HTH,

Max

On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn mxk...@gmail.com wrote:

I've looked into this a bit and the issue seems to be with caret. I've been looking at the svn check-ins and nothing stands out to me as the issue so far. The final models that are generated are the same and I'll try to figure out the difference.

Two small notes:

1) you should set the seed to ensure reproducibility.
2) you really shouldn't use character strings with all numbers as factor levels with caret when you want class probabilities. It should give you a warning about this.

Max

On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby andrewdi...@mac.com wrote:

I'm using caret to assess classifier performance (and it's great!). However, I've found that my results differ between R2.* and R3.* - reported accuracies are reduced dramatically. I suspect that a code change to kernlab ksvm may be responsible (see version 5.16-24 here: http://cran.r-project.org/web/packages/caret/news.html). I get very different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + kernlab_0.9-19 (see below).

Can anyone please shed any light on this?

Thanks very much!

    ### To replicate:

    require(repmis)  # For downloading from https
    df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', sep=',')
    require(caret)
    svm.m1 <- train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tunelength=5,trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
    svm.m1
    sessionInfo()

    ### Results - R2.15.2

    > svm.m1
    1241 samples
       7 predictors
      10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’

    No pre-processing
    Resampling: Cross-Validation (10 fold, repeated 10 times)

    Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...

    Resampling results across tuning parameters:

      C     Accuracy  Kappa  Accuracy SD  Kappa SD
      0.25  0.684     0.63   0.0353       0.0416
      0.5   0.729     0.685  0.0379       0.0445
      1     0.756     0.716  0.0357       0.0418

    Tuning parameter ‘sigma’ was held constant at a value of 0.247
    Kappa was used to select the optimal model using the largest value.
    The final values used for the model were C = 1 and sigma = 0.247.

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-17 repmis_0.2.4 caret_5.15-61 reshape2_1.2.2 plyr_1.8 lattice_0.20-10 foreach_1.4.0 cluster_1.14.3

    loaded
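The discrepancy described in this thread can be restated with nothing but the numbers printed in it. The sketch below copies the class probabilities reported for row 43 and compares the maximum-probability class with the class that ksvm() itself returned. The variable names are made up for illustration.

```r
# Class probabilities for row 43, copied from the
# predict(newSVM, ..., type = "probabilities") output in the thread
probs <- c(O27479 = 0.08791826, O31403 = 0.05911645, O32057 = 0.2424997,
           O32059 = 0.1036943,  O32060 = 0.06968587, O32078 = 0.1648394,
           O32089 = 0.04890477, O32663 = 0.05210836, O32668 = 0.09838892,
           O32676 = 0.07284396)

class_from_probs <- names(probs)[which.max(probs)]    # class with the largest probability
class_from_svm   <- "O32078"                          # what predict(newSVM, ...) returned in the post
agree <- identical(class_from_probs, class_from_svm)  # do the two rules pick the same class?
```

Here the two rules disagree, which matches the explanation that caret's predicted class follows the probabilities when the probability model is requested.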
Re: [R] C50 Node Assignment
There is a sub-object called 'rules' that has the output of C5.0 for this model:

    > library(C50)
    > mod <- C5.0(Species ~ ., data = iris, rules = TRUE)
    > cat(mod$rules)
    id="See5/C5.0 2.07 GPL Edition 2013-11-09"
    entries="1"
    rules="4" default="setosa"
    conds="1" cover="50" ok="50" lift="2.94231" class="setosa"
    type="2" att="Petal.Length" cut="1.9" result="<"
    conds="3" cover="48" ok="47" lift="2.88" class="versicolor"
    type="2" att="Petal.Length" cut="1.9" result=">"
    type="2" att="Petal.Length" cut="4.901" result="<"
    type="2" att="Petal.Width" cut="1.7" result="<"
    conds="1" cover="46" ok="45" lift="2.875" class="virginica"
    type="2" att="Petal.Width" cut="1.7" result=">"
    conds="1" cover="46" ok="44" lift="2.8125" class="virginica"
    type="2" att="Petal.Length" cut="4.901" result=">"

You would either have to parse this or parse the summary results:

    > summary(mod)

    Call:
    C5.0.formula(formula = Species ~ ., data = iris, rules = TRUE)

    <snip>

    Rules:

    Rule 1: (50, lift 2.9)
        Petal.Length <= 1.9
        -> class setosa [0.981]

    Rule 2: (48/1, lift 2.9)
        Petal.Length > 1.9
        Petal.Length <= 4.9
        Petal.Width <= 1.7
        -> class versicolor [0.960]

    <snip>

Max

On Sat, Nov 9, 2013 at 1:11 PM, Carl Witthoft c...@witthoft.com wrote:

Just to clarify: I'm guessing the OP is referring to the CRAN package C50 here. A quick skim suggests the rules are a list element of a C5.0-class object, so maybe that's where to start?

David Winsemius wrote:

In my role as a moderator I am attempting to bypass the automatic mail filters that are blocking this posting. Please reply to the list and to:

=
Kevin Shaney <kevin.shaney@...>

C50 Node Assignment

I am using C50 to classify individuals into 5 groups / categories (factor variable). The tree / set of rules has 10 rules for classification. I am trying to extract the RULE for which each individual qualifies (a number between 1 and 10), and cannot figure out how to do so. I can extract the predicted group and predicted group probability, but not the RULE to which an individual qualifies. Please let me know if you can help!
Kevin
=

-- David Winsemius
Alameda, CA, USA
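The "parse this" suggestion above could be sketched roughly as follows. This is not part of the C50 package: the helper get_field() and the two result tables are hypothetical, and the rule text is pasted in as a literal string (on the assumption that the raw output consists of name="value" pairs) so the sketch runs without C50 installed.

```r
# Rule text as printed by cat(mod$rules) for the iris model (copied as a literal).
rules_txt <- paste(
  'id="See5/C5.0 2.07 GPL Edition 2013-11-09"',
  'entries="1"',
  'rules="4" default="setosa"',
  'conds="1" cover="50" ok="50" lift="2.94231" class="setosa"',
  'type="2" att="Petal.Length" cut="1.9" result="<"',
  'conds="3" cover="48" ok="47" lift="2.88" class="versicolor"',
  'type="2" att="Petal.Length" cut="1.9" result=">"',
  'type="2" att="Petal.Length" cut="4.901" result="<"',
  'type="2" att="Petal.Width" cut="1.7" result="<"',
  'conds="1" cover="46" ok="45" lift="2.875" class="virginica"',
  'type="2" att="Petal.Width" cut="1.7" result=">"',
  'conds="1" cover="46" ok="44" lift="2.8125" class="virginica"',
  'type="2" att="Petal.Length" cut="4.901" result=">"',
  sep = "\n"
)

rule_lines <- strsplit(rules_txt, "\n", fixed = TRUE)[[1]]

# Hypothetical helper: pull the value of name=... (quotes optional) out of a line.
get_field <- function(x, name) {
  sub(paste0('^.*\\b', name, '="?([^" ]*)"?.*$'), "\\1", x, perl = TRUE)
}

is_header <- grepl("^conds=", rule_lines)  # one header line per rule
is_cond   <- grepl("^type=", rule_lines)   # one line per condition within a rule

headers <- rule_lines[is_header]
rules <- data.frame(
  rule    = seq_along(headers),
  n_conds = as.integer(get_field(headers, "conds")),
  cover   = as.numeric(get_field(headers, "cover")),
  class   = get_field(headers, "class"),
  stringsAsFactors = FALSE
)

# Each condition belongs to the most recent header line above it.
conds <- data.frame(
  rule   = cumsum(is_header)[is_cond],
  att    = get_field(rule_lines[is_cond], "att"),
  cut    = as.numeric(get_field(rule_lines[is_cond], "cut")),
  result = get_field(rule_lines[is_cond], "result"),
  stringsAsFactors = FALSE
)

print(rules)
print(conds)
```

With the conditions in a table, deciding which rules an individual satisfies becomes a matter of evaluating each row against that individual's predictor values; how C5.0 itself resolves overlapping rules is not covered by this sketch.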
Re: [R] Cross validation in R
> How do I make a loop so that the process can be repeated several times, producing a random ROC curve and area-under-the-ROC-curve value each time?

Use the caret package: http://caret.r-forge.r-project.org/

Max
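For readers who want to see what such a loop amounts to, here is a rough base-R sketch of repeated k-fold cross-validation with an AUC computed for each repeat. It is not caret's implementation; caret packages the same idea behind trainControl(method = "repeatedcv") and train(). The simulated data, the logistic model, and the helper names auc() and cv_auc() are all assumptions made for illustration.

```r
set.seed(42)

# Simulated two-class data (hypothetical; stands in for the poster's data set).
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- rbinom(n, 1, plogis(1.5 * x1 - x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Area under the ROC curve via the rank (Mann-Whitney) formula.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# One round of k-fold CV: every row is predicted by a model that never saw it.
cv_auc <- function(data, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (i in seq_len(k)) {
    fit <- glm(y ~ x1 + x2, data = data[fold != i, ], family = binomial)
    pred[fold == i] <- predict(fit, newdata = data[fold == i, ], type = "response")
  }
  auc(pred, data$y)
}

# Repeat the whole procedure with fresh random folds each time.
aucs <- replicate(5, cv_auc(dat))
print(round(aucs, 3))
print(mean(aucs))
```

Each repeat draws new folds, so the spread of aucs gives a feel for how much the estimate depends on the particular split.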
Re: [R] Error running caret's gbm train function with new version of caret
Katrina,

I made some changes to accommodate gbm's new feature for 3+ categories, then had to harmonize how gbm and caret work together. I have a new version of caret that is not released yet (maybe within a month), but you should be able to get it from:

    install.packages("caret", repos = "http://R-Forge.R-project.org")

You may also need to upgrade gbm. That package page is:

https://code.google.com/p/gradientboostedmodels/downloads/list

Let me know if you have any issues.

Max

On Sat, May 4, 2013 at 5:33 PM, Katrina Bennett kebenn...@alaska.edu wrote:

I am running caret for model exploration. I developed my code a number of months ago and I've been running it with no issues. Recently, however, I updated my version of caret, and now I am getting a new error. I'm wondering if this is due to the new release. The error occurs when I am running GBM.

    print(paste("calculating GBM for", i))
    #gbm runs over and over again
    set.seed(1)
    trainModelGBM <- train(trainClass3, trainAsym, "gbm", metric="RMSE", tuneLength = 5, trControl = con)

The error I am getting is at the end of the run once all the iterations have been processed:

    Error in { : task 1 failed - arguments imply differing number of rows: 5, 121

trainClass3 and trainAsym have 311 values in them. I'm using 5 variables in my matrix. I'm not sure where the 117 is coming from. I found solutions online suggesting that updating glmnet and Matrix and doing something with cv.folds would work. None of these solutions have worked for me.

Here is my R session info.

    R version 2.15.1 (2012-06-22)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    caret version 5.15-61

Thank you,

Katrina
Re: [R] C50 package in R
There isn't much out there. Quinlan didn't open source the code until about a year ago. I've been through the code line by line and we have a fairly descriptive summary of the model in our book (that's almost out):

http://appliedpredictivemodeling.com/

I will say that the pruning is mostly the same as described in Quinlan's C4.5 book. The big differences between C4.5 and C5.0 are boosting and winnowing. The former is mechanically very different from gradient boosting machines and is more similar to the re-weighting approach of the original adaboost algorithm (but is still pretty different).

I've submitted a talk on C5.0 for this year's UseR! conference. If there is enough time I will be able to go through some of the technical details.

Two other related notes:

- the J48 implementation in Weka lacks one or two of C4.5's features, which makes the results substantially different from what C4.5 would have produced. The differences are significant enough that Quinlan asked us to refer to the results of that function as J48 and not C4.5. Using C5.0 with a single tree is much more similar to C4.5 than J48 is.
- the differences between model trees and Cubist are also substantial and largely undocumented.

HTH,

Max

On Thu, Apr 25, 2013 at 9:40 AM, Indrajit Sen Gupta indrajit...@rediffmail.com wrote:

Hi All,

I am trying to use the C50 package to build classification trees in R. Unfortunately there is not enough documentation around its use. Can anyone explain to me how to prune the decision trees?

Regards,

Indrajit
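On the practical side of the pruning question, the C50 package exposes a few pruning-related settings through C5.0Control(). The sketch below only shows where those settings go; it says nothing about how the algorithm prunes internally, and the values chosen for CF and minCases are arbitrary assumptions.

```r
library(C50)

# CF is the confidence factor (smaller values prune more aggressively),
# minCases is the smallest number of samples allowed in a split's branches,
# noGlobalPruning = TRUE would skip the final global pruning pass.
ctrl <- C5.0Control(CF = 0.10, minCases = 5, noGlobalPruning = FALSE)

mod_default <- C5.0(Species ~ ., data = iris)                  # package defaults
mod_pruned  <- C5.0(Species ~ ., data = iris, control = ctrl)  # heavier pruning

# Compare the two printed trees to see the effect of the settings.
summary(mod_default)
summary(mod_pruned)
```

On a data set as small and clean as iris the two trees may well be identical; the settings matter more on noisy data with many predictors.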
Re: [R] odfWeave: Some questions about potential formatting options
Paul,

#1: I've never tried, but you might be able to escape the required tags in your text (e.g. in html you could write out the <b> in your text).

#3: Which output? Is this in text?

#2: It may be possible and maybe easy to implement. So if you want to dig into it, have at it. For me, I'm completely buried for the foreseeable future and won't be able to pay much attention to it. To be honest, odfWeave has been fairly neglected by me and lately I've had thoughts of orphaning the package :-/

Thanks,

Max

On Tue, Apr 16, 2013 at 1:15 PM, Paul Miller pjmiller...@yahoo.com wrote:

Hi Milan and Max,

Thanks to each of you for your reply to my post. Thus far, I've managed to find answers to some of the questions I asked initially. I am now able to control the justification of the leftmost column in my tables, as well as to add borders to the top and bottom. I also downloaded Milan's revised version of odfWeave at the link below, and found that it does a nice job of controlling column widths.

http://nalimilan.perso.neuf.fr/transfert/odfWeave.tar.gz

There are some other things I'm still struggling with though.

1. Is it possible to get odfTableCaption and odfFigureCaption to make the titles they produce bold? I understand it might be possible to accomplish this by changing something in the styles but am not sure what. If someone can give me a hint, I can likely do the rest.

2. Is there any way to get odfFigureCaption to put titles at the top of the figure instead of the bottom? I've noticed that odfTableCaption is able to do this but apparently not odfFigureCaption.

3. Is it possible to add special characters to the output? Below is a sample Kaplan-Meier analysis. There's a footnote in there that reads Note: X2(1) = xx.xx, p = .. Is there any way to make the X a lowercase Chi and to superscript the 2? I did quite a bit of digging on this topic. It sounds like it might be difficult, especially if one is using Windows as I am.

Thanks,

Paul

    ## Get data ##

    # Load packages
    require(survival)
    require(MASS)

    # Sample analysis
    attach(gehan)
    gehan.surv <- survfit(Surv(time, cens) ~ treat, data = gehan, conf.type = "log-log")
    print(gehan.surv)
    survTable <- summary(gehan.surv)$table
    survTable <- data.frame(Treatment = rownames(survTable), survTable, row.names = NULL)
    survTable <- subset(survTable, select = -c(records, n.max))

    ## odfWeave ##

    # Load odfWeave
    require(odfWeave)

    # Modify StyleDefs
    currentDefs <- getStyleDefs()
    currentDefs$firstColumn$type <- "Table Column"
    currentDefs$firstColumn$columnWidth <- "5 cm"
    currentDefs$secondColumn$type <- "Table Column"
    currentDefs$secondColumn$columnWidth <- "3 cm"
    currentDefs$ArialCenteredBold$fontSize <- "10pt"
    currentDefs$ArialNormal$fontSize <- "10pt"
    currentDefs$ArialCentered$fontSize <- "10pt"
    currentDefs$ArialHighlight$fontSize <- "10pt"
    currentDefs$ArialLeftBold <- currentDefs$ArialCenteredBold
    currentDefs$ArialLeftBold$textAlign <- "left"
    currentDefs$cgroupBorder <- currentDefs$lowerBorder
    currentDefs$cgroupBorder$topBorder <- "0.0007in solid #00"
    setStyleDefs(currentDefs)

    # Modify ImageDefs
    imageDefs <- getImageDefs()
    imageDefs$dispWidth <- 5.5
    imageDefs$dispHeight <- 5.5
    setImageDefs(imageDefs)

    # Modify Styles
    currentStyles <- getStyles()
    currentStyles$figureFrame <- "frameWithBorders"
    setStyles(currentStyles)

    # Set odt table styles
    tableStyles <- tableStyles(survTable, useRowNames = FALSE, header = "")
    tableStyles$headerCell[1,] <- "cgroupBorder"
    tableStyles$header[,1] <- "ArialLeftBold"
    tableStyles$text[,1] <- "ArialNormal"
    tableStyles$cell[2,] <- "lowerBorder"

    # Weave odt source file
    fp <- "N:/Studies/HCRPC1211/Report/odfWeaveTest/"
    inFile <- paste(fp, "testWeaveIn.odt", sep="")
    outFile <- paste(fp, "testWeaveOut.odt", sep="")
    odfWeave(inFile, outFile)

    ## Contents of .odt source file ##

    Here is a sample Kaplan-Meier table.

    <<testKMTable, echo=FALSE, results = xml>>=
    odfTableCaption("A Sample Kaplan-Meier Analysis Table")
    odfTable(survTable, useRowNames = FALSE, digits = 3,
             colnames = c("Treatment", "Number", "Events", "Median", "95% LCL", "95% UCL"),
             colStyles = c("firstColumn", "secondColumn", "secondColumn", "secondColumn", "secondColumn", "secondColumn"),
             styles = tableStyles)
    odfCat("Note: X2(1) = xx.xx, p = .")
    @

    Here is a sample Kaplan-Meier graph.

    <<testKMFig, echo=FALSE, fig = TRUE>>=
    odfFigureCaption("A Sample Kaplan-Meier Analysis Graph", label = "Figure")
    plot(gehan.surv, xlab = "Time", ylab = "Survivorship")
    @
Re: [R] Parallelizing GBM
See this:

https://code.google.com/p/gradientboostedmodels/issues/detail?id=3

and this:

https://code.google.com/p/gradientboostedmodels/source/browse/?name=parallel

Max

On Sun, Mar 24, 2013 at 7:31 AM, Lorenzo Isella lorenzo.ise...@gmail.com wrote:

Dear All,

I am far from being a guru about parallel programming. Most of the time, I rely on randomForest for data mining large datasets. I would also like to give the gradient boosted methods in GBM a try, but I need parallelization. I normally rely on gbm.fit for speed reasons, and I usually call it this way:

    gbm_model <- gbm.fit(trainRF, prices_train,
                         offset = NULL,
                         misc = NULL,
                         distribution = "multinomial",
                         w = NULL,
                         var.monotone = NULL,
                         n.trees = 50,
                         interaction.depth = 5,
                         n.minobsinnode = 10,
                         shrinkage = 0.001,
                         bag.fraction = 0.5,
                         nTrain = (n_train/2),
                         keep.data = FALSE,
                         verbose = TRUE,
                         var.names = NULL,
                         response.name = NULL)

Does anybody know an easy way to parallelize the model (in this case it means simply having 4 cores on the same machine working on the problem)? Any suggestion is welcome.

Cheers

Lorenzo
Re: [R] CARET and NNET fail to train a model when the input is high dimensional
James,

I did a fresh install from CRAN to get caret_5.15-61 and ran your code with method.name = "nnet" and grid.len = 3.

I don't get an error, although there were issues:

    In nominalTrainWorkflow(dat = trainData, info = trainInfo, ... :
      There were missing values in resampled performance measures.

The results had:

    Resampling results across tuning parameters:

      size  decay  ROC    Sens   Spec   ROC SD   Sens SD  Spec SD
      1     0      0.521  0.52   0.521  0.0148   0.0312   0.00901
      1     1e-04  0.513  0.528  0.498  0.00616  0.00386  0.00552
      1     0.1    0.515  0.522  0.514  0.0169   0.0284   0.0426
      3     0      NaN    NaN    NaN    NA       NA       NA
      3     1e-04  NaN    NaN    NaN    NA       NA       NA
      3     0.1    NaN    NaN    NaN    NA       NA       NA
      5     0      NaN    NaN    NaN    NA       NA       NA
      5     1e-04  NaN    NaN    NaN    NA       NA       NA
      5     0.1    NaN    NaN    NaN    NA       NA       NA

To test more, I ran:

    > test <- nnet(trX, trY, size = 3, decay = 0)
    Error in nnet.default(trX, trY, size = 3, decay = 0) :
      too many (2107) weights

So, you need to pass MaxNWts to nnet() with a value that lets you fit the model. Off the top of my head, you could use something like:

    MaxNWts = length(levels(trY))*(max(my.grid$.size) * (nCol + 1) + max(my.grid$.size) + 1)

Also, this is one of the methods for getting help (the other is to just email me). I try to keep up on stack exchange too.

Max

On Tue, Mar 5, 2013 at 9:47 PM, James Jong ribonucle...@gmail.com wrote:

The following code fails to train a nnet model on a random dataset using caret:

    nR <- 700
    nCol <- 2000
    myCtrl <- trainControl(method="cv", number=3, preProcOptions=NULL, classProbs = TRUE, summaryFunction = twoClassSummary)
    trX <- data.frame(replicate(nR, rnorm(nCol)))
    trY <- runif(1)*trX[,1]*trX[,2]^2+runif(1)*trX[,3]/trX[,4]
    trY <- as.factor(ifelse(sign(trY)>0,'X1','X0'))
    my.grid <- createGrid(method.name, grid.len, data=trX)
    my.model <- train(trX, trY, method=method.name, trace=FALSE, trControl=myCtrl, tuneGrid=my.grid, metric="ROC")
    print("Done")

The error I get is:

    task 2 failed - arguments imply differing number of rows: 1334, 666

However, everything works if I reduce nR to, say, 20. Any thoughts on what may be causing this? Is there a place where I could report this bug other than this mailing list?

Here is my session info:

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] nnet_7.3-5 pROC_1.5.4 caret_5.15-052 foreach_1.4.0
    [5] cluster_1.14.3 plyr_1.8 reshape2_1.2.2 lattice_0.20-13

    loaded via a namespace (and not attached):
    [1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2 iterators_1.0.6
    [5] stringr_0.6.2 tools_2.15.2

Thanks,

James
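The "too many (2107) weights" message can be reproduced with a little arithmetic. This is a sketch under stated assumptions: a single hidden layer with bias units and no skip-layer connections, one output unit for the two-class outcome, and 700 predictors (replicate(nR, rnorm(nCol)) yields 700 columns). The helper nnet_weights() is hypothetical, not part of nnet or caret.

```r
# Weights in a single-hidden-layer network with bias units:
#   (p + 1) * size    from the inputs (plus bias) into the hidden layer
#   (size + 1) * n_out from the hidden layer (plus bias) to the outputs
nnet_weights <- function(p, size, n_out = 1) {
  (p + 1) * size + (size + 1) * n_out
}

p     <- 700          # number of predictor columns in trX
sizes <- c(1, 3, 5)   # the hidden-layer sizes in the tuning grid above

n_wts <- nnet_weights(p, sizes)
print(data.frame(size = sizes, weights = n_wts, fits_default = n_wts <= 1000))

# Smallest MaxNWts that accommodates every size in the grid.
max_wts <- max(n_wts)
print(max_wts)
```

nnet() refuses to fit when the count exceeds MaxNWts, whose default is 1000, which would explain why only the size = 1 rows produced results. Since train() forwards extra arguments to the underlying model function, adding MaxNWts = max_wts to the train() call should be enough to let the larger networks fit.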
Re: [R] caret pls model statistics
That's the most common formula, but not the only one. See

Kvålseth, T. (1985). Cautionary note about $R^2$. *American Statistician*, *39*(4), 279-285.

Traditionally, the symbol 'R' is used for the Pearson correlation coefficient and one way to calculate R^2 is... R^2.

Max

On Sun, Mar 3, 2013 at 3:16 PM, Charles Determan Jr deter...@umn.edu wrote:

I was under the impression that in PLS analysis, R2 was calculated by 1 - (Residual sum of squares) / (Sum of squares). Is this still what you are referring to? I am aware of the linear R2, which is how well two variables are correlated, but the prior equation seems different to me. Could you explain if this is the same concept?

Charles

On Sun, Mar 3, 2013 at 12:46 PM, Max Kuhn mxk...@gmail.com wrote:

> Is there some literature that you make that statement?

No, but there isn't literature on changing a lightbulb with a duck either.

> Are these papers incorrect in using these statistics?

Definitely, if they convert 3+ categories to integers (but there are specialized R^2 metrics for binary classification models). Otherwise, they are just using an ill-suited score.

How would you explain such an R^2 value to someone? R^2 is a function of the correlation between two random variables. For two classes, one of them is binary. What does it mean?

Historically, models rooted in computer science (e.g. neural networks) used RMSE or SSE to fit models with binary outcomes and that *can* work well.

However, I don't think that communicating R^2 is effective. Other metrics (e.g. accuracy, Kappa, area under the ROC curve, etc.) are designed to measure the ability of a model to classify and work well. With 3+ categories, I tend to use Kappa.

Max

On Sun, Mar 3, 2013 at 10:53 AM, Charles Determan Jr deter...@umn.edu wrote:

Thank you for your response Max. Is there some literature that you make that statement? I am confused as I have seen many publications that contain R^2 and Q^2 following PLSDA analysis.
The analysis is usually to discriminate groups (i.e. classification). Are these papers incorrect in using these statistics?

Regards,

Charles

On Sat, Mar 2, 2013 at 10:39 PM, Max Kuhn mxk...@gmail.com wrote:

Charles,

You should not be treating the classes as numeric (is virginica really three times setosa?). Q^2 and/or R^2 are not appropriate for classification.

Max

On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr deter...@umn.edu wrote:

I have discovered one of my errors. The timematrix was unnecessary and an unfortunate habit I brought from another package. The following provides the same R2 values, as it should; however, I still don't know how to retrieve Q2 values. Any insight would again be appreciated:

    library(caret)
    library(pls)
    data(iris)
    #needed to convert to numeric in order to do regression
    #I don't fully understand this but if I left as a factor I would get an error following the summary function
    iris$Species=as.numeric(iris$Species)
    inTrain1=createDataPartition(y=iris$Species, p=.75, list=FALSE)
    training1=iris[inTrain1,]
    testing1=iris[-inTrain1,]
    ctrl1=trainControl(method="cv", number=10)
    plsFit2=train(Species~., data=training1, method="pls", trControl=ctrl1, metric="Rsquared", preProc=c("scale"))
    data(iris)
    training1=iris[inTrain1,]
    datvars=training1[,1:4]
    dat.sc=scale(datvars)
    pls.dat=plsr(as.numeric(training1$Species)~dat.sc, ncomp=3, method="oscorespls", data=training1)
    x=crossval(pls.dat, segments=10)
    summary(x)
    summary(plsFit2)

Regards,

Charles

On Sat, Mar 2, 2013 at 3:55 PM, Charles Determan Jr deter...@umn.edu wrote:

Greetings,

I have been exploring the use of the caret package to conduct some plsda modeling. Previously, I have come across methods that result in an R2 and Q2 for the model. Using the 'iris' data set, I wanted to see if I could accomplish this with the caret package. I use the following code:

    library(caret)
    data(iris)
    #needed to convert to numeric in order to do regression
    #I don't fully understand this but if I left as a factor I would get an error following the summary function
    iris$Species=as.numeric(iris$Species)
    inTrain1=createDataPartition(y=iris$Species, p=.75, list=FALSE)
    training1=iris[inTrain1,]
    testing1=iris[-inTrain1,]
    ctrl1=trainControl(method="cv", number=10)
    plsFit2=train(Species~., data=training1, method="pls", trControl=ctrl1, metric="Rsquared", preProc=c("scale"))
    data(iris)
    training1=iris[inTrain1,]
    datvars=training1[,1:4]
    dat.sc=scale(datvars)
    n=nrow(dat.sc)
    dat.indices=seq(1,n)
    timematrix=with(training1, classvec2classmat(Species[dat.indices]))
    pls.dat=plsr(timematrix ~ dat.sc, ncomp=3, method="oscorespls", data=training1)
    x=crossval(pls.dat, segments=10)
    summary(x)
    summary(plsFit2)

I see two different R2 values and I cannot figure out
Re: [R] caret pls model statistics
Charles,

You should not be treating the classes as numeric (is virginica really three times setosa?). Q^2 and/or R^2 are not appropriate for classification.

Max

On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr deter...@umn.edu wrote:

I have discovered one of my errors. The timematrix was unnecessary and an unfortunate habit I brought from another package. The following provides the same R2 values, as it should; however, I still don't know how to retrieve Q2 values. Any insight would again be appreciated:

    library(caret)
    library(pls)
    data(iris)
    #needed to convert to numeric in order to do regression
    #I don't fully understand this but if I left as a factor I would get an error following the summary function
    iris$Species=as.numeric(iris$Species)
    inTrain1=createDataPartition(y=iris$Species, p=.75, list=FALSE)
    training1=iris[inTrain1,]
    testing1=iris[-inTrain1,]
    ctrl1=trainControl(method="cv", number=10)
    plsFit2=train(Species~., data=training1, method="pls", trControl=ctrl1, metric="Rsquared", preProc=c("scale"))
    data(iris)
    training1=iris[inTrain1,]
    datvars=training1[,1:4]
    dat.sc=scale(datvars)
    pls.dat=plsr(as.numeric(training1$Species)~dat.sc, ncomp=3, method="oscorespls", data=training1)
    x=crossval(pls.dat, segments=10)
    summary(x)
    summary(plsFit2)

Regards,

Charles

On Sat, Mar 2, 2013 at 3:55 PM, Charles Determan Jr deter...@umn.edu wrote:

Greetings,

I have been exploring the use of the caret package to conduct some plsda modeling. Previously, I have come across methods that result in an R2 and Q2 for the model. Using the 'iris' data set, I wanted to see if I could accomplish this with the caret package. I use the following code:

    library(caret)
    data(iris)
    #needed to convert to numeric in order to do regression
    #I don't fully understand this but if I left as a factor I would get an error following the summary function
    iris$Species=as.numeric(iris$Species)
    inTrain1=createDataPartition(y=iris$Species, p=.75, list=FALSE)
    training1=iris[inTrain1,]
    testing1=iris[-inTrain1,]
    ctrl1=trainControl(method="cv", number=10)
    plsFit2=train(Species~., data=training1, method="pls", trControl=ctrl1, metric="Rsquared", preProc=c("scale"))
    data(iris)
    training1=iris[inTrain1,]
    datvars=training1[,1:4]
    dat.sc=scale(datvars)
    n=nrow(dat.sc)
    dat.indices=seq(1,n)
    timematrix=with(training1, classvec2classmat(Species[dat.indices]))
    pls.dat=plsr(timematrix ~ dat.sc, ncomp=3, method="oscorespls", data=training1)
    x=crossval(pls.dat, segments=10)
    summary(x)
    summary(plsFit2)

I see two different R2 values and I cannot figure out how to get the Q2 value. Any insight as to what my errors may be would be appreciated.

Regards,

-- Charles

-- Charles Determan
Integrated Biosciences PhD Student
University of Minnesota
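The two R^2 definitions that come up in this thread (squared Pearson correlation versus 1 - RSS/TSS) can give very different numbers for the same predictions, which may be part of why two tools report two values. A small base-R sketch with simulated data follows; the data and variable names are made up, and this is not a claim about which formula any particular package uses.

```r
set.seed(1)

# Hypothetical observed values and predictions that are strongly correlated
# with them but badly calibrated (wrong slope and intercept).
obs  <- rnorm(50)
pred <- 2 * obs + 1 + rnorm(50, sd = 0.1)

# Definition 1: squared Pearson correlation between observed and predicted.
r2_cor <- cor(obs, pred)^2

# Definition 2: 1 - residual sum of squares / total sum of squares.
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(correlation = r2_cor, traditional = r2_trad))
```

The correlation-based value ignores calibration, so it stays near 1 here, while the sum-of-squares version penalizes the systematic offset and can even go negative.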
Re: [R] odfWeave: Trouble Getting the Package to Work
That's not a reproducible example. There is no sessionInfo() and you omitted code (where did 'fp' come from?).

It works fine for me (see sessionInfo below) using the code in ?odfWeave.

As for the file paths: you can point to different paths for the files (although don't change the working directory in the odt file). If you read the documentation for workDir:

    a path to a directory where the source file will be unpacked and processed. If it does not exist, it will be created. If it exists, it should be empty, since all its contents will be included in the generated file.

The default value should be sufficient.

Max

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] grid stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] MASS_7.3-22 odfWeave_0.8.2 XML_3.95-0.1 lattice_0.20-10

    loaded via a namespace (and not attached):
    [1] tools_2.15.2

On Mon, Feb 18, 2013 at 8:52 AM, Paul Miller pjmiller...@yahoo.com wrote:

Hello All,

I have recently started learning Sweave and Knitr, and am now trying to learn odfWeave as well. Things went pretty smoothly with Sweave and Knitr but I'm having some trouble with odfWeave.

My understanding was that odfWeave should work in pretty much the same way as Sweave. With odfWeave, you set up an input .odt file in a folder, run that file through the odfWeave function, and then the function produces an output .odt file in the same folder.

So I decided to try that using a file called simple.odt that comes with the odfWeave package. Unfortunately, things didn't work out quite as I had hoped. Below is the result of my attempt to odfWeave that file via Emacs.

For some reason, odfWeave is setting the wd to a location on the C drive when my input file is on the N drive. I tried altering this by setting the location of workDir to my folder on the N drive. odfWeave threw up an error saying that this folder already exists. So perhaps the files are supposed to be processed in a location other than the one where the input file resides.

The other thing is that odfWeave is finding an unexpected . There is text in the simple.odt input file that looks like paste(levels(iris$Species), collapse = but it has no . So presumably something is wrong in the xml markup that is being produced.

If anyone can help me understand what is going wrong here, that would be greatly appreciated.

Thanks,

Paul

    > library(odfWeave)
    Loading required package: lattice
    Loading required package: XML
    > inFile <- paste(fp, "simple.odt", sep="")
    > outFile <- paste(fp, "output.odt", sep="")
    > odfWeave(inFile, outFile)
    Copying N:/Studies/HCRPC1211/Documentation/R Documentation/odfWeave Documentation/Examples/Example 1/simple.odt
    Setting wd to C:\Users\pmiller\AppData\Local\Temp\3\RtmpMlDMHV/odfWeave18071055703
    Unzipping ODF file using unzip -o simple.odt
    Archive: simple.odt
    extracting: mimetype
    inflating: meta.xml
    inflating: settings.xml
    inflating: content.xml
    extracting: Thumbnails/thumbnail.png
    inflating: layout-cache
    inflating: manifest.rdf
    creating: Configurations2/popupmenu/
    creating: Configurations2/images/Bitmaps/
    creating: Configurations2/toolpanel/
    creating: Configurations2/statusbar/
    creating: Configurations2/toolbar/
    creating: Configurations2/progressbar/
    creating: Configurations2/menubar/
    creating: Configurations2/floater/
    inflating: Configurations2/accelerator/current.xml
    inflating: styles.xml
    inflating: META-INF/manifest.xml
    Removing simple.odt
    Creating a Pictures directory
    Pre-processing the contents
    Sweaving content.Rnw
    Writing to file content_1.xml
    Processing code chunks ...
    Error in parse(text = cmd) : text:1:40: unexpected ''
    1: paste(levels(iris$Species), collapse =
                                              ^
Re: [R] CARET: Any way to access other tuning parameters?
James,

You really need to read the documentation. Almost every question that you have has been addressed in the existing material. For this one, there is a section on custom models here:

http://caret.r-forge.r-project.org/training.html

Max

On Wed, Feb 13, 2013 at 9:58 AM, James Jong ribonucle...@gmail.com wrote:

The documentation for caret::train shows a list of parameters that one can tune for each classification/regression method. For example, for the method randomForest one can tune mtry in the call to train. But the function call to train random forests in the original package has many other parameters, e.g. sampsize, maxnodes, etc.

Is there **any** way to access these parameters using train in caret? (Is the function caret::createGrid limited to the list of parameters specified in the caret documentation? It's not super clear whether the list of parameters applies to all the caret APIs.)

Thanks,

James
Re: [R] CARET: Any way to access other tuning parameters?
> @Max - Thanks a lot for your help. I have already been using that website as a reference, and it's incredibly helpful. I have also been experimenting with tuneGrid already. My question was specifically whether tuneGrid (or caret in general) supports passing method parameters to the method functions from each package other than those listed in the caret documentation (e.g. I would like to specify sampsize and nodesize for randomForest, and not just mtry).

Yes. A custom method is how you do that.

> Thanks,
>
> James

On Wed, Feb 13, 2013 at 1:07 PM, Max Kuhn mxk...@gmail.com wrote:

James,

You really need to read the documentation. Almost every question that you have has been addressed in the existing material. For this one, there is a section on custom models here:

http://caret.r-forge.r-project.org/training.html

Max

On Wed, Feb 13, 2013 at 9:58 AM, James Jong ribonucle...@gmail.com wrote:

The documentation for caret::train shows a list of parameters that one can tune for each classification/regression method. For example, for the method randomForest one can tune mtry in the call to train. But the function call to train random forests in the original package has many other parameters, e.g. sampsize, maxnodes, etc.

Is there **any** way to access these parameters using train in caret? (Is the function caret::createGrid limited to the list of parameters specified in the caret documentation? It's not super clear whether the list of parameters applies to all the caret APIs.)

Thanks,

James
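A related but narrower option, sketched here as an aside rather than as the custom-method route described above: train() forwards extra arguments in `...` to the underlying fitting function, so parameters such as nodesize and sampsize can at least be held at fixed values while mtry is tuned. The specific values below are arbitrary assumptions, and the sketch needs both caret and randomForest installed.

```r
library(caret)

set.seed(1)
fit <- train(
  x = iris[, 1:4],
  y = iris$Species,
  method = "rf",
  tuneLength = 2,                                    # caret tunes mtry only
  trControl = trainControl(method = "cv", number = 5),
  nodesize = 5,                                      # passed through to randomForest()
  sampsize = 50                                      # passed through to randomForest()
)

print(fit)
```

These values are not tuned: every resample and every candidate mtry uses the same nodesize and sampsize. Searching over them as well is where a custom method comes in.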
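For readers who only need fixed (not tuned) values for the extra randomForest arguments, here is a minimal sketch. It assumes that train() forwards arguments it does not recognise, via `...`, to the underlying fitting function, as described in ?train. The data set (iris) and the particular values of ntree, nodesize and sampsize are illustrative assumptions, not taken from the thread. Tuning those arguments over a grid would still need a custom method, as Max says.

```r
library(caret)
library(randomForest)

set.seed(1)
fit <- train(
  x = iris[, 1:4], y = iris$Species,
  method = "rf",
  # Only mtry is a tuning parameter of the built-in "rf" method.
  # (Older caret releases expected the column to be named ".mtry".)
  tuneGrid = data.frame(mtry = 2),
  trControl = trainControl(method = "cv", number = 5),
  # Extra arguments: assumed to be passed through `...` to randomForest()
  ntree = 200, nodesize = 5, sampsize = 100
)

fit$finalModel$ntree  # number of trees used in the final fit
```

The extra arguments stay fixed across all resamples, so they do not appear in the tuning results.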
Re: [R] pROC and ROCR give different values for AUC
A reproducible example sent to the package maintainer(s) might yield results.

Max

On Wed, Dec 19, 2012 at 7:47 AM, Ivana Cace i.c...@ati-a.nl wrote:

Packages pROC and ROCR both calculate/approximate the Area Under the (Receiver Operating Characteristic) Curve. However, the results are different. I am computing a new variable as a predictor for a label. The new variable is a (non-linear) function of a set of input values, and I'm checking how different parameter settings contribute to prediction. All my settings are predictive, but some are better. The AUC I got with pROC was much lower than expected, so I tried ROCR. Here are some comparisons:

AUC from pROC   AUC from ROCR
0.49465         0.79311
0.49465         0.79349
0.49701         0.79446
0.49701         0.79764

When I draw the ROC (with pROC) I get the curve I expect. But why is the AUC according to pROC so different?

Ivana
Re: [R] Help with this error kernlab class probability calculations failed; returning NAs
You didn't provide the results of sessionInfo(). Upgrade to the version just released on CRAN and see if you still have the issue.

Max

On Thu, Nov 29, 2012 at 6:55 PM, Brian Feeny bfe...@mac.com wrote:

I have never been able to get class probabilities to work and I am relatively new to using these tools, and I am looking for some insight as to what may be wrong. I am using caret with kernlab/ksvm. I will simplify my problem to a basic data set which produces the same problem. I have read the caret vignettes as well as documentation for ?train. I appreciate any direction you can give. I realize this is a very small dataset, the actual data is much larger, I am just using 10 rows as an example:

trainset <- data.frame(
  outcome = factor(c(0,1,0,1,0,1,1,1,1,0)),
  age = c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
  amount = c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
)

str(trainset)
'data.frame': 7 obs. of 3 variables:
 $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
 $ age    : num 23 5 28 48 82 11 9
 $ amount : num 22.2 494.2 2 39.2 39.2 ...

colSums(is.na(trainset))
outcome     age  amount
      0       0       0

## SAMPLING AND FORMULA
dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]

## TUNE caret / kernlab
set.seed(1)
MyTrainControl=trainControl(
  method = "repeatedcv",
  number=10,
  repeats=5,
  returnResamp = "all",
  classProbs = TRUE
)

## MODEL
rbfSVM <- train(outcome~., data = trainset,
  method="svmRadial",
  preProc = c("scale"),
  tuneLength = 10,
  trControl=MyTrainControl,
  fit = FALSE
)

There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In train.default(x, y, weights = w, ...) :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
2: In caret:::predictionFunction(method = method, modelFit = mod$fit, ... :
  kernlab class prediction calculations failed; returning NAs
Re: [R] Help with this error kernlab class probability calculations failed; returning NAs
Your output has:

  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1

Try changing the factor levels to avoid leading numbers and try again.

Max

On Thu, Nov 29, 2012 at 10:18 PM, Brian Feeny bfe...@mac.com wrote:

Yes I am still getting this error, here is my sessionInfo:

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] e1071_1.6-1 class_7.3-5 kernlab_0.9-14 caret_5.15-045 foreach_1.4.0 cluster_1.14.3
[7] reshape_0.8.4 plyr_1.7.1 lattice_0.20-10

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2 iterators_1.0.6 tools_2.15.2

Is there an example that shows classProbs in use? I could try to run it to replicate and see if it works on my system.

Brian

On Nov 29, 2012, at 10:10 PM, Max Kuhn mxk...@gmail.com wrote:

You didn't provide the results of sessionInfo(). Upgrade to the version just released on CRAN and see if you still have the issue.

Max

On Thu, Nov 29, 2012 at 6:55 PM, Brian Feeny bfe...@mac.com wrote:

I have never been able to get class probabilities to work and I am relatively new to using these tools, and I am looking for some insight as to what may be wrong. I am using caret with kernlab/ksvm. I will simplify my problem to a basic data set which produces the same problem. I have read the caret vignettes as well as documentation for ?train. I appreciate any direction you can give. I realize this is a very small dataset, the actual data is much larger, I am just using 10 rows as an example:

trainset <- data.frame(
  outcome = factor(c(0,1,0,1,0,1,1,1,1,0)),
  age = c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
  amount = c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
)

str(trainset)
'data.frame': 7 obs. of 3 variables:
 $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
 $ age    : num 23 5 28 48 82 11 9
 $ amount : num 22.2 494.2 2 39.2 39.2 ...

colSums(is.na(trainset))
outcome     age  amount
      0       0       0

## SAMPLING AND FORMULA
dataset <- trainset
index <- 1:nrow(dataset)
testindex <- sample(index, trunc(length(index)*30/100))
trainset <- dataset[-testindex,]
testset <- dataset[testindex,-1]

## TUNE caret / kernlab
set.seed(1)
MyTrainControl=trainControl(
  method = "repeatedcv",
  number=10,
  repeats=5,
  returnResamp = "all",
  classProbs = TRUE
)

## MODEL
rbfSVM <- train(outcome~., data = trainset,
  method="svmRadial",
  preProc = c("scale"),
  tuneLength = 10,
  trControl=MyTrainControl,
  fit = FALSE
)

There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In train.default(x, y, weights = w, ...) :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
2: In caret:::predictionFunction(method = method, modelFit = mod$fit, ... :
  kernlab class prediction calculations failed; returning NAs
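The suggested fix (factor levels that are valid R names) can be sketched in base R as follows. The labels "no" and "yes" are an arbitrary choice made for this illustration; any syntactically valid names should do.

```r
# The outcome as built in the original post: levels are "0" and "1"
outcome <- factor(c(0, 1, 0, 1, 0, 1, 1, 1, 1, 0))
levels(outcome)
make.names(levels(outcome))   # what the levels would be converted to: "X0" "X1"

# Relabel so that the levels are already valid R variable names
outcome_fixed <- factor(outcome, levels = c("0", "1"), labels = c("no", "yes"))

# A level set is "safe" when make.names() leaves it unchanged
valid <- all(levels(outcome_fixed) == make.names(levels(outcome_fixed)))
valid
```

Doing the relabelling before calling train() with classProbs = TRUE avoids the first warning in the output above.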
Re: [R] caret train and trainControl
Brian,

This is all outlined in the package documentation. The final model is fit automatically. For example, using 'verboseIter' provides details. From ?train:

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv", verboseIter = TRUE))
+ Fold01: k= 5
- Fold01: k= 5
+ Fold01: k= 7
- Fold01: k= 7
+ Fold01: k= 9
- Fold01: k= 9
+ Fold01: k=11
- Fold01: k=11
<snip>
+ Fold10: k=17
- Fold10: k=17
+ Fold10: k=19
- Fold10: k=19
+ Fold10: k=21
- Fold10: k=21
+ Fold10: k=23
- Fold10: k=23
Aggregating results
Selecting tuning parameters
Fitting model on full training set

Max

On Fri, Nov 23, 2012 at 5:52 PM, Brian Feeny bfe...@mac.com wrote:

I am used to packages like e1071 where you have a tune step and then pass your tunings to train. It seems with caret, tuning and training are both handled by train. I am using train and trainControl to find my hyperparameters like so:

MyTrainControl=trainControl(
  method = "cv",
  number=5,
  returnResamp = "all",
  classProbs = TRUE
)

rbfSVM <- train(label~., data = trainset,
  method="svmRadial",
  tuneGrid = expand.grid(.sigma=c(0.0118),.C=c(8,16,32,64,128)),
  trControl=MyTrainControl,
  fit = FALSE
)

Once this returns my ideal parameters, in this case a cost of 64, do I simply re-run the whole process again, passing a grid containing only the specific parameters? Like so?

rbfSVM <- train(label~., data = trainset,
  method="svmRadial",
  tuneGrid = expand.grid(.sigma=0.0118,.C=64),
  trControl=MyTrainControl,
  fit = FALSE
)

This is what I have been doing but I am new to caret and want to make sure I am doing this correctly.

Brian
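To illustrate that no second call to train() is needed, here is a small sketch using the knn example on iris. The data set, the 5-fold setting and the object name knnFit are assumptions made for this illustration; the components bestTune and finalModel and the predict() method are part of the object that train() returns.

```r
library(caret)

set.seed(1)
knnFit <- train(iris[, 1:4], iris$Species,
                method = "knn",
                preProcess = c("center", "scale"),
                tuneLength = 5,
                trControl = trainControl(method = "cv", number = 5))

knnFit$bestTune           # the tuning value chosen by resampling
class(knnFit$finalModel)  # the model already refit on the full training set

# Predictions go through the train object, which uses the final model
# and re-applies the centering and scaling automatically
preds <- predict(knnFit, newdata = iris[1:5, 1:4])
preds
```

Re-running train() with a one-row grid would only repeat the resampling for the chosen value; it should not change the final model beyond resampling noise.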
Re: [R] Decision Tree: Am I Missing Anything?
Vik,

On Fri, Sep 21, 2012 at 12:42 PM, Vik Rubenfeld v...@mindspring.com wrote:

Max, I installed C50. I have a question about the syntax. Per the C50 manual:

## Default S3 method:
C5.0(x, y, trials = 1, rules = FALSE, weights = NULL, control = C5.0Control(), costs = NULL, ...)

## S3 method for class 'formula'
C5.0(formula, data, weights, subset, na.action = na.pass, ...)

I believe I need the method for class 'formula'. But I don't yet see in the manual how to tell C50 that I want to use that method. If I run:

respLevel = read.csv("Resp Level Data.csv")
respLevelTree = C5.0(BRAND_NAME ~ PRI + PROM + REVW + MODE + FORM + FAMI + DRRE + FREC + SPED, data = respLevel)

...I get an error message:

Error in gsub(":", ".", x, fixed = TRUE) : input string 18 is invalid in this locale

You're not doing it wrong. Can you send me the results of sessionInfo()? I think there are a few issues with the function on Windows, so a reproducible example would help solve the issue.

-- Max
Re: [R] Caret: Use timingSamps leads to error
I can reproduce the errors. I'll take a look.

Thanks, Max

On Thu, Jul 12, 2012 at 5:24 AM, Dominik Bruhn domi...@dbruhn.de wrote:

I want to use the caret package and found out about the timingSamps option to obtain the time which is needed to predict results. But, as soon as I set a value for this option, the whole model generation fails. Check this example:

library(caret)
tc=trainControl(method='LGOCV', timingSamps=10)
tcWithout=trainControl(method='LGOCV')
x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tcWithout)
x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tc)
Error in eval(expr, envir, enclos) : object 'Girth' not found
Timing stopped at: 0 0 0.003

As you can see, the model generation works without the timingSamps option but fails if it is specified. What am I doing wrong?

My sessionInfo:

R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] MASS_7.3-18 caret_5.15-023 foreach_1.4.0 cluster_1.14.2 reshape_0.8.4
[6] plyr_1.7.1 lattice_0.20-6

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
[5] tools_2.15.0

Thanks!
-- Dominik Bruhn mailto: domi...@dbruhn.de

-- Max
Re: [R] caret() train based on cross validation - split dataset to keep sites together?
Tyrell,

If you want to have the folds contain data from only one site at a time, you can develop a set of row indices and pass these to the index argument in trainControl. For example:

index = list(site1 = c(1, 6, 8, 12), site2 = c(120, 152, 176, 178), site3 = c(754, 789, 981))

The first fold would fit a model on the site 1 data listed in the first element and predict everything else, and so on. I'm not sure if this is what you need, but there you go.

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber jtdewe...@gmail.com wrote:

Hello all,

I have searched and have not yet identified a solution so now I am sending this message. In short, I need to split my data into training, validation, and testing subsets that keep all observations from the same sites together, preferably as part of a cross validation procedure. Now for the longer version. And I must confess that although my R skills are improving, they are not so highly developed.

I am using 10 fold cross validation with 3 repeats in the train function of the caret package to identify an optimal nnet (neural network) model to predict daily river water temperature at unsampled sites. I am also withholding data from 10% of sites to have a better understanding of generalization error. However, the focus on predictions at other sites is turning out to be not easily facilitated, as far as I can see. My data structure (example at bottom of email) consists of columns identifying the site, the date, the water temperature on that day for the site (response variable), and many predictors. There are over 220,000 individual observations at ~1,000 sites, and each site has a minimum of 30 observations. It is important to keep sites separate because selecting a model based on predictions at an already sampled site is likely overly optimistic.

Is there a way to split data for (or preferably during) the cross validation procedure that:

1.) Selects a separate validation dataset from 10% of sites
2.) Splits remaining training data into cross validation subsets and, most importantly, keeps all observations from a site together
3.) Secondarily, constrains partitions to be similar - ideally based on distributions of all variables

It seems that some combination of the sample.split function of the caTools package and the createDataPartition function of caret might do this, but I am at a loss for how to code that. If this is not possible, I would be content to skip the cross validation procedure and create three similar splits of my data that keep all observations from a site together: one for training, one for testing, and one for validation. The alternative goal here would be to split the data where 80% of sites are for training, 10% of sites are for testing (model selection), and 10% of sites are for validation.

Thank you and please let me know if there are any remaining questions. This is my first post as well, so if I left anything out that would be good to know as well.

Tyrell Deweber

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

Comid  tempymd     watmntemp  airtemp  predictorb …
15433  1980-05-01  11.4       22.1     …
15433  1980-05-02  11.6       23.6     …
15433  1980-05-03  11.2       28.5
15687  1980-06-01  13.5       26.5
15687  1980-06-02  14.2       26.9
15687  1980-06-03  13.8       28.9
18994  1980-04-05   8.4       16.4
18994  1980-04-06   8.3       12.6
90342  1980-07-13  18.9       22.3
90342  1980-07-14  19.3       28.4

EXAMPLE SCRIPT FOR MODEL FITTING

fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3)
tuning <- read.table("temptunegrid.txt",head=T,sep=",")
tuning
#
# Model with 100 iterations
registerDoMC(4)
tempmod100its <- train(watmntemp~tempa + tempb + tempc + tempd + tempe + netarea + netbuffor + strmslope + netsoilprm + netslope + gwndx + mnaspect + urb + ag + forest + buffor + tempa7day + tempb7day + tempc7day + tempd7day + tempe7day + tempa30day + tempb30day + tempc30day + tempd30day + tempe30day,
  data = temp.train,
  method = "nnet",
  linout=T,
  maxit = 100,
  MaxNWts = 10,
  metric = "RMSE",
  trControl = fitControl,
  tuneGrid = tuning,
  trace = T)

-- Max
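A base R sketch of how such an index list could be built so that whole sites are held out together. The toy data frame, the number of folds k and all object names are assumptions made for this illustration. Each list element holds the rows used for fitting, which is how the index argument of trainControl is described above; the remaining rows of that fold are predicted.

```r
set.seed(42)

# Hypothetical toy data: 12 sites with 5 observations each
dat <- data.frame(site = rep(sprintf("s%02d", 1:12), each = 5),
                  y    = rnorm(60))

k <- 4
sites <- unique(as.character(dat$site))

# Assign every site (not every row) to one of k folds
fold_of_site <- sample(rep(seq_len(k), length.out = length(sites)))
names(fold_of_site) <- sites

# Look up the fold of each row through its site
row_fold <- fold_of_site[as.character(dat$site)]

# Training rows for fold f: every row whose site is NOT in fold f
index <- lapply(seq_len(k), function(f) which(row_fold != f))
names(index) <- sprintf("Fold%d", seq_len(k))

str(index)
# The list could then be supplied as trainControl(index = index)
```

Holding out 10% of the sites for final validation can be done the same way beforehand: sample the site identifiers first, then subset the rows.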
Re: [R] caret: Error when using rpart and CV != LOOCV
Dominik,

There are a number of formulations of this statistic (see the Kvålseth[*] reference below). I tend to think of R^2 as the proportion of variance explained by the model[**]. With the traditional formula, it is possible to get negative proportions (if there are extreme outliers in the predictions, the negative proportion can be very large). I used the squared-correlation formulation because it is always on [0, 1]. It is called R^2 after all!

Here is an example:

set.seed(1)
simObserved <- rnorm(100)
simPredicted <- simObserved + rnorm(100)*.1

cor(simObserved, simPredicted)^2
[1] 0.9887525
customSummary(data.frame(obs = simObserved, pred = simPredicted))
      RMSE   Rsquared
0.09538273 0.98860908

simPredicted[1]
[1] -0.6884905
simPredicted[1] <- 10

cor(simObserved, simPredicted)^2
[1] 0.3669257
customSummary(data.frame(obs = simObserved, pred = simPredicted))
     RMSE  Rsquared
 1.066900 -0.425169

It is somewhat extreme, but it does happen.

Max

* Kvålseth, T. (1985). Cautionary note about R^2. American Statistician, 39(4), 279-285.

** This is a very controversial statement when non-linear models are used. I'd rather use RMSE, but many scientists I work with still think in terms of R^2 regardless of the model. The randomForest function also computes this statistic, but calls it "% Var explained" instead of explicitly labeling it as R^2. This statistic has generated heated debates and I hope that I will not have to wear a scarlet R in Nashville in a few weeks.

On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn domi...@dbruhn.de wrote:

Hi Max,

thanks again for the answer. I checked the caret implementation and you were right. If the predictions of the model are constant (i.e. sd(pred)==0), then the implementation returns NA for the rSquare (in postResample). This is mainly because the caret implementation uses `cor` (from the stats package), which cannot compute a correlation when sd(pred)==0. Do you know why this is implemented in this way?

I wrote my own summaryFunction which calculates rSquare by hand and it works fine. It nevertheless does NOT(!) generate the same values as the original implementation. It seems that the calculation of Rsquare is not consistent. I took mine from Wikipedia [1]. Here is my code:

customSummary <- function (data, lev = NULL, model = NULL) {
  #Calculate rSquare
  ssTot <- sum((data$obs-mean(data$obs))^2)
  ssErr <- sum((data$obs-data$pred)^2)
  rSquare <- 1-(ssErr/ssTot)
  #Calculate MSE
  mse <- mean((data$pred - data$obs)^2)
  #Aggregate
  out <- c(sqrt(mse), 1-(ssErr/ssTot))
  names(out) <- c("RMSE", "Rsquared")
  return(out)
}

[1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions

Thanks! Dominik

On 17/05/12 04:10, Max Kuhn wrote:

Dominik, See this line:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37

The variance of the predictions is zero. caret computes R^2 as the squared correlation between the observed data and the predictions, which divides by sd(pred), and that is zero here. I believe that the same would occur with other formulas for R^2.

Max

On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn domi...@dbruhn.de wrote:

Thanks Max for your answer. First, I do not understand your post. Why is it a problem if two of the predictions match? From the formula for calculating R^2 I can see that there will be a DivByZero iff the total sum of squares is 0. This is only true if all the observed values in the test set are equal to the mean of the test set. Why should this happen?

Anyway, I wrote the following code to check what you tried to tell me:

library(caret)
data(trees)
formula=Volume~Girth+Height
customSummary <- function (data, lev = NULL, model = NULL) {
  print(summary(data$pred))
  return(defaultSummary(data, lev, model))
}
tc=trainControl(method='cv', summaryFunction=customSummary)
train(formula, data=trees, method='rpart', trControl=tc)

This outputs:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.45   18.45   18.45   30.12   35.95   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.69   22.69   22.69   32.94   38.06   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
[cut many values like this]
Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, : There were missing values in resampled performance measures.

As I didn't understand your post, I don't know if this confirms your assumption.

Thanks anyway, Dominik

On 16/05/12 17:30, Max Kuhn wrote:

More information is needed to be sure, but it is most likely that some of the resampled rpart models produce the same prediction for the hold-out samples (likely the result of no viable split being found). Almost
Re: [R] caret: Error when using rpart and CV != LOOCV
More information is needed to be sure, but it is most likely that some of the resampled rpart models produce the same prediction for the hold-out samples (likely the result of no viable split being found). Almost every incarnation of R^2 requires the variance of the prediction. This particular failure mode would result in a divide by zero. Try using your own summary function (see ?trainControl) and put a print(summary(data$pred)) in there to verify my claim.

Max

On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn domi...@dbruhn.de wrote:

Hi,

I got the following problem when trying to build an rpart model and using everything but LOOCV. Originally, I wanted to use k-fold partitioning, but every partitioning except LOOCV throws the following warning:

Warning message: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, : There were missing values in resampled performance measures.

Below are some simplified test cases which reproduce the warning on my system.

Question: What does this error mean? How can I avoid it?

System information:

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rpart_3.1-52 caret_5.15-023 foreach_1.4.0 cluster_1.14.2 reshape_0.8.4
[6] plyr_1.7.1 lattice_0.20-6

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
[5] tools_2.15.0

Simplified test case I: throws the warning

library(caret)
data(trees)
formula=Volume~Girth+Height
train(formula, data=trees, method='rpart')

Simplified test case II: every other CV method also throws the warning, for example using 'cv':

library(caret)
data(trees)
formula=Volume~Girth+Height
tc=trainControl(method='cv')
train(formula, data=trees, method='rpart', trControl=tc)

Simplified test case III: the only CV method which is working is 'LOOCV':

library(caret)
data(trees)
formula=Volume~Girth+Height
tc=trainControl(method='LOOCV')
train(formula, data=trees, method='rpart', trControl=tc)

Thanks!
-- Dominik Bruhn mailto: domi...@dbruhn.de

-- Max
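The failure mode described above can be reproduced in base R without caret. The numbers below are made up for this illustration; the point is only that a constant prediction vector has zero standard deviation, so the squared correlation is undefined, while the sum-of-squares formula still returns a value.

```r
# Hypothetical hold-out values and a constant prediction (a tree with no split)
obs  <- c(10.3, 18.8, 25.7, 31.7, 42.6)
pred <- rep(mean(obs), length(obs))

sd(pred)  # 0: every prediction is identical

# Squared-correlation form: cor() divides by sd(pred), so the result is NA
# (R also issues a "standard deviation is zero" warning, suppressed here)
r2_cor <- suppressWarnings(cor(obs, pred)^2)

# Traditional form: 1 - SS_err / SS_tot, defined as long as obs is not constant
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(correlation = r2_cor, traditional = r2_trad)
```

Resamples where this happens contribute NA to the Rsquared column, which is consistent with the "missing values in resampled performance measures" warning quoted in the thread.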
Re: [R] caret: Error when using rpart and CV != LOOCV
Dominik,

See this line:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37

The variance of the predictions is zero. caret computes R^2 as the squared correlation between the observed data and the predictions, which divides by sd(pred), and that is zero here. I believe that the same would occur with other formulas for R^2.

Max

On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn domi...@dbruhn.de wrote:

Thanks Max for your answer. First, I do not understand your post. Why is it a problem if two of the predictions match? From the formula for calculating R^2 I can see that there will be a DivByZero iff the total sum of squares is 0. This is only true if all the observed values in the test set are equal to the mean of the test set. Why should this happen?

Anyway, I wrote the following code to check what you tried to tell me:

library(caret)
data(trees)
formula=Volume~Girth+Height
customSummary <- function (data, lev = NULL, model = NULL) {
  print(summary(data$pred))
  return(defaultSummary(data, lev, model))
}
tc=trainControl(method='cv', summaryFunction=customSummary)
train(formula, data=trees, method='rpart', trControl=tc)

This outputs:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.45   18.45   18.45   30.12   35.95   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.69   22.69   22.69   32.94   38.06   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
[cut many values like this]
Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, : There were missing values in resampled performance measures.

As I didn't understand your post, I don't know if this confirms your assumption.

Thanks anyway, Dominik

On 16/05/12 17:30, Max Kuhn wrote:

More information is needed to be sure, but it is most likely that some of the resampled rpart models produce the same prediction for the hold-out samples (likely the result of no viable split being found).
Almost every incarnation of R^2 requires the variance of the prediction. This particular failure mode would result in a divide by zero. Try using you own summary function (see ?trainControl) and put a print(summary(data$pred)) in there to verify my claim. Max On Wed, May 16, 2012 at 11:30 AM, Max Kuhn mxk...@gmail.com wrote: More information is needed to be sure, but it is most likely that some of the resampled rpart models produce the same prediction for the hold-out samples (likely the result of no viable split being found). Almost every incarnation of R^2 requires the variance of the prediction. This particular failure mode would result in a divide by zero. Try using you own summary function (see ?trainControl) and put a print(summary(data$pred)) in there to verify my claim. Max On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn domi...@dbruhn.de wrote: Hy, I got the following problem when trying to build a rpart model and using everything but LOOCV. Originally, I wanted to used k-fold partitioning, but every partitioning except LOOCV throws the following warning: Warning message: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, : There were missing values in resampled performance measures. - Below are some simplified testcases which repoduce the warning on my system. Question: What does this error mean? How can I avoid it? 
System-Information: - sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rpart_3.1-52 caret_5.15-023 foreach_1.4.0 cluster_1.14.2 reshape_0.8.4 [6] plyr_1.7.1 lattice_0.20-6 loaded via a namespace (and not attached): [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6 [5] tools_2.15.0 --- Simlified Testcase I: Throws warning --- library(caret) data(trees) formula=Volume~Girth+Height train(formula, data=trees, method='rpart') --- Simlified Testcase II: Every other CV-method also throws the warning, for example using 'cv': --- library(caret) data(trees) formula=Volume~Girth+Height tc=trainControl(method='cv') train(formula, data=trees, method='rpart', trControl=tc) --- Simlified Testcase III: The only CV-method which is working is 'LOOCV': --- library(caret) data(trees) formula=Volume~Girth+Height tc=trainControl(method='LOOCV') train(formula, data=trees, method='rpart', trControl=tc) --- Thanks! -- Dominik Bruhn mailto: domi
Re: [R] caret package: custom summary function in trainControl doesn't work with oob?
Matt,

> I've been using a custom summary function to optimise regression model methods using the caret package. This has worked smoothly. I've been using the default bootstrapping resampling method. For bagging models (specifically randomForest in this case) caret can, in theory, use the out-of-bag (oob) error estimate from the model instead of resampling, which (in theory) is largely redundant for such models. Since they take a while to build in the first place, it really slows things down when estimating performance using bootstrap.
>
> I can successfully run either using the oob 'resampling method' with the default RMSE optimisation, or run using bootstrap and my custom summaryFunction as the thing to optimise, but they don't work together. If I try to use oob and supply a summaryFunction, caret throws an error saying it can't find the relevant metric.
>
> Now, if caret is simply polling the randomForest object for the stored oob error I can understand this limitation

That is exactly what it does. See caret:::rfStats (not a public function). train() was written to be fairly general and this level of control would be very difficult to implement, especially since each model that does some type of bagging uses different internal structures etc.

> but in the case of randomForest (and probably other bagging methods?) the training function can be asked to return information about the individual tree predictions and whether data points were oob in each case. With this information you can reconstruct an oob 'error' using whatever function you choose to target for optimisation. As far as I can tell, caret is not doing this and I can't see anywhere that it can be coerced to do so.

It will not be able to do this. I'm not sure that you can either. randomForest() will return the individual trees and predict.randomForest() can return the per-tree results, but I don't know if it saves the indices that tell you which bootstrap samples contained which training set points. Perhaps Andy would know.

> Have I missed something? Can anyone suggest how this could be achieved? It wouldn't be *that* hard to code up something that essentially operates in the same way as caret.train but can handle this feature for bagging models, but if it is already there and I've missed something please let me know.

Well, everything is easy for the person not doing it =]

If you save the proximity measures, you might gain the sampling indices. With these, you would use predict.randomForest(..., predict.all=TRUE) to get the individual predictions.

Max
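A minimal sketch of the per-tree reconstruction discussed in this thread, not something caret does itself. It assumes the installed randomForest version supports `keep.inbag = TRUE` (storing an `inbag` matrix with one column per tree) and that `predict(..., predict.all = TRUE)` returns an `individual` matrix; the helper names `oob_predict` and `mae` are hypothetical. The aggregation step is plain base R and is checked on a toy matrix first.

```r
# Hypothetical helper: average per-tree predictions over the trees for which
# each training row was out-of-bag (inbag count of zero).
oob_predict <- function(individual, inbag) {
  oob <- inbag == 0
  rowSums(individual * oob) / rowSums(oob)  # NaN if a row was never OOB
}

# Hypothetical custom metric to optimise instead of RMSE.
mae <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)

# Toy check: 2 rows, 3 trees.
individual <- matrix(c(1, 2, 3,
                       4, 5, 6), nrow = 2, byrow = TRUE)
inbag      <- matrix(c(0, 1, 0,
                       2, 0, 0), nrow = 2, byrow = TRUE)
toy <- oob_predict(individual, inbag)  # row 1 uses trees 1 and 3, row 2 uses trees 2 and 3

# The same idea on a real forest, only if randomForest is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  fit <- randomForest::randomForest(mpg ~ ., data = mtcars,
                                    ntree = 200, keep.inbag = TRUE)
  per_tree <- predict(fit, newdata = mtcars, predict.all = TRUE)$individual
  oob_hat  <- oob_predict(per_tree, fit$inbag)
  print(mae(mtcars$mpg, oob_hat))
}
```

Any metric with an (observed, predicted) signature can replace `mae` here; looping this over candidate `mtry` values would give an OOB-based tuning profile outside of train().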
[R] nonparametric densities for bounded distributions
Can anyone recommend a good nonparametric density approach for bounded data (say, between 0 and 1)? For example, using the basic Gaussian density approach doesn't generate a very realistic shape (nor should it):

   set.seed(1)
   dat <- rbeta(100, 1, 2)
   plot(density(dat))

(note the area outside of 0/1)

The data I have may be bimodal or have other odd properties (e.g. point mass at zero). I've tried transforming via the logit, estimating the density, then plotting the curve in the original units, but this seems to do poorly in the tails (and I have data at exactly zero and one).

Thanks,

Max
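One common option for this kind of question is boundary reflection, sketched below with base R only. This is an illustration under stated assumptions, not a recommendation made in the thread: the helper name `reflect_density` is hypothetical, the bandwidth is taken from the original data with `bw.nrd0()`, and a point mass at zero would still need separate handling.

```r
# Hypothetical helper: reflect the data about both bounds, estimate the
# density of the augmented sample, and keep only the [lower, upper] part.
reflect_density <- function(x, lower = 0, upper = 1, n = 512) {
  bw  <- bw.nrd0(x)                          # bandwidth from the original data
  aug <- c(x, 2 * lower - x, 2 * upper - x)  # data plus its two mirror images
  d   <- density(aug, bw = bw, from = lower, to = upper, n = n)
  d$y <- 3 * d$y                             # the augmented sample is 3x as large
  d
}

set.seed(1)
dat <- rbeta(100, 1, 2)
est <- reflect_density(dat)
plot(est, main = "Reflection estimate on [0, 1]")
```

The estimate is zero outside the bounds by construction and, unlike the logit approach, it tolerates observations at exactly 0 or 1.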
Re: [R] Custom caret metric based on prob-predictions/rankings
I think you need to read the man pages and the four vignettes. A lot of your questions have answers there.

If you don't specify the resampling indices, the ones generated for you are saved in the train object:

   data(iris)
   TrainData <- iris[,1:4]
   TrainClasses <- iris[,5]
   knnFit1 <- train(TrainData, TrainClasses,
                    method = "knn",
                    preProcess = c("center", "scale"),
                    tuneLength = 10,
                    trControl = trainControl(method = "cv"))
   Loading required package: class
   Attaching package: ‘class’
   The following object(s) are masked from ‘package:reshape’:
       condense
   Warning message:
   executing %dopar% sequentially: no parallel backend registered

   str(knnFit1$control$index)
   List of 10
    $ Fold01: int [1:135] 1 2 3 4 5 6 7 9 10 11 ...
    $ Fold02: int [1:135] 1 2 3 4 5 6 8 9 10 12 ...
    $ Fold03: int [1:135] 1 3 4 5 6 7 8 9 10 11 ...
    $ Fold04: int [1:135] 1 2 3 5 6 7 8 9 10 11 ...
    $ Fold05: int [1:135] 1 2 3 4 6 7 8 9 11 12 ...
    $ Fold06: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
    $ Fold07: int [1:135] 1 2 3 4 5 7 8 9 10 11 ...
    $ Fold08: int [1:135] 2 3 4 5 6 7 8 9 10 11 ...
    $ Fold09: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
    $ Fold10: int [1:135] 1 2 4 5 6 7 8 10 11 12 ...

There is also a savePredictions argument that gives you the hold-out results.

I'm not sure which weights you are referring to.

On Fri, Feb 10, 2012 at 4:38 AM, Yang Zhang yanghates...@gmail.com wrote:
> Actually, is there any way to get at additional information beyond the classProbs? In particular, is there any way to find out the associated weights, or otherwise the row indices into the original model matrix corresponding to the tested instances?
>
> On Thu, Feb 9, 2012 at 4:37 PM, Yang Zhang yanghates...@gmail.com wrote:
>> Oops, found trainControl's classProbs right after I sent!
>>
>> On Thu, Feb 9, 2012 at 4:30 PM, Yang Zhang yanghates...@gmail.com wrote:
>>> I'm dealing with classification problems, and I'm trying to specify a custom scoring metric (recall@p, ROC, etc.) that depends on not just the class output but the probability estimates, so that caret::train can choose the optimal tuning parameters based on this metric.
>>>
>>> However, when I supply a trainControl summaryFunction, the data given to it contains only class predictions, so the only metrics possible are things like accuracy, kappa, etc.
>>>
>>> Is there any way to do this that I'm overlooking? If not, could I put this in as a feature request? Thanks!
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/

--
Max
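A sketch of a probability-based summary function of the kind asked about in this thread. It assumes, as with caret's own twoClassSummary, that when `classProbs = TRUE` the data frame handed to the summary function has `obs` and `pred` columns plus one probability column per class level; the name `aucSummary` and the toy data are hypothetical. The function itself is base R, so it can be checked without caret.

```r
# Hypothetical summary function: rank-based (Mann-Whitney) AUC, treating the
# first factor level as the event of interest.
aucSummary <- function(data, lev = NULL, model = NULL) {
  pos    <- lev[1]
  is_pos <- data$obs == pos
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(data[[pos]])  # ranks of the predicted probability of `pos`
  auc    <- (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  c(AUC = auc)
}

# Toy input shaped like what the function would receive.
toy <- data.frame(
  obs  = factor(c("yes", "yes", "no", "no"), levels = c("yes", "no")),
  pred = factor(c("yes", "yes", "no", "no"), levels = c("yes", "no")),
  yes  = c(0.9, 0.8, 0.3, 0.1),
  no   = c(0.1, 0.2, 0.7, 0.9)
)
toy_auc <- aucSummary(toy, lev = levels(toy$obs))

# With caret attached, it would be wired in roughly like this:
#   ctrl <- trainControl(method = "cv", classProbs = TRUE,
#                        summaryFunction = aucSummary)
#   fit  <- train(x, y, method = "knn", metric = "AUC", trControl = ctrl)
```

The `metric` string passed to train() has to match a name in the vector the summary function returns, here "AUC".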
Re: [R] Choosing glmnet lambda values via caret
You can adjust the candidate set of tuning parameters via the tuneGrid argument in train() and the process by which the optimal choice is made (via the 'selectionFunction' argument in trainControl()). Check out the package vignettes.

The latest version also has an update.train() function that lets the user manually specify the tuning parameters after the call to train().

On Thu, Feb 9, 2012 at 7:00 PM, Yang Zhang yanghates...@gmail.com wrote:
> Usually when using raw glmnet I let the implementation choose the lambdas. However when training via caret::train the lambda values are predetermined. Is there any way to have caret defer the lambda choices to caret::train and thus choose the optimal lambda dynamically?
>
> --
> Yang Zhang
> http://yz.mit.edu/

--
Max
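A hedged sketch of supplying a custom candidate grid, as described above. The grid values are arbitrary illustrations. The column names `alpha` and `lambda` are an assumption based on recent caret releases; I believe releases from the time of this thread expected a leading dot (`.alpha`, `.lambda`), so check the documentation for the installed version. The train() call runs only if caret and glmnet are available.

```r
# Candidate grid: 3 mixing values crossed with 10 penalties on a log scale.
glmnet_grid <- expand.grid(
  alpha  = c(0.1, 0.5, 1),
  lambda = 10^seq(-4, 0, length.out = 10)
)

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  fit <- caret::train(
    x = as.matrix(mtcars[, -1]), y = mtcars$mpg,
    method    = "glmnet",
    tuneGrid  = glmnet_grid,
    # "oneSE" picks the simplest model within one standard error of the best
    trControl = caret::trainControl(method = "cv", number = 5,
                                    selectionFunction = "oneSE")
  )
  print(fit$bestTune)
}
```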
[R] lattice key in blank panel
Somewhere I've seen an example of an xyplot() where the key was placed in a location of a missing panel. For example, if there were 3 conditioning levels, the panel grid would look like:

   3 4
   1 2

In this (possibly imaginary) example, there were scatter plots in locations 1:3 and location 4 had no conditioning bar at the top, only the key.

I can find examples of putting the legend outside of the panel locations (e.g., to the right of locations 2 and 4 above), but that's not really what I'd like to do.

Thanks,

Max
[R] palettes for the color-blind
Everyone,

I'm working with scatter plots with different colored symbols (via lattice). I'm currently using these colors for points and lines:

   col1 <- c(rgb(1, 0, 0), rgb(0, 0, 1), rgb(0, 1, 0), rgb(0.55482458, 0.40350876, 0.0416), rgb(0, 0, 0))
   plot(seq(along = col1), pch = 16, col = col1, cex = 1.5)

I'm also using these with transparency (alpha between .5-.8 depending on the number of points).

I'd like to make sure that these colors are interpretable by the color blind. Doing a little looking around, this might be a good palette:

   col2 <- c(rgb(0, 0.4470588, 0.6980392), rgb(0.8352941, 0.3686275, 0), rgb(0.800, 0.4745098, 0.6549020), rgb(0.1686275, 0.6235294, 0.4705882), rgb(0.9019608, 0.6235294, 0.000))
   plot(seq(along = col2), pch = 16, col = col2, cex = 1.5)

but to be honest, I'd like to use something a little more vibrant.

First, can anyone verify that the colors in col2 are differentiable to someone who is color blind?

Second, are there any other specific palettes that can be recommended? How do the RColorBrewer palettes rate in this respect?

Thanks,

Max
Re: [R] palettes for the color-blind
Yes, I was aware of the different types and their respective prevalences. The dichromat package helped me find what I needed.

Thanks,

Max

On Wed, Nov 2, 2011 at 6:38 PM, Thomas Lumley tlum...@uw.edu wrote:
> On Thu, Nov 3, 2011 at 11:04 AM, Carl Witthoft c...@witthoft.com wrote:
>> Before you pick out a palette: you are aware that there are several different types of color-blindness, aren't you?
>
> Yes, but to first approximation there are only two, and they have broadly similar, though not identical, impact on choice of color palettes. The dichromat package knows about them, and so does Professor Brewer.
>
> More people will be unable to read your graphs due to some kind of gross visual impairment (cataracts, uncorrected focusing problems, macular degeneration, etc) than will have tritanopia or monochromacy.
>
> -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland

--
Max
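A rough sketch of how the dichromat package mentioned above can be used to preview a palette. The thread does not show the code that was actually run, so this is only an assumed workflow: it relies on `dichromat::dichromat(colours, type = ...)` being available, uses the candidate palette from the original question, and draws the plots only if the package is installed.

```r
# Candidate palette from the original question, as hex strings.
col2 <- c(rgb(0, 0.4470588, 0.6980392),
          rgb(0.8352941, 0.3686275, 0),
          rgb(0.800, 0.4745098, 0.6549020),
          rgb(0.1686275, 0.6235294, 0.4705882),
          rgb(0.9019608, 0.6235294, 0.000))

if (requireNamespace("dichromat", quietly = TRUE)) {
  op <- par(mfrow = c(3, 1), mar = c(2, 4, 2, 1))
  plot(seq_along(col2), pch = 16, col = col2, cex = 3, main = "original")
  for (type in c("deutan", "protan")) {
    # Approximate appearance of the palette under each deficiency
    simulated <- dichromat::dichromat(col2, type = type)
    plot(seq_along(col2), pch = 16, col = simulated, cex = 3, main = type)
  }
  par(op)
}
```

If neighbouring points become hard to tell apart in the simulated panels, the palette is probably a poor choice for that audience.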
Re: [R] help with parallel processing code
I'm not sure what you mean by full code or the iteration. This uses foreach to parallelize the loops over different tuning parameters and resampled data sets.

The only way I could see to split up the parallelism is if you are fitting different models to the same data. In that case, you could launch separate jobs for each model. If the data is large and quickly read from disk, that might be better than storing it in memory and sequentially running models in the same script. We have decent sized machines here, so we launch different jobs per model and then parallelize each (even if it is using 2-3 cores it helps).

Thanks,

Max

On Fri, Oct 28, 2011 at 10:49 AM, 1Rnwb sbpuro...@gmail.com wrote:
> The part of the question that dawned on me now is, should I try to do the parallel processing of the full code or only the iteration part? If it is the full code then I am at the complete mercy of the R help community, or I give up on this and let the computation run the serial way, which has been running since last Saturday.
>
> Sharad

--
Max
Re: [R] Contrasts with an interaction. How does one specify the dummy variables for the interaction
This is failing because it is a saturated model and the contrast package tries to do a t-test (instead of a z test). I can add code to do this, but it will take a few days.

Max

On Fri, Oct 28, 2011 at 2:16 PM, John Sorkin jsor...@grecc.umaryland.edu wrote:
> Forgive my resending this post. To date I have received only one response (thank you Bert Gunter), and I still do not have an answer to my question.
> Respectfully,
> John
>
> Windows XP
> R 2.12.1
> contrast package.
>
> I am trying to understand how to create contrasts for a model that contains an interaction. I can get contrasts to work for a model without interaction, but not after adding the interaction. Please see code below. The last two contrast statements show the problem. I would appreciate someone letting me know what is wrong with the syntax of my contrast statements.
> Thank you,
> John
>
> library(contrast)
>
> # Create 2x2 contingency table.
> counts=c(50,50,30,70)
> row <- gl(2,2,4)
> column <- gl(2,1,4)
> mydata <- data.frame(row,column,counts)
> print(mydata)
>
> # Show levels of 2x2 table
> levels(mydata$row)
> levels(mydata$column)
>
> # Models, no interaction, and interaction
> fitglm0 <- glm(counts ~ row + column, family=poisson(link=log))
> fitglm <- glm(counts ~ row + column + row*column, family=poisson(link=log))
>
> # Contrasts for model without interaction works fine!
> anova(fitglm0)
> summary(fitglm0)
> con0<-contrast(fitglm0,list(row=1,column=1))
> print(con0,X=TRUE)
>
> # Contrast for model with interaction does not work.
> anova(fitglm)
> summary(fitglm)
> con<-contrast(fitglm,list(row=1,column=1))
> print(con,X=TRUE)
>
> # Nor does this work.
> con<-contrast(fitglm,list(row=1,column=1,row:column=c(0,0)))
> print(con,X=TRUE)
>
> John David Sorkin M.D., Ph.D.
> Chief, Biostatistics and Informatics
> University of Maryland School of Medicine Division of Gerontology
> Baltimore VA Medical Center
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> (Phone) 410-605-7119
> (Fax) 410-605-7913 (Please call phone number above prior to faxing)
Re: [R] help with parallel processing code
I have had issues with some parallel backends not finding functions within a namespace for packages listed in the .packages argument or explicitly loaded in the body of the foreach loop. This has occurred with MPI but not with multicore. I can get around this to some extent by calling the functions using the namespace (e.g. foo:::bar) but this is pretty kludgy.

   sessionInfo()
   R version 2.13.2 (2011-09-30)
   Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

   locale:
   [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

   attached base packages:
   [1] stats graphics grDevices utils datasets methods base

   other attached packages:
   [1] doMPI_0.1-5 Rmpi_0.5-9 doMC_1.2.3 multicore_0.1-7 foreach_1.3.2 codetools_0.2-8 iterators_1.0.5

Max

On Thu, Oct 27, 2011 at 4:30 PM, 1Rnwb sbpuro...@gmail.com wrote:
> If I understand correctly, you mean to write the line as below:
>
> foreach(icount(itr),.combine=combine,.options.smp=smpopts,.packages='MASS')%dopar%

--
Max
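The namespace workaround described above can be illustrated with a small sketch. Note the swap: this uses the base `parallel` package with a two-worker socket cluster rather than foreach with an MPI backend, purely because it runs without extra packages; the underlying point is the same, namely that fresh worker processes do not have the caller's packages attached. The worker count and the use of `MASS::mvrnorm` are arbitrary choices for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # two fresh R worker processes; MASS is not attached there

draws <- parLapply(cl, 1:4, function(i) {
  # Namespace-qualified call: works on the worker without library(MASS).
  x <- MASS::mvrnorm(n = 10, mu = c(0, 0), Sigma = diag(2))
  dim(x)
})

stopCluster(cl)
```

For exported functions `pkg::fun` is enough; the `:::` form in the message above is only needed for functions that a package does not export. The alternative is to attach the package on every worker first, for example with `clusterEvalQ(cl, library(MASS))`.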
Re: [R] difference between createPartition and createfold functions
No, it is an argument to createFolds. Type ?createFolds to see the appropriate syntax:

   returnTrain   a logical. When true, the values returned are the sample positions corresponding to the data used during training. This argument only works in conjunction with list = TRUE

On Mon, Oct 3, 2011 at 11:10 AM, bby2...@columbia.edu wrote:
> Hi Max,
>
> Thanks for the note. In your last paragraph, did you mean in createDataPartition? I'm a little vague about what the returnTrain option does.
>
> Bonnie
>
> Quoting Max Kuhn mxk...@gmail.com:
>> Basically, createDataPartition is used when you need to make one or more simple two-way splits of your data. For example, if you want to make a training and test set and keep your classes balanced, this is what you could use. It can also make multiple splits of this kind (or leave-group-out CV aka Monte Carlo CV aka repeated training test splits).
>>
>> createFolds is exclusively for k-fold CV. Their usage is similar when you use the returnTrain = TRUE option in createFolds.
>>
>> Max
>>
>> On Sun, Oct 2, 2011 at 4:00 PM, Steve Lianoglou mailinglist.honey...@gmail.com wrote:
>>> Hi,
>>>
>>> On Sun, Oct 2, 2011 at 3:54 PM, bby2...@columbia.edu wrote:
>>>> Hi Steve,
>>>>
>>>> Thanks for the note. I did try the example and the result didn't make sense to me. For splitting a vector, what you describe is a big difference btw them. For splitting a dataframe, I now wonder if these 2 functions are the wrong choices. They seem to split the columns, at least in the few things I tried.
>>>
>>> Sorry, I'm a bit confused now as to what you are after.
>>>
>>> You don't pass in a data.frame into any of the createFolds/DataPartition functions from the caret package.
>>>
>>> You pass in a *vector* of labels, and these functions tell you which indices into the vector to use as examples to hold out (or keep (depending on the value you pass in for the `returnTrain` argument)) between each fold/partition of your learning scenario (e.g. cross validation with createFolds).
>>>
>>> You would then use these indices to keep (remove) the rows of a data.frame, if that is how you are storing your examples.
>>>
>>> Does that make sense?
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>> | Memorial Sloan-Kettering Cancer Center
>>> | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> --
>> Max

--
Max
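The "vector of labels in, row indices out" workflow described in this thread can be sketched in base R. The helper `make_folds` below is hypothetical and only mimics the idea; unlike caret's functions it does not balance classes across folds. The corresponding caret calls are shown as comments.

```r
# Hypothetical stand-in for createFolds(): returns, per fold, either the
# held-out row positions or (return_train = TRUE) the training positions.
make_folds <- function(y, k = 5, return_train = FALSE) {
  fold_id  <- sample(rep(seq_len(k), length.out = length(y)))
  held_out <- split(seq_along(y), fold_id)
  if (return_train) {
    lapply(held_out, function(i) setdiff(seq_along(y), i))
  } else {
    held_out
  }
}

set.seed(1)
folds <- make_folds(iris$Species, k = 5)

# The indices pick ROWS of the data frame, not columns.
test1  <- iris[folds[[1]], ]
train1 <- iris[-folds[[1]], ]

# With caret attached, the analogous calls would be:
#   folds   <- createFolds(iris$Species, k = 5, list = TRUE, returnTrain = FALSE)
#   inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
#   training <- iris[inTrain, ]; testing <- iris[-inTrain, ]
```

Forgetting the trailing comma (`iris[idx]` instead of `iris[idx, ]`) selects columns, which is one plausible way to end up with the "split the columns" behaviour mentioned in the thread.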
Re: [R] difference between createPartition and createfold functions
Basically, createDataPartition is used when you need to make one or more simple two-way splits of your data. For example, if you want to make a training and test set and keep your classes balanced, this is what you could use. It can also make multiple splits of this kind (or leave-group-out CV aka Monte Carlo CV aka repeated training test splits).

createFolds is exclusively for k-fold CV. Their usage is similar when you use the returnTrain = TRUE option in createFolds.

Max

On Sun, Oct 2, 2011 at 4:00 PM, Steve Lianoglou mailinglist.honey...@gmail.com wrote:
> Hi,
>
> On Sun, Oct 2, 2011 at 3:54 PM, bby2...@columbia.edu wrote:
>> Hi Steve,
>>
>> Thanks for the note. I did try the example and the result didn't make sense to me. For splitting a vector, what you describe is a big difference btw them. For splitting a dataframe, I now wonder if these 2 functions are the wrong choices. They seem to split the columns, at least in the few things I tried.
>
> Sorry, I'm a bit confused now as to what you are after.
>
> You don't pass in a data.frame into any of the createFolds/DataPartition functions from the caret package.
>
> You pass in a *vector* of labels, and these functions tell you which indices into the vector to use as examples to hold out (or keep (depending on the value you pass in for the `returnTrain` argument)) between each fold/partition of your learning scenario (e.g. cross validation with createFolds).
>
> You would then use these indices to keep (remove) the rows of a data.frame, if that is how you are storing your examples.
>
> Does that make sense?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact

--
Max
Re: [R] odfWeave: Combining multiple output statements in a function
formatting.odf, page 7. The results are in formattingOut.odt

On Thu, Sep 15, 2011 at 2:44 PM, Jan van der Laan rh...@eoos.dds.nl wrote:
> Max,
>
> Thank you for your answer. I have had another look at the examples (I already had before mailing the list), but could not find the example you mention. Could you perhaps tell me which example I should have a look at?
>
> Regards,
> Jan
>
> On 09/15/2011 04:47 PM, Max Kuhn wrote:
>> There are examples in the package directory that explain this.
>>
>> On Thu, Sep 15, 2011 at 8:16 AM, Jan van der Laan rh...@eoos.dds.nl wrote:
>>> What is the correct way to combine multiple calls to odfCat, odfItemize, odfTable etc. inside a function?
>>>
>>> As an example, let's say I have a function that needs to write two paragraphs of text and a list to the resulting odf-document (the real function has much more complex logic, but I don't think that's relevant). My first guess would be:
>>>
>>> exampleOutput <- function() {
>>>   odfCat("This is the first paragraph")
>>>   odfCat("This is the second paragraph")
>>>   odfItemize(letters[1:5])
>>> }
>>>
>>> However, calling this function in my odf-document only generates the last list, as only the output of the odfItemize function is returned by exampleOutput. How do I combine the three results into one to be returned by exampleOutput?
>>>
>>> I tried to wrap the calls to the odf* functions into a print statement:
>>>
>>> exampleOutput2 <- function() {
>>>   print(odfCat("This is the first paragraph"))
>>>   print(odfCat("This is the second paragraph"))
>>>   print(odfItemize(letters[1:5]))
>>> }
>>>
>>> In another document this seemed to work, but in my current document strange odf-output is generated.
>>>
>>> Regards,
>>> Jan

--
Max
Re: [R] odfWeave: Combining multiple output statements in a function
There are examples in the package directory that explain this.

On Thu, Sep 15, 2011 at 8:16 AM, Jan van der Laan rh...@eoos.dds.nl wrote:
> What is the correct way to combine multiple calls to odfCat, odfItemize, odfTable etc. inside a function?
>
> As an example, let's say I have a function that needs to write two paragraphs of text and a list to the resulting odf-document (the real function has much more complex logic, but I don't think that's relevant). My first guess would be:
>
> exampleOutput <- function() {
>   odfCat("This is the first paragraph")
>   odfCat("This is the second paragraph")
>   odfItemize(letters[1:5])
> }
>
> However, calling this function in my odf-document only generates the last list, as only the output of the odfItemize function is returned by exampleOutput. How do I combine the three results into one to be returned by exampleOutput?
>
> I tried to wrap the calls to the odf* functions into a print statement:
>
> exampleOutput2 <- function() {
>   print(odfCat("This is the first paragraph"))
>   print(odfCat("This is the second paragraph"))
>   print(odfItemize(letters[1:5]))
> }
>
> In another document this seemed to work, but in my current document strange odf-output is generated.
>
> Regards,
> Jan

--
Max
Re: [R] Trying to extract probabilities in CARET (caret) package with a glmStepAIC model
Can you provide a reproducible example and the results of sessionInfo()? What are the levels of your classes?

On Sat, Aug 27, 2011 at 10:43 PM, Jon Toledo tintin...@hotmail.com wrote:
> Dear developers,
>
> I have just started working with caret and all the nice features it offers. But I just encountered a problem:
>
> I am working with a dataset that includes 4 predictor variables in Descr and a two-category outcome in Categ (coded as a factor). Everything was working fine: I got the results, confusion matrix etc. BUT for obtaining the AUC and predicted probabilities I had to add classProbs = TRUE in the trainControl. Thereafter, every time I run train I get this message:
>
> undefined columns selected
>
> I copy the syntax:
>
> fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE, returnResamp = "all", verboseIter = FALSE)
> glmFit <- train(Descr, Categ, method = "glmStepAIC", tuneLength = 4, trControl = fitControl)
>
> Thank you.
> Best regards,
>
> Jon Toledo, MD
> Postdoctoral fellow
> University of Pennsylvania School of Medicine
> Center for Neurodegenerative Disease Research
> 3600 Spruce Street
> 3rd Floor Maloney Building
> Philadelphia, Pa 19104

--
Max
Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]
David,

The ROC curve should really be computed with some sort of numeric data (as opposed to classes). It varies the cutoff to get a continuum of sensitivity and specificity values. Using the classes as 1's and 2's implies that the second class is twice the value of the first, which doesn't really make sense.

Try getting the class probabilities for predicted1 and predicted2 and use those instead.

Thanks,

Max

On Wed, Jun 1, 2011 at 9:24 PM, jin...@ga.gov.au wrote:
> Please note that predicted1 and predicted2 are two sets of predictions instead of predictors. As you can see, the predictions have only two levels: 1 is for hard and 2 for soft. I need to assess which one is more accurate. Hope this is clear now. Thanks.
> Jin
>
> -----Original Message-----
> From: David Winsemius [mailto:dwinsem...@comcast.net]
> Sent: Thursday, 2 June 2011 10:55 AM
> To: Li Jin
> Cc: R-help@r-project.org
> Subject: Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]
>
> Using AUC for discrete predictor variables with only two levels doesn't seem very sensible. What are you planning to do with this measure?
>
> --
> David.
>
> On Jun 1, 2011, at 8:47 PM, jin...@ga.gov.au wrote:
>> Hi all,
>> I used the following code and data to get auc values for two sets of predictions:
>>
>> library(caret)
>> table(predicted1, trainy)
>>    trainy
>>     hard soft
>>   1   27    0
>>   2   11   99
>> aucRoc(roc(predicted1, trainy))
>> [1] 0.5
>>
>> table(predicted2, trainy)
>>    trainy
>>     hard soft
>>   1   27    2
>>   2   11   97
>> aucRoc(roc(predicted2, trainy))
>> [1] 0.8451621
>>
>> predicted1:
>> 1 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
>>
>> predicted2:
>> 1 1 2 1 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
>>
>> trainy:
>> hard hard hard soft soft hard hard hard hard soft soft soft soft soft soft hard soft soft soft soft soft soft hard soft soft soft soft soft soft soft soft soft hard soft soft soft soft soft hard soft soft soft soft hard hard soft soft soft hard soft hard soft soft soft soft soft hard soft soft soft soft soft soft soft soft hard soft soft soft soft soft hard soft soft soft soft soft soft soft hard soft soft soft hard hard hard hard hard soft soft hard hard hard soft hard soft soft soft hard hard soft soft soft soft soft hard hard hard hard hard hard hard soft soft soft soft soft soft soft soft soft soft soft soft soft soft soft soft hard soft soft soft soft soft soft soft soft
>> Levels: hard soft
>>
>> Sys.info()
>>   sysname   Windows
>>   release   XP
>>   version   build 2600, Service Pack 3
>>   nodename  PC-60772
>>   machine   x86
>>
>> I would expect predicted1 to be more accurate than predicted2, but the auc values show the opposite. I was wondering whether this is a bug or whether I have done something wrong. Thanks for your help in advance!
>>
>> Cheers,
>> Jin
>>
>> Jin Li, PhD
>> Spatial Modeller/Computational Statistician
>> Marine Coastal Environment
>> Geoscience Australia
>> GPO Box 378, Canberra, ACT 2601, Australia
>> Ph: 61 (02) 6249 9899; email: jin...@ga.gov.au
>
> David Winsemius, MD
> West Hartford, CT

--
Max
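To see why hard class predictions give so little for an ROC analysis, here is a small base-R sketch using the two confusion tables posted in this thread. With a single cut-point the ROC "curve" has one interior point, and the trapezoidal area through (0,0), (1 - specificity, sensitivity) and (1,1) reduces to (sensitivity + specificity) / 2. The helper name `single_cut_auc` is hypothetical, and this is not necessarily how caret's aucRoc computes its value.

```r
# Hypothetical helper: area under the one-point ROC curve of a 2x2 table
# with predictions in rows and observed classes in columns.
single_cut_auc <- function(tab, positive = "hard", predicted_positive = "1") {
  negative <- setdiff(colnames(tab), positive)
  sens <- tab[predicted_positive, positive] / sum(tab[, positive])
  spec <- sum(tab[rownames(tab) != predicted_positive, negative]) /
    sum(tab[, negative])
  (sens + spec) / 2
}

# Counts copied from the two table() outputs in the thread.
dn   <- list(predicted = c("1", "2"), trainy = c("hard", "soft"))
tab1 <- matrix(c(27, 11, 0, 99), nrow = 2, dimnames = dn)
tab2 <- matrix(c(27, 11, 2, 97), nrow = 2, dimnames = dn)

auc1 <- single_cut_auc(tab1)  # (27/38 + 99/99) / 2, roughly 0.855
auc2 <- single_cut_auc(tab2)  # (27/38 + 97/99) / 2, roughly 0.845
```

The second value agrees with the 0.8451621 reported above; for predicted1 this formula gives roughly 0.855 rather than the reported 0.5. Either way, a single point says nothing about how the classifiers behave at other cutoffs, which is why class probabilities are the better input.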
Re: [R] issue with odfWeave running on Windows XP; question about installing packages under Linux
Sorry for the delayed response. An upgrade of the XML package has broken odfWeave; see this thread: https://stat.ethz.ch/pipermail/r-help/2011-May/278063.html That may be your issue. We're working on the problem now. I'll post to R-Packages when we have a working update. If you like, I can send you the eventual fixes if you would like to test them. Thanks, Max On Tue, May 17, 2011 at 3:35 PM, rmail...@justemail.net wrote: I also have a problem using odfWeave on Windows XP with R R2.11.1. odfWeave fails, giving mysterious error messages. (Not quite the same as yours, but similar. I sent the info to Max Kuhn privately, but did not get a response after two tries.) My odfWeave reporting system worked fine prior to R2.12 and then the same code that ran fine under R2.11.1 stopped working. Using the very same machine and running the very same code under R2.11.1 it still runs fine today. So, something is not quite right with odfWeave on Windows XP for R R2.11.1, and I don't know what it is. My solution is to keep R2.11.1 around until it can be resolved. Eric - Original message - From: Cormac Long clong...@googlemail.com To: r-help@r-project.org Date: Fri, 13 May 2011 10:45:06 +0100 Subject: [R] issue with odfWeave running on Windows XP; question about installing packages under Linux Good morning R community, I have two questions (and a comment): 1) A problem with odfWeave. I have an odf document with a table that spans multiple pages. Each cell in the table is populated using \sexpr{R stuff}. This worked fine on my own machine (windows 7 box using any R2.x.y, for x=11) and on a colleagues machine (Windows XP box running R2.11.1). 
However, on a third machine (Windows XP box running R2.12.0 or R2.13.0), odfWeave fails with the following error: Error in parse(text = cmd) : text:1:36: unexpected '' 1: GLOBAL_CONTAItext:soft-page-break/ A poke around in the unzipped odt file reveals the culprit: \Sexpr{GLOBAL_CONTAItext:soft-page-break/NER$repDat$Dec[i]} which should read \Sexpr{GLOBAL_CONTAINER$repDat$Dec[i]} The page break coincides with where the table overruns from one page to the next. Now, if this was a constant error across all machines, that would be annoying, but ok. My questions are: a) Can anyone think of a sensible suggestion why has this happened only on one machine, and not on other machines? b) Is there any way of handling such silent xml modifications (apart from odfTable, which I have only just bumped into, or extremely judicious choice of table construction, which is tedious and unreliable)? 2) When installing some packages on linux (notably RODBC and XML), you need to ensure that you linux distro has extra header files installed. This is a particular issue in Ubuntu. The question is: is there any way that a package can check for necessary external header files and issue suitable warnings? For example, if you try to install RODBC on Ubuntu without first installing unixodbc-dev, the installation will fail with the error: configure: error: ODBC headers sql.h and sqlext.h not found which is useful, but not particularly suggestive of requiring unixodbc-dev A further comment on odfWeave: odfWeave uses system calls to zip and unzip when processing the odt documents. Would it not be a good idea for the odfWeave package to check for the presence of zip and unzip utilities and report accordingly when trying to install? By default, Windows XP boxes do not have these utilities installed (installing Rtools does away with this problem). Many thanks in advance, Dr. Cormac Long. 
Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?
XiaoLiu,

I can't see the options you used in bootControl. Your warning is consistent with leaving classProbs and summaryFunction unspecified. Please double check that you set classProbs = TRUE and summaryFunction = twoClassSummary in trainControl() before you ran train().

Max

On Thu, May 12, 2011 at 7:04 PM, Jing Liu quiet_jing0...@hotmail.com wrote:

Dear all,

I am using the caret package for predictor selection with a randomForest model. The following is the train() call:

    rfFit <- train(x = trainRatios, y = trainClass, method = "rf", importance = TRUE, do.trace = 100, keep.inbag = TRUE, tuneGrid = grid, trControl = bootControl, scale = TRUE, metric = "ROC")

I wanted to use ROC as the metric for variable selection. I know that this works with the logit model by making sure that classProbs = TRUE and summaryFunction = twoClassSummary in the trainControl function. However, if I do the same with randomForest, I get a warning saying:

    In train.default(x = trainPred, y = trainDep, method = "rf", : The metric "ROC" was not in the result set. Accuracy will be used instead.

I wonder if the ROC metric can be used for randomForest? Have I missed something? Very very grateful if anyone can help!

Best regards,
XiaoLiu
Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?
Frank,

It depends on how you define optimal. While I'm not a big fan of using the area under the ROC curve to characterize performance, there are a lot of times when likelihood measures are clearly sub-optimal in performance. Using resampled accuracy (or Kappa) instead of deviance (out-of-bag or not) is likely to produce more accurate models (not shocking, right?).

The best example is determining the number of boosting iterations. From Friedman (2001):

``[...] degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality.''

My argument here assumes that you are fitting a model for the purposes of prediction rather than interpretation. This particular case involves random forests, so I'm hoping that statistical inference is not the goal.

Ref: Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001) pp. 1189-1232

Thanks, Max

On Fri, May 13, 2011 at 8:11 AM, Frank Harrell f.harr...@vanderbilt.edu wrote:

Using anything other than deviance (or likelihood) as the objective function will result in a suboptimal model.

Frank

- Frank Harrell, Department of Biostatistics, Vanderbilt University
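The point in the Friedman quote, that likelihood and error rate measure different aspects of fit, can be seen with a toy calculation. This is only a minimal base R sketch with made-up numbers: the vectors y, p_a and p_b are hypothetical and do not come from any model in this thread. Model B has the better (lower) negative log-likelihood but the worse misclassification rate.

```r
# True classes (1 = event, 0 = non-event); purely illustrative values
y <- c(1, 1, 0, 0)

# Hypothetical predicted probabilities of class 1 from two models
p_a <- c(0.60, 0.60, 0.40, 0.40)  # timid, but every case on the right side of 0.5
p_b <- c(0.99, 0.99, 0.01, 0.60)  # very confident, but one case on the wrong side

# Mean negative log-likelihood; lower is better
neg_loglik <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# Misclassification rate at a 0.5 cutoff; lower is better
error_rate <- function(y, p) mean((p > 0.5) != (y == 1))

nll_a <- neg_loglik(y, p_a)
nll_b <- neg_loglik(y, p_b)
err_a <- error_rate(y, p_a)
err_b <- error_rate(y, p_b)

print(c(nll_a = nll_a, nll_b = nll_b, err_a = err_a, err_b = err_b))
```

Which of the two would be called "optimal" depends on whether the likelihood or the error rate is the quantity you care about.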
Re: [R] Bigining with a Program of SVR
As far as caret goes, you should read http://cran.r-project.org/web/packages/caret/vignettes/caretVarImp.pdf and look at rfe() and sbf().

On Fri, May 6, 2011 at 2:53 PM, ypriverol yprive...@gmail.com wrote:

Thanks Max. I'm now using the caret library with my data, but the models showed a correlation under 0.7. Maybe the problem is with the variables that I'm using to generate the model. For that reason I'm asking for packages that allow me to reduce the number of features and to remove the worst ones. I recently read an article that combines a genetic algorithm with support vector regression to do that.

Best Regards
Yasset
Re: [R] Bigining with a Program of SVR
train() uses vectors, matrices and data frames as input. I really think you need to read materials on basic R before proceeding. Go to the R web page. There are introductory materials there.

On Tue, May 3, 2011 at 11:19 AM, ypriverol yprive...@gmail.com wrote:

I saw the format of the caret data some days ago. Is it possible to convert my csv data to the same format as the caret dataset? My idea is to first use the same scripts as the caret tutorial, so that I can rule out problems related to data formats and incompatibilities.

Thanks for your time
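For the csv question, a minimal base R sketch of getting the data into the shapes mentioned above (a data frame of predictors and a numeric vector for the outcome) might look like this. The column names VDP, V1 and V2 and the inline csv text are assumptions for illustration; with a real file you would call read.csv() on its path instead.

```r
# Hypothetical csv content; with a real file use read.csv("your_file.csv")
csv_text <- "VDP,V1,V2
9.15,1234.5,10
9.15,2345.6,15
6.7,789.0,12
6.7,234.6,11
3.2,123.6,5
3.2,235.7,8"

dat <- read.csv(text = csv_text)

# Predictors as a data frame, outcome as a numeric vector
x <- dat[, c("V1", "V2")]
y <- dat$VDP

str(x)
str(y)
```

Objects like x and y are what would then be handed to train() as its first two arguments, along with whatever method you choose.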
Re: [R] Bigining with a Program of SVR
See the examples at the end of: http://cran.r-project.org/web/packages/caret/vignettes/caretTrain.pdf for a QSAR data set for modeling the log blood-brain barrier concentration. SVMs are not used there but, if you use train(), the syntax is very similar.

On Tue, May 3, 2011 at 9:38 AM, ypriverol yprive...@gmail.com wrote:

Well, first of all thanks for your answer. I need some example that works with Support Vector Regression. This is the format of my data:

    VDP    V1      V2
    9.15   1234.5  10
    9.15   2345.6  15
    6.7    789.0   12
    6.7    234.6   11
    3.2    123.6    5
    3.2    235.7    8

VDP is the experimental value of the property that I want to predict with the model, as accurately as possible. The other variables V1, V2 ... are the properties used to generate the model. I need some examples that introduce me to this field. I read some examples from e1071 but all of them are for classification problems.

Thanks for your help in advance
Re: [R] caret - prevent resampling when no parameters to find
Yeah, that didn't work. Use

    fitControl <- trainControl(index = list(seq(along = mdrrClass)))

See ?trainControl to understand what this does in detail.

Max
Re: [R] caret - prevent resampling when no parameters to find
It isn't building the same model since each fit is created from different data sets. The resampling is sort of the point of the function, but if you really want to avoid it, supply your own index in trainControl that has every index (eg, index = seq(along = mdrrClass)). In this case, the performance it gives is the apparent error rate.

Max

On Sun, May 1, 2011 at 5:57 PM, pdb ph...@philbrierley.com wrote:

I want to use caret to build a model with an algorithm that actually has no parameters to find. How do I stop it from repeatedly building the same model 25 times?

    library(caret)
    data(mdrr)
    LOGISTIC_model <- train(mdrrDescr, mdrrClass, method = 'glm', family = binomial(link = "logit"))
    LOGISTIC_model

    528 samples
    342 predictors
    2 classes: 'Active', 'Inactive'

    Pre-processing: None
    Resampling: Bootstrap (25 reps)

    Summary of sample sizes: 528, 528, 528, 528, 528, 528, ...

    Resampling results

      Accuracy  Kappa   Accuracy SD  Kappa SD
      0.552     0.0999  0.0388       0.0776
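To make "apparent error rate" concrete: it is the error you get when the model is evaluated on the same rows it was fit to. The following base R sketch shows the idea without caret. It uses the built-in mtcars data as a stand-in (the mdrr data needs caret), so the variables am, wt and hp are just illustrative choices.

```r
# mtcars ships with R; 'am' (0/1) stands in for a two-class outcome
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Predict the same rows that were used for fitting
pred_class <- as.numeric(predict(fit, type = "response") > 0.5)

# Apparent (resubstitution) accuracy and error rate
apparent_accuracy <- mean(pred_class == mtcars$am)
apparent_error <- 1 - apparent_accuracy

print(c(accuracy = apparent_accuracy, error = apparent_error))
```

Because no rows are held out, this number is usually optimistic compared with a resampled estimate.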
Re: [R] Bigining with a Program of SVR
When you say variable, do you mean predictors or responses? In either case, they do. You can generally tell by reading the help files and looking at the examples.

Max

On Fri, Apr 29, 2011 at 3:47 PM, ypriverol yprive...@gmail.com wrote:

Hi: I'm starting some research on Support Vector Regression. I want to obtain a model to predict a property A from a set of properties B, C, D, ... This problem is very common, for example in QSAR models. I would like to know of some examples and packages that could help me with this. I know about caret and e1071, but I don't know if these packages can work with continuous variables.

Thanks in advance
Re: [R] caret - prevent resampling when no parameters to find
No, the sampling is done on rows. The definition of a bootstrap (re)sample is one which is the same size as the original data but taken with replacement. The Accuracy SD and Kappa SD columns give you a sense of how the model performance varied across these bootstrap data sets (i.e. they are not the same data set). In the end, the original training set is used to fit the final model that is used for prediction.

Max

On Sun, May 1, 2011 at 6:41 PM, pdb ph...@philbrierley.com wrote:

Hi Max,

But in this example, it says the sample size is the same as the total number of samples, so unless the sampling is done by columns, wouldn't you get exactly the same model each time for logistic regression?

ps - great package btw. I'm just beginning to explore its potential now.
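A small base R sketch of what "same size, with replacement" means for the rows. The value n = 528 matches the sample size printed in this thread; the seed and the object names are arbitrary choices for illustration.

```r
set.seed(1)
n <- 528  # number of rows in the training set

# One bootstrap resample: n row indices drawn with replacement
boot_rows <- sample(n, size = n, replace = TRUE)

# Some rows appear more than once, others not at all
n_unique <- length(unique(boot_rows))
held_out <- setdiff(seq_len(n), boot_rows)

cat("rows drawn:", length(boot_rows), "\n")
cat("distinct rows:", n_unique, "\n")
cat("rows left out:", length(held_out), "\n")
```

So each resample has 528 rows but, on average, only about 63% of the distinct original rows, which is why the 25 fits differ; the left-out rows can be used to measure performance for that resample.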
Re: [R] caret - prevent resampling when no parameters to find
Not all modeling functions have both the formula and matrix interface. For example, glm() and rpart() only have a formula method, enet() has only the matrix interface, and ksvm() and others have both. This was one reason I created the package (so we don't have to remember all this).

train() lets you specify the model either way. When the actual model is fit, it favors the matrix interface whenever possible (since it is more efficient) and works out the details behind the scenes.

For your example, you can fit the model you want using train():

    train(mdrrDescr, mdrrClass, method = 'glm')

If y is a factor, it automatically adds the 'family = binomial' option when the model is fit (so you don't have to).

Max

On Sun, May 1, 2011 at 7:18 PM, pdb ph...@philbrierley.com wrote:

glm.fit - answered my own question by reading the manual!
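The difference between the two interfaces can be sketched in base R with glm() (formula) and glm.fit() (design matrix), the function mentioned in the quoted reply. The mtcars data and the chosen columns are stand-ins for illustration only; both calls should give the same coefficients.

```r
# Formula interface: the model is described symbolically
fit_formula <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Matrix interface: you build the design matrix yourself,
# including the intercept column
x <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
fit_matrix <- glm.fit(x = x, y = mtcars$am, family = binomial())

# Compare the estimated coefficients from the two routes
same <- all.equal(unname(coef(fit_formula)),
                  unname(fit_matrix$coefficients))
print(same)
```

The matrix route skips the formula parsing and model-frame construction, which is the efficiency argument made above.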
Re: [R] odfWeave Error unzipping file in Win 7
I don't think that this is the issue, but test it on a file whose name has no spaces.

On Mon, Mar 21, 2011 at 2:25 PM, rmail...@justemail.net wrote:

I have a very similar error that cropped up when I upgraded to R 2.12 and persists at R 2.12.1. I am running R on Windows XP and OO is at version 3.2. I did not make any changes to my R code or ODF code or configuration to produce this error. Only upgraded R.

Many Thanks, Eric

R session:

    odfWeave('Report input template.odt', 'August 2011.odt')
    Copying Report input template.odt
    Setting wd to C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483
    Unzipping ODF file using unzip -o Report input template.odt
    Error in odfWeave("Report input template.odt", "August 2011.odt") : Error unzipping file

When I start a shell, go to the temp directory in question, and copy the exact command that the error message says produced an error, the command runs fine. Here is that session:

    Microsoft Windows XP [Version 5.1.2600]
    (C) Copyright 1985-2001 Microsoft Corp.

    H:\>c:
    C:\>cd C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483
    C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>dir
     Volume in drive C has no label.
     Volume Serial Number is 7464-62CA

     Directory of C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483

    03/21/2011  11:11 AM    <DIR>          .
    03/21/2011  11:11 AM    <DIR>          ..
    03/21/2011  11:11 AM            13,780 Report input template.odt
                   1 File(s)         13,780 bytes
                   2 Dir(s)   7,987,343,360 bytes free

    C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>unzip -o Report input template.odt
    Archive:  Report input template.odt
     extracting: mimetype
       creating: Configurations2/statusbar/
      inflating: Configurations2/accelerator/current.xml
       creating: Configurations2/floater/
       creating: Configurations2/popupmenu/
       creating: Configurations2/progressbar/
       creating: Configurations2/menubar/
       creating: Configurations2/toolbar/
       creating: Configurations2/images/Bitmaps/
      inflating: content.xml
      inflating: manifest.rdf
      inflating: styles.xml
     extracting: meta.xml
      inflating: Thumbnails/thumbnail.png
      inflating: settings.xml
      inflating: META-INF/manifest.xml

    C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>

- Original message -
From: psycho-ld battlecry...@web.de
To: r-help@r-project.org
Date: Sun, 23 Jan 2011 01:47:44 -0800 (PST)
Subject: [R] odfWeave Error unzipping file in Win 7

Hey guys,

I'm just getting started with R (version 2.12.0) and odfWeave and kinda stumble from one problem to the next. The current one is the following, when trying to use odfWeave:

    > odfctrl <- odfWeaveControl(
    +   zipCmd = c("C:/Program Files/unz552dN/VBunzip.exe $$file$$ .",
    +              "C:/Program Files/unz552dN/VBunzip.exe $$file$$"))
    > odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl)
    Copying C:/testat.odt
    Setting wd to D:\Users\egf\AppData\Local\Temp\Rtmpmp4E1J/odfWeave23103351832
    Unzipping ODF file using C:/Program Files/unz552dN/VBunzip.exe testat.odt
    Fehler [Error] in odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl) : Error unzipping file

So I tried a few other unzipping programs like jar and 7-zip, but the same problem still occurs. I also tried to install zip and unzip, but then I get an error message that registration failed (Error 1904).

If there are any more questions, just ask. It would be great if someone could help me though.

cheers
psycho-ld
Re: [R] Specify feature weights in model prediction (CARET)
> Using the 'CARET' package, is it possible to specify weights for features used in model prediction?

For what model?

> And for the 'knn' implementation, is there a way to choose a distance metric (i.e. Mahalanobis distance)?

No, sorry.

Max
Re: [R] use caret to rank predictors by random forest model
It would help if you provided the code that you used for the caret functions. The most likely issue is not using importance = TRUE in the call to train().

I believe that I've only implemented code for plotting the varImp objects resulting from train() (eg. there is plot.varImp.train but not plot.varImp).

Max

On Mon, Mar 7, 2011 at 3:27 PM, Xiaoqi Cui x...@mtu.edu wrote:

Hi,

I'm using package caret to rank predictors using a random forest model and to draw a predictor importance plot. I used the commands below:

    rf.fit <- randomForest(x, y, ntree = 500, importance = TRUE)
    ## x is a matrix whose columns are predictors, y is a binary response vector
    ## Then I got the ranked predictors by ranking rf1$importance[, "MeanDecreaseAccuracy"]
    ## Then draw the importance plot
    varImpPlot(rf.fit)

As you can see, all the functions I used are directly from the package randomForest, instead of from caret, so I'm wondering if the package caret has some functions that can do the above ranking and plotting. In fact, I tried the functions train, varImp and plot from package caret, but the random forest model built by train cannot be passed correctly to varImp, which gave an error message like "subscript out of bounds". The plot function doesn't work either. So I'm wondering if anybody has encountered the same problem before and could shed some light on this. I would really appreciate your help.

Thanks,
Xiaoqi
[R] Course: R for Predictive Modeling: A Hands-On Introduction
R for Predictive Modeling: A Hands-On Introduction
Predictive Analytics World in San Francisco
Sunday March 13, 9am to 4:30pm

This one-day session provides a hands-on introduction to R, the well-known open-source platform for data analysis. Real examples are employed in order to methodically expose attendees to best practices driving R and its rich set of predictive modeling packages, providing hands-on experience and know-how. R is compared to other data analysis platforms, and common pitfalls in using R are addressed.
Re: [R] ROC from R-SVM?
The objective functions for kernel methods are unrelated to the area under the ROC curve. However, you can try to choose the cost and kernel parameters to maximize the ROC AUC. See the caret package, specifically the train function.

Max

On Mon, Feb 21, 2011 at 5:34 PM, Angel Russo angerusso1...@gmail.com wrote:

Hi,

Does anyone know how I can show an ROC curve for R-SVM? I understand that in R-SVM we are not optimizing over the SVM cost parameter. Any example ROC code for R-SVM, or guidance, would be really useful.

Thanks,
Angel.
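For reference, the ROC AUC itself is easy to compute from class labels and any continuous score (for example an SVM decision value), so it can be evaluated for each candidate cost or kernel setting. Here is a minimal base R sketch using the rank (Mann-Whitney) formulation; the function name auc and the four toy scores are made up for illustration.

```r
# AUC via the rank-sum formulation: the probability that a randomly
# chosen positive case scores higher than a randomly chosen negative one
auc <- function(y, score) {
  r  <- rank(score)   # average ranks are used for ties
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy labels and scores (hypothetical)
y     <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

auc_value <- auc(y, score)
print(auc_value)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Computing this on held-out predictions for each tuning setting and keeping the setting with the largest value is the kind of search described above.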
Re: [R] Random Forest Cross Validation
> I am using randomForest package to do some prediction job on GWAS data. I firstly split the data into training and testing set (70% vs 30%), then using training set to grow the trees (ntree=10). It looks that the OOB error in training set is good (10%). However, it is not very good for the test set with a AUC only about 50%.

Did you do any feature selection in the training set? If so, you also need to include that step in the cross-validation to get realistic performance estimates (see Ambroise and McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences (2002) vol. 99 (10) pp. 6562-6566).

In the caret package, train() can be used to get cross-validation estimates for RF and the sbf() function (for selection by filter) can be used to include simple univariate filters in the CV procedure.

> Although some people said no cross-validation was necessary for RF, I still felt unsafe and thought a testing set is important. I felt really frustrated with the results.

CV is needed when you want an assessment of performance on a test set. In this sense, RF is like any other method.

-- Max
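The selection bias issue can be reproduced with pure noise. The sketch below is base R only and is not the caret sbf() procedure: the helper names (top_features, predict_centroid, cv_accuracy), the nearest-centroid classifier and the sizes n, p and k are all assumptions chosen for a quick illustration. The labels are unrelated to the predictors, so an honest accuracy estimate should sit near 50%.

```r
set.seed(1)
n <- 50; p <- 1000; k <- 20
x <- matrix(rnorm(n * p), nrow = n)  # predictors: pure noise
y <- rep(c(0, 1), each = n / 2)      # labels unrelated to x

# Simple univariate filter: keep the k features with the largest
# absolute difference in class means
top_features <- function(x, y, k) {
  d <- abs(colMeans(x[y == 1, , drop = FALSE]) -
           colMeans(x[y == 0, , drop = FALSE]))
  order(d, decreasing = TRUE)[seq_len(k)]
}

# Nearest-centroid classifier
predict_centroid <- function(x_train, y_train, x_test) {
  c0 <- colMeans(x_train[y_train == 0, , drop = FALSE])
  c1 <- colMeans(x_train[y_train == 1, , drop = FALSE])
  d0 <- rowSums(sweep(x_test, 2, c0)^2)
  d1 <- rowSums(sweep(x_test, 2, c1)^2)
  as.numeric(d1 < d0)
}

folds <- sample(rep(1:5, length.out = n))

# 5-fold CV accuracy; the filter runs either once on all rows (biased)
# or again inside every training fold (honest)
cv_accuracy <- function(select_inside) {
  keep_all <- top_features(x, y, k)
  correct <- 0
  for (f in 1:5) {
    test <- folds == f
    keep <- if (select_inside) {
      top_features(x[!test, , drop = FALSE], y[!test], k)
    } else {
      keep_all
    }
    pred <- predict_centroid(x[!test, keep, drop = FALSE], y[!test],
                             x[test, keep, drop = FALSE])
    correct <- correct + sum(pred == y[test])
  }
  correct / n
}

acc_biased <- cv_accuracy(select_inside = FALSE)
acc_honest <- cv_accuracy(select_inside = TRUE)
print(c(filter_outside_cv = acc_biased, filter_inside_cv = acc_honest))
```

When the filter sees the held-out rows, the estimate looks far better than chance even though there is nothing to learn, which is the same pattern as a good training estimate followed by a test AUC near 50%.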
Re: [R] caret::train() and ctree()
Andrew,

ctree only tunes over mincriterion and ctree2 tunes over maxdepth (while fixing mincriterion = 0). Seeing both listed as the function is being executed is a bug. I'll set up checks to make sure that the columns specified in tuneGrid are actually the tuning parameters that are used.

Max

On Wed, Feb 16, 2011 at 12:01 PM, Andrew Ziem az...@us.ci.org wrote:

Like earth can be trained simultaneously for degree and nprune, is there a way to train ctree simultaneously for mincriterion and maxdepth?

Also, I notice there are separate methods ctree and ctree2, and if both options are tuned with one method, the summary averages the option it doesn't support. The full log is attached; notice these lines below for method = "ctree", where maxdepth = c(2, 4) is averaged to maxdepth = 3.

    Fitting: maxdepth=2, mincriterion=0.95
    Fitting: maxdepth=4, mincriterion=0.95
    Fitting: maxdepth=2, mincriterion=0.99
    Fitting: maxdepth=4, mincriterion=0.99

    mincriterion  Accuracy  Kappa  maxdepth  Accuracy SD  Kappa SD  maxdepth SD
    0.95          0.939     0.867  3         0.0156       0.0337    1.01
    0.99          0.94      0.868  3         0.0157       0.0337    1.01

I use R 2.12.1 and caret 4.78.

Andrew
Re: [R] Train error:: subscript out of bonds
Sort of. It lets you define a grid of candidate values to test and to define the rule to choose the best. For some models, it is easy to come up with default values that work well (e.g. RBF SVMs, PLS, KNN) while others are more data dependent. In the latter case, the defaults may not work well.

Max

On Wed, Jan 26, 2011 at 5:45 AM, Neeti nikkiha...@gmail.com wrote:

What I have understood about the caret train() method is that train() itself does the model selection and tunes the parameters (please correct me if I am wrong). That was my first motivation for selecting this package and method for fitting the model. I then use the parameters in the e1071 svm() method and compare the results.

    fit1 <- train(train1, as.factor(trainset[, ncol(trainset)]), "svmpoly", trControl = trainControl((method = "cv"), 10, verboseIter = F), tuneLength = 3)
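The "grid of candidates plus a rule for picking the best" idea can be sketched outside caret. This base R example is not how train() is implemented; it only mimics the logic on simulated data, with polynomial degree as the single tuning parameter, 5-fold cross-validation as the resampling scheme, and "lowest RMSE wins" as the selection rule. All names here are hypothetical.

```r
set.seed(42)
n <- 100
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.3)   # the true relationship is quadratic
dat <- data.frame(x = x, y = y)

# Grid of candidate tuning parameter values
grid <- data.frame(degree = 1:4)

# Fold assignment shared by all candidates
folds <- sample(rep(1:5, length.out = n))

# Cross-validated RMSE for one candidate value
cv_rmse <- function(degree) {
  sq_err <- numeric(n)
  for (f in 1:5) {
    test <- folds == f
    fit <- lm(y ~ poly(x, degree), data = dat[!test, ])
    sq_err[test] <- (dat$y[test] - predict(fit, newdata = dat[test, ]))^2
  }
  sqrt(mean(sq_err))
}

# Evaluate every candidate, then apply the selection rule
grid$RMSE <- sapply(grid$degree, cv_rmse)
best <- grid[which.min(grid$RMSE), ]

print(grid)
print(best)
```

With a default grid that does not cover the useful range for a given data set, the best candidate in the grid can still be a poor model, which is the caveat about data-dependent defaults.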
Re: [R] Train error:: subscript out of bonds
No. Any valid seed should work. In this case, train() should only be using it to determine which training set samples are in the CV or bootstrap data sets.

Max

On Wed, Jan 26, 2011 at 9:56 AM, Neeti nikkiha...@gmail.com wrote:

Thank you so much for your reply. In my case it gives an error for some seed values; for example, if I set the seed to 357 I get an error. Does train have some specific seed range?
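What the seed controls can be illustrated in base R: it fixes the random assignment of rows to folds, so the same seed gives the same split and a different seed gives (almost surely) a different one. The helper make_folds is hypothetical and is not a caret function.

```r
# Assign n rows to k folds at random, reproducibly for a given seed
make_folds <- function(seed, n, k = 5) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

n <- 20
f1 <- make_folds(357, n)
f2 <- make_folds(357, n)  # same seed: identical split
f3 <- make_folds(358, n)  # different seed: a different split is expected

print(identical(f1, f2))
print(table(f1))          # each fold gets n / k = 4 rows
```

A seed that "fails" therefore points to a particular split of the data (for example a fold that happens to miss a class), not to a problem with the seed value itself.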
Re: [R] Train error:: subscript out of bonds
What version of caret and R? We'll also need a reproducible example.

On Mon, Jan 24, 2011 at 12:44 PM, Neeti nikkiha...@gmail.com wrote:

Hi,

I am trying to construct a svmpoly model using the caret package (please see code below). Using the same data, without changing any setting, I am just changing the seed value. Sometimes it constructs the model successfully, and sometimes I get an “Error in indexes[[j]] : subscript out of bounds”. For example, when I set the seed to 357 the following code produced results for only 8 iterations, and on the 9th iteration it stopped with the “subscript out of bounds” error. I don't understand why. Any help would be great, thanks.

    for (i in 1:10) {
      fit1 <- NULL
      x <- NULL
      x <- which(number == i)
      trainset <- d[-x, ]
      testset <- d[x, ]
      train1 <- trainset[, -ncol(trainset)]
      train1 <- train1[, -(1)]
      test_t <- testset[, -ncol(testset)]
      species_test <- as.factor(testset[, ncol(testset)])
      test_t <- test_t[, -(1)]
      # CARET::TRAIN
      fit1 <- train(train1, as.factor(trainset[, ncol(trainset)]), "svmpoly", trControl = trainControl((method = "cv"), 10, verboseIter = F), tuneLength = 3)
      pred <- predict(fit1, test_t)
      t_train[[i]] <- table(predicted = pred, observed = testset[, ncol(testset)])
      tune_result[[i]] <- fit1$results
      tune_best <- fit1$bestTune
      scale1[i] <- tune_best[[3]]
      degree[i] <- tune_best[[2]]
      c1[i] <- tune_best[[1]]
    }
Re: [R] circular reference lines in splom
This did the trick:

    panel.circ3 <- function(...) {
      args <- list(...)
      circ1 <- ellipse(diag(rep(1, 2)), t = 1)
      panel.xyplot(circ1[,1], circ1[,2], type = "l",
                   lty = trellis.par.get("reference.line")$lty,
                   col = trellis.par.get("reference.line")$col,
                   lwd = trellis.par.get("reference.line")$lwd)
      circ2 <- ellipse(diag(rep(1, 2)), t = 2)
      panel.xyplot(circ2[,1], circ2[,2], type = "l",
                   lty = trellis.par.get("reference.line")$lty,
                   col = trellis.par.get("reference.line")$col,
                   lwd = trellis.par.get("reference.line")$lwd)
      panel.xyplot(args$x, args$y, groups = args$groups, subscripts = args$subscripts)
    }
    splom(~dat, groups = grps, lower.panel = panel.circ3, upper.panel = panel.circ3)

Thanks, Max

On Thu, Jan 20, 2011 at 11:13 AM, Peter Ehlers ehl...@ucalgary.ca wrote:

On 2011-01-19 20:15, Max Kuhn wrote:

> Hello everyone,
>
> I'm stumped. I'd like to create a scatterplot matrix with circular reference lines. Here is an example in 2d:
>
>     library(ellipse)
>     set.seed(1)
>     dat <- matrix(rnorm(300), ncol = 3)
>     colnames(dat) <- c("X1", "X2", "X3")
>     dat <- as.data.frame(dat)
>     grps <- factor(rep(letters[1:4], 25))
>     panel.circ <- function(x, y, ...) {
>       circ1 <- ellipse(diag(rep(1, 2)), t = 1)
>       panel.xyplot(circ1[,1], circ1[,2], type = "l", lty = 2)
>       circ2 <- ellipse(diag(rep(1, 2)), t = 2)
>       panel.xyplot(circ2[,1], circ2[,2], type = "l", lty = 2)
>       panel.xyplot(x, y)
>     }
>     xyplot(X2 ~ X1, data = dat, panel = panel.circ, aspect = 1)
>
> I'd like to do the same with splom, but with groups. My latest attempt:
>
>     panel.circ2 <- function(x, y, groups, ...) {
>       circ1 <- ellipse(diag(rep(1, 2)), t = 1)
>       panel.xyplot(circ1[,1], circ1[,2], type = "l", lty = 2)
>       circ2 <- ellipse(diag(rep(1, 2)), t = 2)
>       panel.xyplot(circ2[,1], circ2[,2], type = "l", lty = 2)
>       panel.xyplot(x, y, type = "p", groups)
>     }
>     splom(~dat, panel = panel.superpose, panel.groups = panel.circ2)
>
> produces nothing but warnings:
>
>     > warnings()
>     Warning messages:
>     1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
>
> It does not appear to me that panel.circ2 is even being called.
>
> Thanks, Max

I don't see a function panel.groups() in lattice. Does this do what you want or am I missing the point:

    splom(~dat|grps, panel = panel.circ2)

Peter Ehlers
[R] circular reference lines in splom
Hello everyone,

I'm stumped. I'd like to create a scatterplot matrix with circular reference lines. Here is an example in 2d:

  library(ellipse)

  set.seed(1)
  dat <- matrix(rnorm(300), ncol = 3)
  colnames(dat) <- c("X1", "X2", "X3")
  dat <- as.data.frame(dat)
  grps <- factor(rep(letters[1:4], 25))

  panel.circ <- function(x, y, ...) {
    circ1 <- ellipse(diag(rep(1, 2)), t = 1)
    panel.xyplot(circ1[,1], circ1[,2], type = "l", lty = 2)
    circ2 <- ellipse(diag(rep(1, 2)), t = 2)
    panel.xyplot(circ2[,1], circ2[,2], type = "l", lty = 2)
    panel.xyplot(x, y)
  }

  xyplot(X2 ~ X1, data = dat, panel = panel.circ, aspect = 1)

I'd like to do the same with splom, but with groups. My latest attempt:

  panel.circ2 <- function(x, y, groups, ...) {
    circ1 <- ellipse(diag(rep(1, 2)), t = 1)
    panel.xyplot(circ1[,1], circ1[,2], type = "l", lty = 2)
    circ2 <- ellipse(diag(rep(1, 2)), t = 2)
    panel.xyplot(circ2[,1], circ2[,2], type = "l", lty = 2)
    panel.xyplot(x, y, type = "p", groups)
  }

  splom(~dat, panel = panel.superpose, panel.groups = panel.circ2)

produces nothing but warnings:

  warnings()
  Warning messages:
  1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

It does not appear to me that panel.circ2 is even being called.

Thanks,

Max

  sessionInfo()
  R version 2.11.1 Patched (2010-09-30 r53356)
  Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

  locale:
  [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base

  other attached packages:
  [1] lattice_0.19-11 ellipse_0.3-5

  loaded via a namespace (and not attached):
  [1] grid_2.11.1  tools_2.11.1

--
Max
[R] less than full rank contrast methods
I'd like to make a less than full rank design using dummy variables for factors. Here is some example data:

  when <- data.frame(time = c("afternoon", "night", "afternoon",
                              "morning", "morning", "morning",
                              "morning", "afternoon", "afternoon"),
                     day = c("Monday", "Monday", "Monday",
                             "Wednesday", "Wednesday", "Friday",
                             "Saturday", "Saturday", "Friday"))

For a single factor, I can do this using

  head(model.matrix(~time -1, data = when))
    timeafternoon timemorning timenight
  1             1           0         0
  2             0           0         1
  3             1           0         0
  4             0           1         0
  5             0           1         0
  6             0           1         0

but this breaks down for multi-variable formulas such as time + day or time + day + time:day: only the first factor gets a column for every level.

I've looked for alternate contrast functions to do this and I haven't figured out a way to coerce existing functions to get the desired output. Hopefully I haven't missed anything obvious.

Thanks,

Max

  sessionInfo()
  R version 2.11.1 Patched (2010-09-11 r52910)
  Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

  locale:
  [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base
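One possible approach, offered as a sketch rather than as an answer given in this thread: `contrasts()` called with `contrasts = FALSE` returns an identity (one column per level) coding, and a list of such matrices can be handed to `model.matrix()` through its `contrasts.arg` argument. The object names `full_coding` and `X` are made up for illustration, and the columns are wrapped in `factor()` on the assumption that strings are not converted to factors automatically.

```r
# Example data from the post, with explicit factors
when <- data.frame(
  time = factor(c("afternoon", "night", "afternoon", "morning", "morning",
                  "morning", "morning", "afternoon", "afternoon")),
  day  = factor(c("Monday", "Monday", "Monday", "Wednesday", "Wednesday",
                  "Friday", "Saturday", "Saturday", "Friday"))
)

# One identity "contrast" matrix per factor: a column for every level
full_coding <- lapply(when, contrasts, contrasts = FALSE)

# Intercept + 3 time indicators + 4 day indicators
X <- model.matrix(~ time + day, data = when, contrasts.arg = full_coding)

colnames(X)
ncol(X)       # number of columns in the design
qr(X)$rank    # numerical rank, smaller than ncol(X) for this coding
```

The same `contrasts.arg` list should also work with a formula that includes `time:day`, although the number of redundant columns grows quickly in that case.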
Re: [R] Sporadic errors when training models using CARET
Kendric,

I've seen these too, and traceback() usually goes back to ksvm(). This doesn't mean that the error is there, but the results of traceback() from you would be helpful.

thanks,

Max

On Mon, Nov 22, 2010 at 6:18 PM, Kendric Wang kendr...@interchange.ubc.ca wrote:
> Hi. I am trying to construct a svmLinear model using the caret package
> (see code below). Using the same data, without changing any setting,
> sometimes it constructs the model successfully, and sometimes I get an
> index out of bounds error. Is this unexpected behaviour? I would
> appreciate any insights into this issue. Thanks.
>
> ~Kendric
>
>   train.y
>    [1] S S S S R R R R R R R R R R R R R R R R R R R R
>   Levels: R S
>
>   train.x
>            m1      m2
>   1    0.1756  0.6502
>   2    0.1110 -0.2217
>   3    0.0837 -0.1809
>   4   -0.3703 -0.2476
>   5    8.3825  2.8814
>   6    5.6400 12.9922
>   7    7.5537  7.4809
>   8    3.5005  5.7844
>   9   16.8541 16.6326
>   10   9.1851  8.7814
>   11   1.4405 11.0132
>   12   9.8795  2.6182
>   13   8.7151  4.5476
>   14  -0.2092 -0.7601
>   15   3.6876  2.5772
>   16   8.3776  5.0882
>   17   8.6567  7.2640
>   18  20.9386 20.1107
>   19  12.2903  4.7864
>   20  10.5920  7.5204
>   21  10.2679  9.5493
>   22   6.2023 11.2333
>   23  -5.0720 -4.8701
>   24   6.6417 11.5139
>
>   svmLinearGrid <- expand.grid(.C=0.1)
>
>   svmLinearFit <- train(train.x, train.y, method="svmLinear",
>                         tuneGrid=svmLinearGrid)
>   Fitting: C=0.1
>   Error in indexes[[j]] : subscript out of bounds
>
>   svmLinearFit <- train(train.x, train.y, method="svmLinear",
>                         tuneGrid=svmLinearGrid)
>   Fitting: C=0.1
>   maximum number of iterations reached 0.0005031579 0.0005026807
>   maximum number of iterations reached 0.0002505857 0.0002506714
>   Error in indexes[[j]] : subscript out of bounds
>
>   svmLinearFit <- train(train.x, train.y, method="svmLinear",
>                         tuneGrid=svmLinearGrid)
>   Fitting: C=0.1
>   maximum number of iterations reached 0.0003270061 0.0003269764
>   maximum number of iterations reached 7.887867e-05 7.866367e-05
>   maximum number of iterations reached 0.0004087571 0.0004087466
>   Aggregating results
>   Selecting tuning parameters
>   Fitting model on full training set
>
>   R version 2.11.1 (2010-05-31)
>   x86_64-redhat-linux-gnu
>
>   locale:
>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>    [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>   attached base packages:
>   [1] splines stats graphics grDevices utils datasets methods
>   [8] base
>
>   other attached packages:
>   [1] kernlab_0.9-12 pamr_1.47 survival_2.35-8 cluster_1.12.3
>   [5] e1071_1.5-24 class_7.3-2 caret_4.70 reshape_0.8.3
>   [9] plyr_1.2.1 lattice_0.18-8
>
>   loaded via a namespace (and not attached):
>   [1] grid_2.11.1
>
> --
> MSc. Candidate
> CIHR/MSFHR Training Program in Bioinformatics
> University of British Columbia

--
Max
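For an error that only shows up on some runs, one way to collect the call stack that traceback() would report is to record it at the moment the error is signalled. The sketch below is base R only; `flaky()` and `capture_calls()` are hypothetical names invented for this illustration and have nothing to do with the internals of caret or kernlab.

```r
# A stand-in for code that fails with the same message as in the report
flaky <- function(i) {
  indexes <- list(1, 2)
  indexes[[i]]            # fails when i is larger than 2
}

# Evaluate an expression; if it errors, keep both the condition and the
# call stack that was active when the error was signalled
capture_calls <- function(expr) {
  calls <- NULL
  result <- tryCatch(
    withCallingHandlers(
      expr,
      error = function(e) calls <<- as.list(sys.calls())
    ),
    error = function(e) e
  )
  list(result = result, calls = calls)
}

out <- capture_calls(flaky(3))
conditionMessage(out$result)   # the error message
out$calls                      # the stack, innermost calls last
```

In an interactive session, calling `traceback()` immediately after the failing `train()` call gives the same kind of information; fixing the random seed with `set.seed()` beforehand may help make the failure repeatable.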
Re: [R] cross validation using e1071:SVM
Neeti,

I'm pretty sure that the error is related to the confusionMatrix call, which is in the caret package, not e1071.

The error message is pretty clear: you need to pass in two factor objects that have the same levels. You can check by running the commands:

  str(pred_true1)
  str(species_test)

Also, caret can do the resampling for you instead of you writing the loop yourself.

Max
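As an illustration of the level mismatch described above, here is a small base R sketch. The object names `pred_true1` and `species_test` come from the thread, but the values are invented; `pred_aligned` is a name made up for this example.

```r
# Invented data: the reference has three classes, the predictions only two
species_test <- factor(c("setosa", "versicolor", "virginica", "setosa"))
pred_true1   <- factor(c("setosa", "versicolor", "versicolor", "setosa"))

levels(pred_true1)      # two levels
levels(species_test)    # three levels
identical(levels(pred_true1), levels(species_test))

# Re-create the predictions using the reference's levels
pred_aligned <- factor(pred_true1, levels = levels(species_test))
identical(levels(pred_aligned), levels(species_test))

# A cross-tabulation now has the same classes on both margins
tab <- table(predicted = pred_aligned, observed = species_test)
tab
```

Once both factors share the same levels in the same order, they are in the form the reply says the confusion matrix call expects.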
Re: [R] odfWeave - Format error discovered in the file in sub-document content.xml at 2, 4047 (row, col)
Can you try it with version 7.16 on R-Forge? Use

  install.packages("odfWeave", repos = "http://R-Forge.R-project.org")

to get it.

Thanks,

Max

On Tue, Nov 16, 2010 at 8:26 AM, Søren Højsgaard soren.hojsga...@agrsci.dk wrote:
> Dear Mike,
>
> Good point - thanks. The lines that caused the error mentioned above
> are simply:
>
>   <<>>=
>   x <- 1:10
>   x
>   @
>
> I could add that the document 'simple.odt' (which comes with odfWeave)
> causes the same error - but at row=109, col=1577
>
>   sessionInfo()
>   R version 2.12.0 (2010-10-15)
>   Platform: x86_64-pc-mingw32/x64 (64-bit)
>
>   locale:
>   [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
>       LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
>       LC_TIME=Danish_Denmark.1252
>
>   attached base packages:
>   [1] grid stats graphics grDevices utils datasets methods base
>
>   other attached packages:
>   [1] MASS_7.3-8 odfWeave_0.7.14 XML_3.2-0.1 lattice_0.19-13
>
>   loaded via a namespace (and not attached):
>   [1] tools_2.12.0
>
> Regards
> Søren
>
> -----Original Message-----
> From: Mike Marchywka [mailto:marchy...@hotmail.com]
> Sent: 16 November 2010 12:56
> To: Søren Højsgaard; r-h...@stat.math.ethz.ch
> Subject: RE: [R] odfWeave - Format error discovered in the file in
>   sub-document content.xml at 2, 4047 (row, col)
>
>> From: soren.hojsga...@agrsci.dk
>> To: r-h...@stat.math.ethz.ch
>> Date: Tue, 16 Nov 2010 11:32:06 +0100
>> Subject: [R] odfWeave - Format error discovered in the file in
>>   sub-document content.xml at 2, 4047 (row, col)
>>
>> When using odfWeave on an OpenOffice input document, I can not open
>> the output document. I get the message
>>
>>   Format error discovered in the file in sub-document content.xml
>>   at 2,4047 (row,col)
>>
>> Can anyone help me on this? (Apologies if this has been discussed
>> before; I have not been able to find any info...)
>
> well, if it really means line 2 you could post the first few lines.
> Did you expect a line with 4047 columns?
>
>> Info: I am using R.2.12.0 on Windows 7 (64 bit). I have downloaded
>> the XML package from
>> http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.12/ and I
>> have compiled odfWeave myself
>>
>> Best regards
>> Søren

--
Max
Re: [R] to determine the variable importance in svm
The caret package has answers to all your questions.

> 1) How to obtain a variable (attribute) importance using e1071:SVM (or
> other svm methods)?

I haven't implemented a model-specific method for variable importance for SVM models. I know of one package (svmpath) that will return the regression coefficients (e.g. the \beta values of x'\beta) for two class models. There are probably other methods for non-linear kernels, but I haven't coded anything (any volunteers?).

When there is no variable importance method implemented for classification models, caret calculates an ROC curve for each predictor and returns the AUC. For 3+ classes, it returns the maximum AUC on the one-vs-all ROC curves.

Note also that caret uses ksvm in kernlab for no other reason than that it has a bunch of available kernels and similar methods (rvm, etc).

> 2) how to validate the results of svm?

If you use caret, you can look at:

  http://user2010.org/slides/Kuhn.pdf
  http://www.jstatsoft.org/v28/i05

and the four package vignettes.

Max
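The per-predictor ROC idea described above can be sketched in a few lines of base R. This is only an illustration of the concept, not caret's code: the helper name `auc_one()` is made up, the AUC is computed from ranks (the Mann-Whitney form), and two of the three iris species are used so that the problem has two classes.

```r
# AUC for a single numeric predictor against a logical class indicator,
# computed from ranks; values below 0.5 are flipped so that the
# direction of the predictor does not matter
auc_one <- function(x, is_event) {
  r  <- rank(x)
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  a  <- (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(a, 1 - a)
}

# Two-class subset of iris
two_class <- droplevels(iris[iris$Species != "setosa", ])
is_event  <- two_class$Species == "virginica"

# One AUC per predictor, used as a model-free importance score
importance <- sapply(two_class[, 1:4], auc_one, is_event = is_event)
sort(importance, decreasing = TRUE)
```

For three or more classes, the reply says the maximum over the one-vs-all curves is used; the same helper could be applied once per class and the results combined with `max()`.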
Re: [R] Random Forest AUC
Ravishankar,

> I used Random Forest with a couple of data sets I had to predict for
> binary response. In all the cases, the AUC of the training set is
> coming to be 1. Is this always the case with random forests? Can
> someone please clarify this?

This is pretty typical for this model.

> I have given a simple example, first using logistic regression and
> then using random forests to explain the problem. AUC of the random
> forest is coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so the area under the ROC curve is likely to be less than one, but still much higher than it really is (since you are re-predicting the same data).

For your example:

  performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
  [1] 0.9972

but using simple 10-fold CV:

  library(caret)
  ctrl <- trainControl(method = "cv",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

  set.seed(1)
  cvEstimate <- train(Species ~ ., data = iris,
                      method = "glm",
                      metric = "ROC",
                      trControl = ctrl)
  Fitting: parameter=none
  Aggregating results
  Fitting model on full training set
  Warning messages:
  1: glm.fit: fitted probabilities numerically 0 or 1 occurred
  2: glm.fit: algorithm did not converge
  3: glm.fit: fitted probabilities numerically 0 or 1 occurred
  4: glm.fit: algorithm did not converge
  5: glm.fit: fitted probabilities numerically 0 or 1 occurred

  cvEstimate

  Call:
  train.formula(form = Species ~ ., data = iris, method = "glm",
      metric = "ROC", trControl = ctrl)

  100 samples
    4 predictors

  Pre-processing:
  Resampling: Cross-Validation (10 fold)

  Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

  Resampling results

    Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
    0.96  0.98  0.86  0.0843   0.0632   0.126

and for random forest:

  set.seed(1)
  rfEstimate <- train(Species ~ .,
                      data = iris,
                      method = "rf",
                      metric = "ROC",
                      tuneGrid = data.frame(.mtry = 2),
                      trControl = ctrl)
  Fitting: mtry=2
  Aggregating results
  Selecting tuning parameters
  Fitting model on full training set

  rfEstimate

  Call:
  train.formula(form = Species ~ ., data = iris, method = "rf",
      metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

  100 samples
    4 predictors

  Pre-processing:
  Resampling: Cross-Validation (10 fold)

  Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

  Resampling results

    Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
    0.94  0.92  0.898  0.0966   0.14     0.00632

  Tuning parameter 'mtry' was held constant at a value of 2

--
Max
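The gap between a re-substitution AUC and a cross-validated one can be reproduced without caret. The sketch below uses simulated data in which the outcome is unrelated to the predictors, so any apparent signal comes from re-predicting the training data. Everything in it (the data, the helper `auc()`, the fold assignment) is an assumption made for illustration and is not taken from the thread.

```r
# Rank-based (Mann-Whitney) AUC for a numeric score and a logical outcome
auc <- function(score, is_event) {
  r  <- rank(score)
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n   <- 100
dat <- data.frame(matrix(rnorm(n * 10), ncol = 10))
dat$y <- rbinom(n, 1, 0.5)          # outcome is pure noise

# Re-substitution: fit and predict on the same rows
fit   <- glm(y ~ ., data = dat, family = binomial)
resub <- auc(predict(fit), dat$y == 1)

# 10-fold cross-validation: each row is predicted by a model that
# did not see it, and the held-out scores are pooled
folds    <- sample(rep(1:10, length.out = n))
cv_score <- numeric(n)
for (k in 1:10) {
  held  <- folds == k
  fit_k <- glm(y ~ ., data = dat[!held, ], family = binomial)
  cv_score[held] <- predict(fit_k, newdata = dat[held, ])
}
cv <- auc(cv_score, dat$y == 1)

c(resubstitution = resub, cross_validated = cv)
```

With a more flexible learner such as a random forest, the re-substitution figure would be expected to sit even closer to 1, which is the behaviour asked about in the question.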
Re: [R] Understanding linear contrasts in Anova using R
These two resources might also help:

  http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
  http://cran.r-project.org/web/packages/contrast/vignettes/contrast.pdf

Max

On Thu, Sep 30, 2010 at 1:33 PM, Ista Zahn iz...@psych.rochester.edu wrote:
> Hi Professor Howell,
>
> I think the issue here is simply in the assumption that the regression
> coefficients will always be equal to the product of the means and the
> contrast codes. I tend to think of regression coefficients as the
> quotient of the covariance of x and y divided by the variance of x,
> and this definition agrees with the coefficients calculated by lm().
> See below for a long-winded example.
>
> On Wed, Sep 29, 2010 at 3:42 PM, David Howell david.how...@uvm.edu wrote:
>> # I am trying to understand how R fits models for contrasts in a
>> # simple one-way anova. This is an example, I am not stupid enough to
>> # want to simultaneously apply all of these contrasts to real data.
>> # With a few exceptions, the tests that I would compute by hand (or
>> # by other software) will give the same t or F statistics. It is the
>> # contrast estimates that R produces that I can't seem to understand.
>> #
>> # In searching for answers to this problem, I found a great
>> # PowerPoint slide (I think by John Fox). The slide pointed to the
>> # coefficients, said something like "these are coeff. that no one
>> # could love," and then suggested looking at the means to understand
>> # where they came from. I have stared and stared at his means and
>> # then my means, but can't find a relationship.
>>
>> # The following code and output illustrates the problem.
>>
>> # Various examples of Anova using R
>>
>> dv <- c(1.28, 1.35, 3.31, 3.06, 2.59, 3.25, 2.98, 1.53, -2.68, 2.64,
>>         1.26, 1.06, -1.18, 0.15, 1.36, 2.61, 0.66, 1.32, 0.73, -1.06,
>>         0.24, 0.27, 0.72, 2.28, -0.41, -1.25, -1.33, -0.47, -0.60,
>>         -1.72, -1.74, -0.77, -0.41, -1.20, -0.31, -0.74, -0.45, 0.54,
>>         -0.98, 1.68, 2.25, -0.19, -0.90, 0.78, 0.05, 2.69, 0.15, 0.91,
>>         2.01, 0.40, 2.34, -1.80, 5.00, 2.27, 6.47, 2.94, 0.47, 3.22,
>>         0.01, -0.66)
>>
>> group <- factor(rep(1:5, each = 12))
>>
>> # Use treatment contrasts to compare each group to the first group.
>> options(contrasts = c("contr.treatment","contr.poly")) # The default
>> model2 <- lm(dv ~ group)
>> summary(model2)
>> # Summary table is the same--as it should be
>> # Intercept is Group 1 mean and other coeff. are deviations from that.
>> # This is what I would expect.
>> #summary(model1)
>> #              Df Sum Sq Mean Sq F value    Pr(>F)
>> # group         4  62.46 15.6151  6.9005 0.0001415 ***
>> # Residuals    55 124.46  2.2629
>> #Coefficients:
>> #             Estimate Std. Error t value Pr(>|t|)
>> #(Intercept)  1.80250    0.43425   4.151 0.000116 ***
>> #group2      -1.12750    0.61412  -1.836 0.071772 .
>> #group3      -2.71500    0.61412  -4.421 4.67e-05 ***
>> #group4      -1.25833    0.61412  -2.049 0.045245 *
>> #group5       0.08667    0.61412   0.141 0.888288
>>
>> # Use sum contrasts to compare each group against grand mean.
>> options(contrasts = c("contr.sum","contr.poly"))
>> model3 <- lm(dv ~ group)
>> summary(model3)
>>
>> # Again, this is as expected. Intercept is grand mean and others are
>> # deviations from that.
>> #Coefficients:
>> #             Estimate Std. Error t value Pr(>|t|)
>> # (Intercept)   0.7997     0.1942   4.118 0.000130 ***
>> # group1        1.0028     0.3884   2.582 0.012519 *
>> # group2       -0.1247     0.3884  -0.321 0.749449
>> # group3       -1.7122     0.3884  -4.408 4.88e-05 ***
>> # group4       -0.2555     0.3884  -0.658 0.513399
>>
>> #SO FAR, SO GOOD
>>
>> # IF I wanted polynomial contrasts BY HAND I would use
>> # a(j) = -2 -1 0 1 2 for linear contrast (or some linear function of this)
>> # Effect = Sum(a(j)M(j))   # where M = mean
>> # Effect(linear) = -2(1.805) -1(0.675) +0(-.912) +1(.544) +2(1.889) = 0.043
>> # SS(linear) = n*(Effect(linear)^2)/Sum((a(j)^2)) = 12(.043^2)/10 = .002
>> # F(linear) = SS(linear)/MS(error) = .002/2.263 = .001
>> # t(linear) = sqrt(.001) = .031
>>
>> # To do this in R I would use
>> order.group <- ordered(group)
>> model4 <- lm(dv~order.group)
>> summary(model4)
>> # This gives:
>> #Coefficients:
>> #               Estimate Std. Error t value Pr(>|t|)
>> # (Intercept)    0.79967    0.19420   4.118 0.000130 ***
>> # order.group.L  0.01344    0.43425   0.031 0.975422
>> # order.group.Q  2.13519    0.43425   4.917 8.32e-06 ***
>> # order.group.C  0.11015    0.43425   0.254 0.800703
>> # order.group^4 -0.79602    0.43425  -1.833 0.072202 .
>>
>> # The t value for linear is same as I got (as are others) but I don't
>> # understand the estimates. The intercept is the grand mean, but I
>> # don't see the relationship of other estimates to that or to the
>> # ones I get by hand.
>> # My estimates are the sum of (coeff times means) i.e. 0 (intercept),
>> # .0425, 7.989, .3483, -6.66
>> # and these are not a linear (or other nice pretty) function of est.
>> # from R.
>
> # OK, let's break it down
> Means <- tapply(dv, order.group,
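One way to connect the hand-computed effects with the lm() estimates, offered here as a sketch and not as the explanation given in the thread (Ista Zahn's worked example is cut off above): `contr.poly()` returns contrast columns scaled to unit length, so with equal group sizes each coefficient equals the sum of (normalized contrast code times group mean). The linear codes -2, -1, 0, 1, 2 have squared length 10, so the hand value 0.0425 divided by sqrt(10) gives about 0.0134, which matches the `order.group.L` estimate. The object names `by_hand`, `raw_linear` and `scaled_linear` are made up for this illustration.

```r
dv <- c(1.28, 1.35, 3.31, 3.06, 2.59, 3.25, 2.98, 1.53, -2.68, 2.64,
        1.26, 1.06, -1.18, 0.15, 1.36, 2.61, 0.66, 1.32, 0.73, -1.06,
        0.24, 0.27, 0.72, 2.28, -0.41, -1.25, -1.33, -0.47, -0.60,
        -1.72, -1.74, -0.77, -0.41, -1.20, -0.31, -0.74, -0.45, 0.54,
        -0.98, 1.68, 2.25, -0.19, -0.90, 0.78, 0.05, 2.69, 0.15, 0.91,
        2.01, 0.40, 2.34, -1.80, 5.00, 2.27, 6.47, 2.94, 0.47, 3.22,
        0.01, -0.66)
order.group <- ordered(rep(1:5, each = 12))

# Group means and the polynomial-contrast fit
means <- as.vector(tapply(dv, order.group, mean))
fit   <- lm(dv ~ order.group,
            contrasts = list(order.group = "contr.poly"))

# contr.poly() columns have unit length, so (balanced design) each
# coefficient is the contrast column applied to the group means
C       <- contr.poly(5)
by_hand <- drop(crossprod(C, means))

# The same thing for the linear term, starting from integer codes
a             <- c(-2, -1, 0, 1, 2)
raw_linear    <- sum(a * means)              # the value computed by hand
scaled_linear <- raw_linear / sqrt(sum(a^2)) # what lm() reports

cbind(lm = coef(fit)[-1], by_hand = by_hand)
c(raw = raw_linear, scaled = scaled_linear)
```

The same rescaling should apply to the other terms, using the squared lengths of the corresponding integer codes (14 for the usual quadratic codes, 10 for cubic and 70 for quartic), which is consistent with the figures 7.989, .3483 and -6.66 quoted in the post. The t statistics are unaffected because the standard errors are rescaled by the same factor.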
Re: [R] Creating publication-quality plots for use in Microsoft Word
You might want to check out the Reproducible Research task view:

  http://cran.r-project.org/web/views/ReproducibleResearch.html

There is a section on Microsoft formats, as well as other formats that can be converted.

Max

On Wed, Sep 15, 2010 at 11:49 AM, Thomas Lumley tlum...@u.washington.edu wrote:
> On Wed, 15 Sep 2010, dadrivr wrote:
>> Thanks for your help, guys. I'm looking to produce a high-quality
>> plot (no jagged lines or other distortions) with a filetype that is
>> accepted by Microsoft Word on a PC and that most journals will
>> accept. That's why I'd prefer to stick with JPEG, TIFF, PNG, or the
>> like. I'm not sure EPS would fly.
>
> One simple approach, which I use when I have to create graphics for MS
> Office while on a non-Windows platform, is to use PNG and set the
> resolution and file size large enough. At 300dpi or so the physics of
> ink on paper does all the antialiasing you need.
>
> Work out how big you want the graph to be, and use PNG with enough
> pixels to get at least 300dpi at that final size. You'll need to set
> the pointsize argument and it will help to set the resolution
> argument.
>
> -thomas
>
> Thomas Lumley
> Professor of Biostatistics
> University of Washington, Seattle

--
Max
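The PNG advice quoted above might look like the following in base R. This is a sketch under assumptions: the 6 x 4 inch size, the 10 point font and the helper name `save_png()` are illustrative choices, and the example only runs the device if this R build reports PNG support.

```r
# Write a plot to a PNG sized in inches at a given resolution
save_png <- function(file, draw, width_in = 6, height_in = 4,
                     dpi = 300, pointsize = 10) {
  png(file, width = width_in, height = height_in, units = "in",
      res = dpi, pointsize = pointsize)
  on.exit(dev.off())     # always close the device, even on error
  draw()
  invisible(file)
}

if (capabilities("png")) {
  out <- file.path(tempdir(), "figure.png")
  # 6 in x 300 dpi = 1800 pixels wide, 4 in x 300 dpi = 1200 pixels high
  save_png(out, function() plot(cars, main = "Stopping distance vs speed"))
}
```

Specifying the size in inches together with `res` lets R work out the pixel dimensions, so the text and line widths stay in proportion when the image is placed in the document at its intended size.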
Re: [R] Reproducible research
A Reproducible Research CRAN task view was recently created:

  http://cran.r-project.org/web/views/ReproducibleResearch.html

I will be updating it with some of the information in this thread.

thanks,

Max

On Thu, Sep 9, 2010 at 11:41 AM, Matt Shotwell shotw...@musc.edu wrote:
> Well, the attachment was a dud. Try this:
>
>   http://biostatmatt.com/R/markup_0.0.tar.gz
>
> -Matt
>
> On Thu, 2010-09-09 at 10:54 -0400, Matt Shotwell wrote:
>> I have a little package I've been using to write template blog posts
>> (in HTML) with embedded R code. It's quite small but very flexible
>> and extensible, and aims to do something similar to Sweave and brew.
>> In fact, the package is heavily influenced by the brew package,
>> though implemented quite differently. It depends on the evaluate
>> package, available in the CRAN. The tentatively titled 'markup'
>> package is attached. After it's installed, see ?markup and the few
>> examples in the inst/ directory, or just example(markup).
>>
>> -Matt
>>
>> On Thu, 2010-09-09 at 01:47 -0400, David Scott wrote:
>>> I am investigating some approaches to reproducible research. I need
>>> in the end to produce .html or .doc or .docx. I have used hwriter in
>>> the past but have had some problems with verbatim output from R.
>>> Tables are also not particularly convenient.
>>>
>>> I am interested in R2HTML and R2wd in particular, and possibly
>>> odfWeave. Does anyone have sample documents using any of these
>>> approaches which they could let me have?
>>>
>>> David Scott
>>>
>>> David Scott
>>> Department of Statistics
>>> The University of Auckland, PB 92019
>>> Auckland 1142, NEW ZEALAND
>>> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>>> Email: d.sc...@auckland.ac.nz, Fax: +64 9 373 7018
>>> Director of Consulting, Department of Statistics
>
> --
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina

--
Max
Re: [R] createDataPartition
Trafim,

You'll get more answers if you adhere to the posting guide and tell us your version information and other necessary details. For example, this function is in the caret package (but nobody but me probably knows that =]).

The first argument should be a vector of outcome values (not the possible classes). For the iris data, this means something like:

  createDataPartition(iris$Species)

if you were trying to predict the species. The function does stratified splitting; the data are split into training and test sets within each class, then the results are aggregated to get the entire training set indicators. Setting a proportion per class won't do anything.

Look at the man page or the (4) package vignettes for examples.

Max

On Thu, Sep 9, 2010 at 7:52 AM, Trafim Vanishek rdapam...@gmail.com wrote:
> Dear all,
>
> does anyone know how to define the structure of the required samples
> using function createDataPartition, meaning proportions of different
> types of variable in the partition? Something like this for iris data:
>
>   createDataPartition(y = c(setosa = .5, virginica = .3, versicolor = .2),
>                       times = 10, p = .7, list = FALSE)
>
> Thanks a lot for your help.
>
> Regards,
> Trafim

--
Max
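The stratified splitting described in the reply (sample within each class, then pool the row numbers) can be sketched in base R. This is an illustration of the idea only, not caret's implementation; the helper name `stratified_index()` is invented here, and it assumes a factor outcome.

```r
# Draw a proportion p of the rows within each class and pool the
# row numbers into a single training-set index
stratified_index <- function(y, p = 0.7) {
  y <- as.factor(y)
  by_class <- split(seq_along(y), y)          # row numbers per class
  picked <- lapply(by_class, function(rows) {
    n_take <- round(p * length(rows))
    rows[sample.int(length(rows), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
in_train <- stratified_index(iris$Species, p = 0.7)

length(in_train)                 # size of the training set
table(iris$Species[in_train])    # same count in every class
```

Because the same proportion is drawn from every class, the class distribution of the training set mirrors that of the full data, which is why a separate proportion per class is not part of the interface discussed above.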
Re: [R] several odfWeave questions
Ben,

> 1a. am I right in believing that odfWeave does not respect the
> 'keep.source' option? Am I missing something obvious?

I believe it does, since this gets passed directly to Sweave.

> 1b. is there a way to set global options analogous to \SweaveOpts{}
> directives in Sweave? (I looked at odfWeaveControl, it doesn't seem to
> do it.)

Yes. There are examples of this in the 'examples' package directory.

> 2. I tried to write a Makefile directive to process files from the
> command line:
>
>   %.odt: %_in.odt
>           $(RSCRIPT) -e "library(odfWeave); odfWeave(\"$*_in.odt\",\"$*.odt\");"
>
> This works, *but* the resulting output file gives a warning (The file
> 'odftest2.odt' is corrupt and therefore cannot be opened.
> OpenOffice.org can try to repair the file ...). Based on looking at
> the contents, it seems that a spurious/unnecessary 'Rplots.pdf' file
> is getting created and zipped in with the rest of the archive; when I
> unzip, delete the Rplots.pdf file and re-zip, the ODT file opens
> without a warning. Obviously I could post-process but it would be
> nicer to find a workaround within R ...

Get the latest version from R-Forge. I haven't gotten this fix onto CRAN yet (I've been on a caret streak lately).

> 3. I find the requirement that all file paths be specified as absolute
> rather than relative paths somewhat annoying -- I understand the
> reason, but it goes against one practice that I try to encourage for
> reproducibility, which is *not* to use absolute file paths -- when
> moving a same set of data and analysis files across computers, it's
> hard to enforce them all ending up in the same absolute location,
> which then means that the recipient has to edit the ODT file. It would
> be nice if there were hooks for read.table() and load() as there are
> for plotting and package/namespace loading -- then one could just copy
> them into the working directory on the fly. Has anyone experienced
> this/thought of any workarounds? (I guess one solution is to zip any
> necessary source files into the archive beforehand, as illustrated in
> the vignette.)

You can set the working directory with the (wait for it...) 'workDir' argument. Using 'workDir = getwd()' will pack and unpack the files in the current location and you wouldn't need to worry about setting the path. I use the temp directory because I started over-writing files.

Max
Re: [R] odfWeave Issue.
> What does this mean?

It's impossible to tell. Read the posting guide and figure out all the details that you left out. If we don't have more information, you should have low expectations about the quality of any replies you might get.

--
Max
Re: [R] UseR! 2010 - my impressions
Not to beat a dead horse...

I've found that I like the useR conferences more than most statistics conferences. This isn't due to the difference in content, but the difference in the audience and the environment.

For example, everyone is at useR because of their appreciation of R. At most other conferences, there is a much wider focus of topics and less group cohesion. Given this, I think that the environment is more congenial. I've had many discussions with people who are in completely different fields than myself (e.g. imaging, forestry, physics, etc) that would be less likely to occur at other scientific meetings.

Another difference between useR and the average (statistics) conference is that the network effect is stronger. I believe that there is a much higher likelihood that a random person is acquainted with a different random attendee. This could be because we've used their package, they run a local RUG or they are one of the principal people who drive R (Uwe, Kurt, etc).

Anyway, well done.

Max

On Mon, Jul 26, 2010 at 11:49 AM, Tal Galili tal.gal...@gmail.com wrote:
> Dear Ravi - I echo everything you wrote, useR2010 was an amazing
> experience (for me, and for many others with whom I have spoken about
> it). Many thanks should go to the wonderful people who put their
> efforts into making this conference a reality (and Kate is certainly
> one of them). Thank you for expressing feelings I had using your own
> words.
>
> Best,
> Tal
>
> Contact Details:
> Contact me: tal.gal...@gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
>
> On Sat, Jul 24, 2010 at 2:50 AM, Ravi Varadhan rvarad...@jhmi.edu wrote:
>> Dear UseRs!,
>>
>> Everything about UseR! 2010 was terrific! I really mean everything -
>> the tutorials, invited talks, kaleidoscope sessions, focus sessions,
>> breakfast, snacks, lunch, conference dinner, shuttle services, and
>> the participants. The organization was fabulous. NIST were gracious
>> hosts, and provided top notch facilities. The rousing speech by
>> Antonio Possolo, who is the chief of Statistical Engineering Division
>> at NIST, set the tempo for the entire conference. Excellent invited
>> lectures by Luke Tierney, Frank Harrell, Mark Handcock, Diethelm
>> Wurtz, Uwe Ligges, and Fritz Leisch. All the sessions that I attended
>> had many interesting ideas and useful contributions. During the whole
>> time that I was there, I could not help but get the feeling that I am
>> a part of something great.
>>
>> Before I end, let me add a few words about a special person. This
>> conference would not have been as great as it was without the
>> tireless efforts of Kate Mullen. The great thing about Kate is that
>> she did so much without ever hogging the limelight.
>>
>> Thank you, Kate and thank you NIST!
>>
>> I cannot wait for UseR!2011!
>>
>> Best,
>> Ravi.
>>
>> Ravi Varadhan, Ph.D.
>> Assistant Professor, Division of Geriatric Medicine and Gerontology
>> School of Medicine
>> Johns Hopkins University
>> Ph. (410) 502-2619
>> email: rvarad...@jhmi.edu

--
Max
Re: [R] Random Forest - Strata
The index indicates which samples should go into the training set. However, you are using out of bag sampling, so it would use the whole training set and return the OOB error (instead of the error estimates that would be produced by resampling via the index). Which do you want? OOB estimates or other estimates? Based on your previous email, I figured you would have an index list with three sets of sample indicies for sites A+B, sites A+C and sites B+C. In this way you would do three resamples: the first fits using data from sites A B, then predicts on C (and so on). In this way, the resampled error estimates would be based on the average of the three hold-out sets (actually hold-out sites). OOB error doesn't sound like what you want. MAx On Tue, Jul 27, 2010 at 2:46 PM, Coll gbco...@gmail.com wrote: Thanks for all the help. I had tried using the index in caret to try to dictate which rows of the sample would be used in each of the tree building in RF. (e.g. use all data from A B site for training, hold out all data from C site for testing etc) However after running, when I cross-checked the index that goes to train function and the inbag in the resulting randomForest object, I found the two didn't match. 
As shown below:

data(iris)
tmpIrisIndex <- createDataPartition(iris$Species, p = 0.632, times = 10)
head(tmpIrisIndex, 3)
[[1]]
 [1]   1   2   3   7  10  11  12  13  16  18  20  22  24  25  26  27  28  29  31
[20]  34  35  36  37  38  39  40  41  43  46  47  48  50  52  53  55  56  57  58
[39]  61  64  65  66  67  68  69  71  74  75  76  77  79  82  83  84  85  86  88
[58]  90  91  92  94  96  98  99 102 103 104 106 108 109 111 112 113 114 115 116
[77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146 147
[96] 150

[[2]]
 [1]   1   3   6   7   8  10  12  13  14  16  18  20  21  22  23  24  26  27  28
[20]  29  30  32  34  35  36  38  42  44  46  47  48  50  51  53  54  55  58  60
[39]  61  62  67  68  69  70  72  73  74  76  77  79  81  82  83  85  86  88  89
[58]  90  92  93  95  97  99 100 103 104 105 107 108 109 111 112 113 114 117 119
[77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145 147
[96] 149

[[3]]
 [1]   1   5   7   9  10  11  12  14  18  20  21  22  23  24  26  29  30  31  33
[20]  34  35  36  37  38  39  40  44  45  46  47  48  49  51  52  53  54  56  58
[39]  61  63  65  66  69  70  72  74  75  76  77  78  79  80  82  83  85  86  87
[58]  90  91  92  93  94  98 100 102 103 105 106 107 109 110 113 114 115 116 117
[77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142 146
[96] 150

irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
rf.iris.obj <- train(Species ~ ., data = iris, method = "rf", ntree = 10, keep.inbag = TRUE, trControl = irisTrControl)
Fitting: mtry=2
Fitting: mtry=3
Fitting: mtry=4
head(rf.iris.obj$finalModel$inbag, 20)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    1    0    0    0    1    0    1     1
 [2,]    1    1    1    1    1    0    1    0    1     0
 [3,]    1    1    1    0    0    1    1    0    0     0
 [4,]    1    0    1    0    1    1    0    1    0     1
 [5,]    0    1    1    1    1    1    0    1    0     1
 [6,]    1    1    0    1    0    0    1    1    1     0
 [7,]    1    1    0    0    1    1    0    0    0     0
 [8,]    1    1    1    1    1    0    1    1    1     1
 [9,]    1    1    0    1    0    1    0    1    1     0
[10,]    1    1    1    0    1    1    0    0    0     1
[11,]    1    1    1    1    1    1    1    0    1     0
[12,]    1    1    1    1    1    0    1    0    1     1
[13,]    1    0    1    1    1    1    1    1    0     1
[14,]    0    1    1    1    0    1    0    0    0     0
[15,]    1    1    1    1    1    1    1    1    1     0
[16,]    1    1    0    0    0    0    1    0    1     1
[17,]    1    0    1    0    0    0    1    1    0     1
[18,]    1    0    1    1    1    1    1    1    1     1
[19,]    1    0    1    0    1    1    1    0    1     1
[20,]    1    0    1    0    1    1    1    0    1     0
My understanding is that the 1st tree in the RF should be built with tmpIrisIndex[[1]], i.e. 1 2 3 7 10 11 12 13 ... But the inbag in the resulting forest shows that the 1st tree is using 1 2 3 4 6 7 8 9 ... Why does the index passed to train not match the inbag in the rf object? Or am I looking in the wrong place to check this? Any help / comments would be appreciated. Thanks a lot. Regards, Coll -- View this message in context: http://r.789695.n4.nabble.com/Random-Forest-Strata-tp2295731p2303958.html
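The index list Max describes in his reply (one set of training rows per held-out site) might be built along these lines. This is only a sketch in base R: the data frame dat, its site column and the names siteIndex and holdOut are made up for illustration and do not come from the thread.

```r
# Hypothetical data: three sites with 4, 3 and 5 rows (invented for illustration).
set.seed(1)
dat <- data.frame(
  site = rep(c("A", "B", "C"), times = c(4, 3, 5)),
  y    = rnorm(12)
)

sites <- unique(as.character(dat$site))

# One element per held-out site: the row numbers of all *other* sites,
# i.e. the rows that would be used for fitting in that resample.
siteIndex <- lapply(sites, function(s) which(dat$site != s))
names(siteIndex) <- paste0("holdout_", sites)

# The rows left out of each resample (all rows of the held-out site).
holdOut <- lapply(siteIndex, function(idx) setdiff(seq_len(nrow(dat)), idx))

str(siteIndex)
```

Going by the reply, a list shaped like siteIndex is what the index argument of trainControl is meant to receive, together with a resampling method other than "oob" so that the hold-out sites are actually used for the error estimates. That pairing is an assumption here and has not been run against caret.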