from:"Max Kuhn"

[R] failure with merge

2016-07-14 Thread Max Kuhn

I am merging two data frames:

tuneAcc <- structure(list(select = c(FALSE, TRUE), method =
structure(c(1L, 1L), .Label = "GCV.Cp", class = "factor"), RMSE =
c(29.2102056093962, 28.9743318817886), Rsquared =
c(0.0322612161559773, 0.0281713457306074), RMSESD = c(0.981573768028697,
0.791307778398384), RsquaredSD = c(0.0388188469162352,
0.0322578925071113)),
.Names = c("select", "method", "RMSE", "Rsquared", "RMSESD",
"RsquaredSD"),
class = "data.frame", row.names = 1:2)

finalTune <- structure(list(select = TRUE, method = structure(1L,
.Label = "GCV.Cp", class = "factor"), Selected = "*"), .Names =
c("select", "method", "Selected"), row.names = 2L, class = "data.frame")

using

   merge(x = tuneAcc, y = finalTune, all.x = TRUE)

The error is

  "Error in match.arg(method) : 'arg' must be NULL or a character vector"

This is R version 3.3.1 (2016-06-21), Platform: x86_64-apple-darwin13.4.0
(64-bit), Running under: OS X 10.11.5 (El Capitan).



These do not stop execution:

  merge(x = tuneAcc, y = finalTune)
  merge(x = tuneAcc, y = finalTune, all.x = TRUE, sort = FALSE)

The latter produces (what I consider to be) incorrect results.

Walking through the code, the original call with just `all.x = TRUE` fails
when sorting at the line:

  res <- res[if (all.x || all.y)
                 do.call("order", x[, seq_len(l.b), drop = FALSE]) else
                 sort.list(bx[m$xi]), , drop = FALSE]

Specifically, on the `do.call` bit. For these data:

  Browse[3]> x
    select method     RMSE   Rsquared    RMSESD RsquaredSD
  2   TRUE GCV.Cp 28.97433 0.02817135 0.7913078 0.03225789
  1  FALSE GCV.Cp 29.21021 0.03226122 0.9815738 0.03881885

  Browse[3]> x[, seq_len(l.b), drop = FALSE]
    select method
  2   TRUE GCV.Cp
  1  FALSE GCV.Cp

and this line executes:

  Browse[3]> order(x[, seq_len(l.b), drop = FALSE])
  [1] 1 2 3 4

although nrow(x) = 2 so this is an issue.

Calling it this way stops execution:

Browse[3]> do.call("order", x[, seq_len(l.b), drop = FALSE])
Error in match.arg(method) : 'arg' must be NULL or a character vector
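
For readers following along, here is a minimal sketch of what appears to be going on. It assumes that `order()` has a formal argument named `method`: `do.call()` passes the data frame's columns as named arguments, so a column called `method` would be matched to that formal instead of being used as a sort key. The toy data frame `x` and the `unname()` workaround are illustrative only and are not a description of how `merge()` itself is or should be written.

```r
# Toy stand-in for the first two columns shown in the browser output
x <- data.frame(select = c(TRUE, FALSE),
                method = factor(c("GCV.Cp", "GCV.Cp")))

# Columns are passed as named arguments, so `method` can collide with
# order()'s own `method` argument; capture any error instead of stopping
failed <- tryCatch(do.call("order", x),
                   error = function(e) conditionMessage(e))

# Dropping the names turns every column into a positional sort key
idx <- do.call("order", unname(as.list(x)))
idx
x[idx, ]
```

With the names removed, the result has one index per row, which is what the sorting step needs.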

Thanks,

Max

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Installing Caret

2016-06-16 Thread Max Kuhn

The problem is not with `caret`. Your output says:

 > installation of package ‘minqa’ had non-zero exit status

`caret` has a dependency that has a dependency on `minqa`. The same is true
for `RcppEigen` and the others.

What code did you use to do the install? What OS and version of R, etc.?


On Thu, Jun 16, 2016 at 4:49 AM, TJUN KIAT TEO  wrote:

> I am trying to install the package but I keep getting these error
> messages:
>
>   installation of package ‘minqa’ had non-zero exit status
>
> 2: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘RcppEigen’ had non-zero exit status
>
> 3: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘SparseM’ had non-zero exit status
>
> 4: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘lme4’ had non-zero exit status
>
> 5: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘quantreg’ had non-zero exit status
>
> 6: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘pbkrtest’ had non-zero exit status
>
> 7: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘car’ had non-zero exit status
>
> 8: In install.packages("caret", repos = "http://cran.stat.ucla.edu/") :
>   installation of package ‘caret’ had non-zero exit status
>
> Does anyone have any idea what is wrong?
>
> Tjun Kiat

Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn

I've brought this up numerous times... you shouldn't use `predict.rpart`
(or whatever modeling function) from the `finalModel` object. That object
has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy
variables. `rpart` does not require that and the `finalModel` object has no
idea that that happened. Using `predict.train` works just fine so why not
use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.583   -1217.3
                3                 3         6         3
-886.6667  -408.375  -375.7 -240.307692307692
        5         1       4                 5
-201.612903225806 -19.6071428571429  30.80833  43.9
   307266 9
151.5  209.647058823529
628
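
To illustrate the dummy-variable point above, here is a small base-R sketch. It only shows what a formula-based expansion of a factor looks like; the toy data frame `toy` and its columns are hypothetical, and what `train` does internally may differ in detail.

```r
# Hypothetical toy data; `sector` mimics the factor predictor in the thread
toy <- data.frame(delay  = c(-10, 0, 5, 20),
                  sector = factor(c("Hospitals", "Schools",
                                    "Hospitals", "Schools")))

# The formula expands the factor into indicator (dummy) columns
mm <- model.matrix(delay ~ sector, data = toy)
colnames(mm)

# A model fit on `mm` knows about "sectorSchools", not "sector"
"sector" %in% colnames(mm)
```

A model fit to the expanded columns would look for `sectorSchools` in new data, so handing it a data frame that still contains the raw `sector` factor cannot work. Predicting through the object that did the expansion avoids the mismatch.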

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <
muhammad2.bi...@live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #
> tree_pred.sse
>
> ...
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> *muhammad2.bi...@live.uwe.ac.uk* <olugbenga2.akin...@live.uwe.ac.uk>
>
>
> --
> *From:* Max Kuhn <mxk...@gmail.com>
> *Sent:* 09 May 2016 17:22:22
> *To:* Muhammad Bilal
> *Cc:* Bert Gunter; r-help@r-project.org
>
> *Subject:* Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
> muhammad2.bi...@live.uwe.ac.uk> wrote:
>
>> Hi Bert,
>>
>> Thanks for the response.
>>
>> I checked the datasets, however, the Hospitals level appears in both of
>> them. See the output below:
>>
>> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        9
>> 2        Hospitals      101
>> 3          Housing       32
>> 4           Others       99
>> 5 Public Buildings       39
>> 6          Schools      148
>> 7      Social Care       10
>> 8   Transportation       27
>> 9            Waste       26
>> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        5
>> 2        Hospitals       47
>> 3          Housing       11
>> 4           Others       44
>> 5 Public Buildings       18
>> 6          Schools       69
>> 7      Social Care        9
>> 8   Transportation        8
>> 9            Waste       12
>>
>> Any thing else to try?
>>
>> --
>> Muhammad Bilal
>> Research Fellow and Doctoral Researcher,
>> Bristol Enterprise, Research, and Innovation Centre (BERIC),
>> University of the West of England (UWE),
>> Frenchay Campus,
>> Bristol,
>> BS16 1QY
>>
>> muhammad2.bi...@live.uwe.ac.uk
>>
>>
>> 
>> From: Bert Gunter <bgunter.4...@gmail.com>

Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn

It is extremely difficult to tell what the issue might be without a
reproducible example.

The only thing that I can suggest is to use the non-formula interface to
`train` so that you can avoid creating dummy variables.

On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
muhammad2.bi...@live.uwe.ac.uk> wrote:

> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets, however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        9
> 2        Hospitals      101
> 3          Housing       32
> 4           Others       99
> 5 Public Buildings       39
> 6          Schools      148
> 7      Social Care       10
> 8   Transportation       27
> 9            Waste       26
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        5
> 2        Hospitals       47
> 3          Housing       11
> 4           Others       44
> 5 Public Buildings       18
> 6          Schools       69
> 7      Social Care        9
> 8   Transportation        8
> 9            Waste       12
>
> Any thing else to try?
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
>
> 
> From: Bert Gunter 
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help@r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
>  wrote:
> > Hi All,
> >
> > I have the following script, that raises error at the last command. I am
> new to R and require some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> >
> > #Structure of the trainPFI data frame
> >> str(trainPFI)
> > ***
> > 'data.frame': 491 obs. of  16 variables:
> >  $ project_id : int  1 2 3 6 7 9 10 12 13 14 ...
> >  $ project_lat: num  51.4 51.5 52.2 51.9 52.5 ...
> >  $ project_lon: num  -0.642 -1.85 0.08 -0.401 -1.888 ...
> >  $ sector : Factor w/ 9 levels "Defense","Hospitals",..:
> 4 4 4 6 6 6 6 6 6 6 ...
> >  $ contract_type  : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey"
> ...
> >  $ project_duration   : int  1826 3652 121 730 730 790 522 819 998
> 372 ...
> >  $ project_delay  : int  -323 0 -60 0 0 0 -91 0 0 7 ...
> >  $ capital_value  : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5
> 60.5 78 ...
> >  $ project_delay_pct  : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> >  $ delay_type : Ord.factor w/ 9 levels "7 months early &
> beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> project_duration + sector + contract_type + capital_value, data = trainPFI,
> method="rpart", trControl=tr.control, tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp RMSE  Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.551
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.551
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using  the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > Alright, all the aforementioned commands worked fine.
> >
> > Except the subsequent command raises error, when the developed model is
> used to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> >

Re: [R] Mixture Discriminant Analysis and Penalized LDA

2016-01-25 Thread Max Kuhn

There is a function called `smda` in the sparseLDA package that implements
the model described in Clemmensen, L., Hastie, T., Witten, D. and Ersbøll,
B. Sparse discriminant analysis, Technometrics, 53(4): 406-413, 2011

Max

On Sun, Jan 24, 2016 at 10:45 PM, TJUN KIAT TEO 
wrote:

> Hi
>
> I noticed we have MDA and Mclust for Mixture Discriminant Analysis and
> Penalized LDA. Do we have an R package for Penalized MDA?
>
> Tjun Kiat

Re: [R] Caret - Recursive Feature Elimination Error

2015-12-23 Thread Max Kuhn

Providing a reproducible example and the results of `sessionInfo` will help
get your question answered.

Also, what is the point of using glmnet with RFE? It already does feature
selection.

On Wed, Dec 23, 2015 at 1:48 AM, Manish MAHESHWARI  wrote:

> Hi,
>
> I am trying to use caret, for feature selection on glmnet. I get a strange
> error like below - "arguments imply differing number of rows: 2, 3".
>
>
> x <- data.matrix(train[,features])
>
> y <- train$quoteconversion_flag
>
>
>
> > str(x)
>
>  num [1:260753, 1:297] NA NA NA NA NA NA NA NA NA NA ...
>
>  - attr(*, "dimnames")=List of 2
>
>   ..$ : NULL
>
>   ..$ : chr [1:297] "original_quote_date" "field6" "field7" "field8" ...
>
> > str(y)
>
>  Factor w/ 2 levels "X0","X1": 1 1 1 1 1 1 1 1 1 1 ...
>
> > RFE <- rfe(x,y,sizes = seq(50,300,by=10),
> +metric = "ROC",maximize=TRUE,rfeControl = MyRFEcontrol,
> +method='glmnet',
> +tuneGrid = expand.grid(.alpha=0,.lambda=c(0.01,0.02)),
> +trControl = MyTrainControl)
> +(rfe) fit Resample01 size: 297
> +(rfe) fit Resample02 size: 297
> +(rfe) fit Resample03 size: 297
> +(rfe) fit Resample04 size: 297
> +(rfe) fit Resample05 size: 297
> +(rfe) fit Resample06 size: 297
> +(rfe) fit Resample07 size: 297
> +(rfe) fit Resample08 size: 297
> +(rfe) fit Resample09 size: 297
> +(rfe) fit Resample10 size: 297
> +(rfe) fit Resample11 size: 297
> +(rfe) fit Resample12 size: 297
> +(rfe) fit Resample13 size: 297
> +(rfe) fit Resample14 size: 297
> +(rfe) fit Resample15 size: 297
> +(rfe) fit Resample16 size: 297
> +(rfe) fit Resample17 size: 297
> +(rfe) fit Resample18 size: 297
> +(rfe) fit Resample19 size: 297
> +(rfe) fit Resample20 size: 297
> +(rfe) fit Resample21 size: 297
> +(rfe) fit Resample22 size: 297
> +(rfe) fit Resample23 size: 297
> +(rfe) fit Resample24 size: 297
> +(rfe) fit Resample25 size: 297
> Error in { :
>   task 1 failed - "task 1 failed - "arguments imply differing number of
> rows: 2, 3""
> In addition: There were 50 or more warnings (use warnings() to see the
> first 50)
>
> Any idea what does this mean?
>
> Thanks,
> Manish
>

Re: [R] Error in 'Contrasts<-' while using GBM.

2015-11-29 Thread Max Kuhn

Providing a reproducible example and the results of `sessionInfo` will help
get your question answered.

My only guess is that one or more of your predictors are factors and that
the in-sample data (used to build the model during resampling) have
different levels than the holdout samples.
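
As a rough illustration of how this kind of error can arise, the base-R sketch below builds a design matrix from a subset in which a factor has only one level. The data frame `d` and the helper `few_levels()` are hypothetical and are not part of caret or gbm.

```r
# Hypothetical data: factor `f` has only one level in this subset
d <- data.frame(y = c(1, 2, 3), f = factor(c("a", "a", "a")))

# Building the design matrix fails because no contrast can be formed
res <- tryCatch(model.matrix(y ~ f, data = d), error = function(e) e)
inherits(res, "error")

# Illustrative helper: flag factor columns with fewer than two
# observed levels
few_levels <- function(df) {
  vapply(df,
         function(col) is.factor(col) && nlevels(droplevels(col)) < 2,
         logical(1))
}
few_levels(d)
```

Running a check like `few_levels()` on the training set, and on a few resamples of it, is one way to look for predictors that could trigger the message.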

Max

On Sat, Nov 28, 2015 at 10:04 PM, Karteek Pradyumna Bulusu <
kartikpradyumn...@gmail.com> wrote:

> Hey,
>
> I was trying to implement Stochastic Gradient Boosting in R. Following is
> my code in rstudio:
>
>
>
> library(caret);
> library(gbm);
> library(plyr);
> library(survival);
> library(splines);
> library(mlbench);
> set.seed(35);
> stack = read.csv("E:/Semester 3/BDA/PROJECT/Sample_SO.csv", head = TRUE, sep = ",");
> dim(stack); #displaying dimensions of the dataset
>
> #SPLITTING TRAINING AND TESTING SET
> totraining <- createDataPartition(stack$ID, p = .6, list = FALSE);
> training <- stack[ totraining,]
> test <- stack[-totraining,]
>
> #PARAMETER SETTING
> t_control <- trainControl(method = "cv", number = 10);
>
> # GLM
> start <- proc.time();
>
> glm = train(ID ~ ., data = training,
>  method = "gbm",
>  metric = "ROC",
>  trControl = t_control,
>  verbose = FALSE)
>
> When I run the last line, I get the following error:
>
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>   contrasts can be applied only to factors with 2 or more levels
>
> Can anyone tell me where I am going wrong and how to rectify it? I would be
> grateful.
>
> Thank you. Looking forward to it.
>
> Regards,
> Karteek Pradyumna Bulusu.

Re: [R] Ensure distribution of classes is the same as prior distribution in Cross Validation

2015-11-24 Thread Max Kuhn

Right now, using `method = "cv"` or `method = "repeatedcv"` does stratified
sampling. Depending on what you mean by "ensure" and the nature of your
outcome (categorical?), it probably already does.
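
For anyone unsure what stratified fold assignment means in practice, here is a minimal base-R sketch. It is an illustration of the idea only, not caret's implementation, and the function name `stratified_folds()` is made up for this example.

```r
# Assign fold ids within each class so that every fold keeps roughly
# the class proportions of the full outcome vector
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    idx <- idx[sample.int(length(idx))]          # shuffle within the class
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

set.seed(1)
y <- factor(rep(c("a", "b"), times = c(80, 20)))
f <- stratified_folds(y, k = 5)
tab <- table(fold = f, class = y)
tab
```

With an 80/20 outcome and five folds, each fold ends up with the same 80/20 split, which is the property the question asks about.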

On Mon, Nov 23, 2015 at 7:04 PM, TJUN KIAT TEO  wrote:

> In the caret trainControl function, is it possible to ensure that the
> distribution of classes in the cross-validation folds is the same as the
> prior distribution? I know it can be done using createFolds, but I was
> wondering if it is possible using trainControl.

Re: [R] Caret Internal Data Representation

2015-11-06 Thread Max Kuhn

Providing a reproducible example and the results of `sessionInfo` will help
get your question answered. For example, did you use the formula or
non-formula interface to `train`?
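
As background on why the interface matters, here is a short base-R sketch of the usual contrast coding for a factor. It assumes R's default treatment contrasts; whether caret or Cubist encodes the data this way in a given call depends on the interface used, so treat it as an illustration rather than a description of either package.

```r
f <- factor(c("a", "b", "c"))

# Default treatment coding: one column per non-reference level
cm <- contrasts(f)
cm

# The encoded predictors implied by a formula can be inspected directly
enc <- model.matrix(~ f)
colnames(enc)
```

Calling `model.matrix()` with the same formula and data as the model is one way to look at an encoded data set.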

On Thu, Nov 5, 2015 at 1:10 PM, Bert Gunter  wrote:

> I am not familiar with caret/Cubist, but assuming they follow the
> usual R procedures that encode categorical factors for conditional
> fitting, you need to do some homework on your own by reading up on the
> use of contrasts in regression.
>
> See ?factor and ?contrasts (and other linked Help as necessary) to see
> what are R's usual procedures, but you will undoubtedly need to
> consult outside statistical references -- the help files will point
> you to some -- to fully understand what's going on. It is not trivial.
>
> Cheers,
> Bert
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>-- Clifford Stoll
>
>
> On Thu, Nov 5, 2015 at 9:38 AM, Lorenzo Isella 
> wrote:
> > Dear All,
> > I have a data set which contains both categorical and numerical
> > variables which I analyze using Cubist+the caret framework.
> > Now, from the generated rules, it is clear that cubist does something
> > to the categorical variables and probably uses some dummy coding for
> > them.
> > However, I cannot right now access the data the way it is transformed
> > by cubist.
> > If caret (or the package) need to do some dummy coding of the factors,
> > how can I access the newly encoded data set?
> > I suppose this applies to plenty of other packages.
> > Any suggestion is welcome.
> > Cheers
> >
> > Lorenzo

Re: [R] Imbalanced random forest

2015-07-29 Thread Max Kuhn

This might help:

http://bit.ly/1MUP0Lj

On Wed, Jul 29, 2015 at 11:00 AM, jpara3 j.para.fernan...@hotmail.com
wrote:

 How can I set up a study with random forest where the response is highly
 imbalanced?




Re: [R] what constitutes a 'complete sentence'?

2015-07-07 Thread Max Kuhn

On Tue, Jul 7, 2015 at 8:19 AM, John Fox j...@mcmaster.ca wrote:

 Dear Peter,

 You're correct that these examples aren't verb phrases (though the second
 one contains a verb phrase). I don't want to make the discussion even more
 pedantic (moving it in this direction was my fault), but "Paragraph" isn't
 quite right, unless explained, because conventionally a paragraph consists
 of sentences.

 How about something like this? "One can use several complete sentences or
 punctuated telegraphic phrases, but only one paragraph (that is, block of
 continuous text with no intervening blank lines). The description should
 end with a full stop (period)."


Before we start crafting better definitions of the rule, it seems important
to understand what issue we are trying to solve. I don't see any place
where this has been communicated. As I said previously, I usually give them
the benefit of the doubt. However, this requirement is poorly implemented
and we need to know more.

For example, does CRAN need to parse the text and the code failed because
there was no period? It seems plausible that someone could have worded that
requirement in the current form, but it is poorly written (which is
unusual).

If the goal is to improve the quality of the description text, then that is
a more difficult issue to define, and good luck coding your way into a
lucid and effective set of rules. It also seems a bit over the top to me
and a poor choice of where everyone should be spending their time.

What are we trying to fix?

 It would likely be helpful to add some examples of good and bad
 descriptions, and to explain how the check actually works.

 Best,
  John

 On Tue, 7 Jul 2015 12:20:38 +0200
  peter dalgaard pda...@gmail.com wrote:
  ...except that there is not necessarily a verb either. What we're
 looking for is something like advertisement style as in
 
  UGLY MUGS 7.95.
 
  An invaluable addition to your display cabinet. Comes in an assortment
 of warts and wrinkles, crafted by professional artist Foo Yung.
 
  However, I'm drawing blanks when searching for an established term for
 it.
 
  Could we perhaps sidestep the issue by requesting a single descriptive
 paragraph, with punctuation or thereabouts?
 
  
 
  I'm still puzzled about what threw Federico's example in the first
 place. The actual code is

  if(strict && !is.na(val <- db["Description"])
     && !grepl("[.!?]['\")]?$", trimws(val)))
      out$bad_Description <- TRUE

  and I can do this

  > strict <- TRUE
  > db <- tools:::.read_description("/tmp/dd")
  > if(strict && !is.na(val <- db["Description"])
  + && !grepl("[.!?]['\")]?$", trimws(val)))
  + out$bad_Description <- TRUE
  > out
  Error: object 'out' not found
 
  I.e., the complaint should _not_ be triggered. I suppose that something
 like a non-breakable space at the end could confuse trimws(), but beyond
 that I'm out of ideas.
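
To make the check discussed above concrete, here is a small sketch that applies the same kind of pattern to a few sample strings. The wrapper `ends_ok()` and the sample descriptions are made up for illustration, and the pattern is a reading of the quoted code rather than a verified copy of the CRAN check.

```r
# Sentence-ending punctuation, optionally followed by a closing quote
# or parenthesis, at the end of the trimmed text
ends_ok <- function(val) grepl("[.!?]['\")]?$", trimws(val))

ok_period  <- ends_ok("Provides functions for analysis of foo.")
no_period  <- ends_ok("Provides functions for analysis of foo")

# A trailing non-breaking space is not removed by trimws() with its
# default settings, so the final period is no longer at the end
nbsp_after <- ends_ok("Provides functions for analysis of foo.\u00a0")

c(ok_period, no_period, nbsp_after)
```

The third case is one way a description that visibly ends with a period could still fail a check of this shape.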
 
 
  On 07 Jul 2015, at 03:28 , John Fox j...@mcmaster.ca wrote:
 
   Dear Peter,
  
   I think that the grammatical term you're looking for is verb phrase.
  
   Best,
   John
  
   On Tue, 7 Jul 2015 00:12:25 +0200
   peter dalgaard pda...@gmail.com wrote:
  
   On 06 Jul 2015, at 23:19 , Duncan Murdoch murdoch.dun...@gmail.com
 wrote:
  
   On 06/07/2015 5:09 PM, Rolf Turner wrote:
   On 07/07/15 07:10, William Dunlap wrote:
  
   [Rolf Turner wrote.]
  
   The CRAN guidelines should be rewritten so that they say what
 they *mean*.
   If a complete sentence is not actually required --- and it seems
 abundantly clear
   that it is not --- then guidelines should not say so.  Rather
 they should say,
   clearly and comprehensibly, what actually *is* required.
  
   This may be true, but also think of the user when you write the
 description.
   If you are scanning a long list of descriptions looking for a
 package to
   use,
   seeing a description that starts with 'A package for' just slows
 you down.
   Seeing a description that includes 'designed to' leaves you
 wondering if the
   implementation is woefully incomplete.  You want to go beyond what
 CRAN
   can test for.
  
   All very true and sound and wise, but what has this got to do with
   complete sentences?  The package checker issues a message saying
 that it
   wants a complete sentence when this has nothing to do with what it
   *really* wants.
  
   That's false.  If you haven't given a complete sentence, you might
 still
   pass, but if you have, you will pass.  That's not nothing to do
 with
   what it really wants, it's just an imperfect test that fails to
 detect
   violations of the guidelines.
  
   As we've seen, it sometimes also makes mistakes in the other
 direction.
   I'd say those are more serious.
  
   Duncan Murdoch
  
  
   Ackchewly
  
   I don't think what we want is what we say that we want. A quick check
 suggests that many/most packages use headline speech, as in "Provides
 functions for analysis of foo, with special emphasis on bar.", which seems
 perfectly ok.  As others have

Re: [R] Caret and custom summary function

2015-05-11 Thread Max Kuhn

The version of caret just put on CRAN has a function called mnLogLoss that
does this.
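
For reference, the quantity itself is easy to compute by hand. The sketch below is a plain base-R version of binary log loss under the usual definition; the function name `log_loss()` and its arguments are hypothetical and it is not the code of `mnLogLoss`.

```r
# obs: 0/1 outcomes; prob_pos: predicted probability of the "1" class
log_loss <- function(obs, prob_pos, eps = 1e-15) {
  p <- pmax(pmin(prob_pos, 1 - eps), eps)   # keep log() finite
  -mean(obs * log(p) + (1 - obs) * log(1 - p))
}

# Uninformative predictions of 0.5 give log(2)
ll <- log_loss(c(1, 0), c(0.5, 0.5))
ll
```

Comparing a hand computation like this against the metric reported during resampling is a quick sanity check on a custom summary function.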

Max

On Mon, May 11, 2015 at 11:17 AM, Lorenzo Isella lorenzo.ise...@gmail.com
wrote:

 Dear All,
 I am trying to implement my own metric (a log loss metric) for a
 binary classification problem in Caret.
 I must be making some mistake, because I cannot get anything sensible
 out of it.
 I paste below a numerical example which should run in more or less one
 minute on any laptop.
 When I run it, I finally have an output of the kind




 Aggregating results
 Something is wrong; all the LogLoss metric values are missing:
     LogLoss
  Min.   : NA
  1st Qu.: NA
  Median : NA
  Mean   :NaN
  3rd Qu.: NA
  Max.   : NA
  NA's   :40
 Error in train.default(x, y, weights = w, ...) : Stopping
 In addition: Warning message:
 In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
   There were missing values in resampled performance measures.




 Any suggestion is appreciated.
 Many thanks

 Lorenzo





 library(caret)
 library(C50)

 LogLoss <- function (data, lev = NULL, model = NULL)
 {
   probs <- pmax(pmin(as.numeric(data$T), 1 - 1e-15), 1e-15)
   logPreds <- log(probs)
   log1Preds <- log(1 - probs)
   real <- (as.numeric(data$obs) - 1)
   out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
   names(out) <- c("LogLoss")
   out
 }

 train <- matrix(ncol=5,nrow=200,NA)

 train <- as.data.frame(train)
 names(train) <- c("donation", "x1","x2","x3","x4")

 set.seed(134)

 sel <- sample(nrow(train), 0.5*nrow(train))

 train$donation[sel] <- "yes"
 train$donation[-sel] <- "no"

 train$x1 <- seq(nrow(train))
 train$x2 <- rnorm(nrow(train))
 train$x3 <- 1/train$x1
 train$x4 <- sample(nrow(train))

 train$donation <- as.factor(train$donation)

 c50Grid <- expand.grid(trials = 1:10,
                        model = c("tree", "rules"),
                        winnow = c(TRUE, FALSE))

 tc <- trainControl(method = "repeatedCV", summaryFunction=LogLoss,
                    number = 10, repeats = 10, verboseIter=TRUE,
                    classProbs=TRUE)

 model <- train(donation~., data=train, method="C5.0", trControl=tc,
                metric="LogLoss", maximize=FALSE, tuneGrid=c50Grid)





Re: [R] Repeated failures to install caret package (of Max Kuhn)

2015-04-04 Thread Max Kuhn

I thought that this might be relevant:

https://stackoverflow.com/questions/28985759/cant-install-the-caret-package-in-r-in-my-linux-machine

but it seems that you installed nloptr.

I would also suggest doing the install in base R and trying a different
mirror. I would avoid installing via RStudio unless you have just started a
new R session.
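
To narrow down which dependencies are still failing, something like the following could be run in a fresh base R session. This is only a sketch: the helper name `missing_pkgs` is made up, the package list is the one mentioned in the original post, and the mirror URL in the commented line is an assumption.

```r
# Packages named in the post as dependencies that did or did not install
deps <- c("nloptr", "minqa", "Rcpp", "reshape2", "stringr", "scales",
          "BradleyTerry2", "car", "lme4", "quantreg", "RcppEigen")

# Return the subset of `pkgs` whose namespace cannot be loaded
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!ok])
}

still_missing <- missing_pkgs(deps)

# Uncomment to retry only the failures from an explicit mirror (assumed URL)
# install.packages(still_missing, repos = "https://cloud.r-project.org")
```

Installing the failures one at a time usually makes the first real error message easier to spot than a single install.packages("caret") call.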

On Sat, Apr 4, 2015 at 11:11 AM, John Kane jrkrid...@inbox.com wrote:

Try installing from somewhere outside of RStudio or reboot and retry in
RStudio. I find that if RStudio is open for a long time I occasionally get
some weird (buggy?) results but I cannot reproduce them to send in a bug report.

Load R and from the command line or Windows RGui try installing. As a
test I just installed it successfully with the command
install.packages("caret") executed in R (using gedit with its
R-plug-in) and running Ubuntu 14.04.

For future reference:
Reproducibility
https://github.com/hadley/devtools/wiki/Reproducibility

http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

John Kane
Kingston ON Canada

-Original Message-
From: wyl...@ischool.utexas.edu
Sent: Fri, 03 Apr 2015 16:07:57 -0500
To: r-help@r-project.org
Subject: [R] Repeated failures to install caret package (of Max Kuhn)

For an edX course, MIT's The Analytics Edge, I need to install the
caret package that was originated and is maintained by Dr. Max Kuhn of
Pfizer. So far, every effort I've made to try to
install.packages("caret") has failed. (I'm using R v. 3.1.3 and RStudio
v. 0.98.1103 in LinuxMint 17.1)

Here are some of the things I've tried unsuccessfully:
install.packages("caret", repos=c("http://rstudio.org/_packages",
"http://cran.rstudio.com"))
install.packages("caret", dependencies=TRUE)
install.packages("caret", repos=c("http://rstudio.org/_packages",
"http://cran.rstudio.com"), dependencies=TRUE)
install.packages("caret", dependencies = c("Depends", "Suggests"))
install.packages("caret", repos="http://cran.rstudio.com/")

I've changed my CRAN mirror from UCLA to Revolution Analytics in Dallas,
and tried the above installs again, unsuccessfully.

I've succeeded in individually installing a number of packages on which
caret appears to be dependent. Specifically, I've been able to
install nloptr, minqa, Rcpp, reshape2, stringr, and
scales. But I've had no success with trying to do individual installs
of BradleyTerry2, car, lme4, quantreg, and RcppEigen.

Any suggestions will be very gratefully received (and tried out quickly).

Thanks in advance.

Ron Wyllys


Re: [R] #library(CHAID) - Cross validation for chaid

2015-01-05 Thread Max Kuhn

You can create your own:

   http://topepo.github.io/caret/custom_models.html

I put a prototype together. Source this file:

   https://github.com/topepo/caret/blob/master/models/files/chaid.R

then try this:

library(caret)
library(CHAID)

### fit tree to subsample
set.seed(290875)
USvoteS <- USvote[sample(1:nrow(USvote), 1000),]


## You probably don't want to use `train.formula` as
## it will convert the factors to dummy variables
mod <- train(x = USvoteS[,-1], y = USvoteS$vote3,
             method = modelInfo,
             trControl = trainControl(method = "cv"))

Max

On Mon, Jan 5, 2015 at 7:11 AM, Rodica Coderie via R-help
r-help@r-project.org wrote:
 Hello,

 Is there an option of cross validation for CHAID decision tree? An example of
 CHAID is below:
 library(CHAID)
 example("chaid", package = "CHAID")

 How can I use a 10 fold cross-validation for CHAID?
 I've read that the caret package can cross-validate many types of models,
 but the CHAID model is not in caret's built-in library.

 library(caret)
 model <- train(vote3 ~., data = USvoteS, method='CHAID',
 tuneLength=10,trControl=trainControl(method='cv', number=10, classProbs=TRUE,
 summaryFunction=twoClassSummary))

 Thanks,
 Rodica


Re: [R] Help with caret, please

2014-10-11 Thread Max Kuhn

What you are asking is a bad idea on multiple levels. You will grossly
over-estimate the area under the ROC curve. Consider the 1-NN model: you
will have perfect predictions every time.

To do this, you will need to run train again and modify the index and
indexOut objects:

library(caret)

  set.seed(1)
  dat <- twoClassSim(200)

  set.seed(2)
  folds <- createFolds(dat$Class, returnTrain = TRUE)

  Control <- trainControl(method = "cv",
                          summaryFunction = twoClassSummary,
                          classProbs = TRUE,
                          index = folds,
                          indexOut = folds)

  tGrid <- data.frame(k = 1:100)

  set.seed(3)
  a_bad_idea <- train(Class ~ ., data = dat,
                      method = "knn",
                      tuneGrid = tGrid,
                      trControl = Control, metric = "ROC")
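
The 1-NN remark can be checked without caret. The sketch below uses a hand-rolled nearest-neighbour function (`one_nn` is a made-up helper, not part of any package) on labels that are pure noise. Scoring the training rows themselves always finds each row as its own nearest neighbour, while an honest hold-out does not.

```r
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(sample(c("Class1", "Class2"), n, replace = TRUE))  # pure noise labels

# Predict each row of `newx` with the label of its nearest training row
one_nn <- function(trainx, trainy, newx) {
  idx <- apply(newx, 1, function(p) {
    which.min(colSums((t(trainx) - p)^2))
  })
  trainy[idx]
}

# Resubstitution: the "hold-out" rows are the training rows themselves
resub_acc <- mean(one_nn(x, y, x) == y)

# Honest hold-out: train on the first half, predict the second half
train_id <- 1:50
holdout_acc <- mean(one_nn(x[train_id, ], y[train_id], x[-train_id, ]) == y[-train_id])
```

Here `resub_acc` is perfect even though the labels carry no signal, which is the over-estimation that training-set performance produces.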

Max

On Sat, Oct 11, 2014 at 7:58 PM, Iván Vallés Pérez 
ivanvallespe...@gmail.com wrote:

 Hello,

 I am using the caret package to train a K-Nearest Neighbors algorithm.
 For this, I am running this code:

 Control <- trainControl(method="cv", summaryFunction=twoClassSummary,
 classProb=T)

 tGrid=data.frame(k=1:100)

 trainingInfo <- train(Formula, data=trainData, method = "knn",
   tuneGrid=tGrid, trControl=Control, metric = "ROC")

 As you can see, I am interested in obtaining the AUC of the ROC curve.
 This code works well but returns the testing error (which the algorithm
 uses for tuning the k parameter of the model) as the mean of the error of
 the cross-validation folds. In addition to the testing error, I would like
 it to return the training error (the mean across each fold of the error
 obtained with the training data). How can I do it?

 Thank you

Re: [R] Training a model using glm

2014-09-17 Thread Max Kuhn

You have not shown all of your code and it is difficult to diagnose the
issue.

I assume that you are using the data from:

   library(AppliedPredictiveModeling)
   data(AlzheimerDisease)

If so, there is example code to analyze these data in that package. See
?scriptLocation.

We have no idea how you got to the `training` object (package versions
would be nice too).

I suspect that Dennis is correct. Try using more normal syntax without the
$ indexing in the formula. I wouldn't say it is (absolutely) wrong but it
doesn't look right either.
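
The effect of `$` indexing inside a formula can be reproduced with plain glm(), outside caret. This is a sketch on the built-in mtcars data (an arbitrary stand-in for the Alzheimer's data, with an arbitrary train/test split): when the formula names `train_df$wt`, predict() keeps finding the training column and ignores newdata.

```r
train_df <- mtcars[1:22, ]
test_df  <- mtcars[23:32, ]

# Variables hard-wired to `train_df` via `$`: predict() cannot swap in newdata
bad  <- glm(train_df$am ~ train_df$wt, family = binomial)
# Plain variable names resolved through `data =`
good <- glm(am ~ wt, data = train_df, family = binomial)

# The first call warns that 'newdata' had 10 rows but variables found have 22
n_bad  <- length(suppressWarnings(predict(bad, newdata = test_df)))
n_good <- length(predict(good, newdata = test_df))
```

`n_bad` equals the number of training rows and `n_good` the number of test rows, which mirrors the "'newdata' had 82 rows but variables found have 251 rows" message in the original question.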

Max


On Wed, Sep 17, 2014 at 2:04 PM, Mohan Radhakrishnan 
radhakrishnan.mo...@gmail.com wrote:

 Hi Dennis,

  Why is there that warning ? I think my syntax is
 right. Isn't it not? So the warning can be ignored ?

 Thanks,
 Mohan

 On Wed, Sep 17, 2014 at 9:48 PM, Dennis Murphy djmu...@gmail.com wrote:

  No reproducible example (i.e., no data) supplied, but the following
  should work in general, so I'm presuming this maps to the caret
  package as well. Thoroughly untested.
 
  library(caret)    # something you failed to mention
 
  ...
  modelFit <- train(diagnosis ~ ., data = training1)    # presumably a logistic regression
  confusionMatrix(test1$diagnosis, predict(modelFit, newdata = test1,
                                           type = "response"))
 
  For GLMs, there are several types of possible predictions. The default
  is 'link', which associates with the linear predictor. caret may have
  a different syntax so you should check its help pages re the supported
  predict methods.
 
  Hint: If a function takes a data = argument, you don't need to specify
  the variables as components of the data frame - the variable names are
  sufficient. You should also do some reading to understand why the
  model formula I used is correct if you're modeling one variable as
  response and all others in the data frame as covariates.
 
  Dennis
 
  On Tue, Sep 16, 2014 at 11:15 PM, Mohan Radhakrishnan
  radhakrishnan.mo...@gmail.com wrote:
   I answered this question, which was part of the online course, correctly by
   executing some commands and guessing.
  
   But I didn't get the gist of this approach though my R code works.
  
   I have a training and test dataset.
  
   nrow(training)
   [1] 251

   nrow(testing)
   [1] 82

   head(training1)
      diagnosis    IL_11    IL_13    IL_16   IL_17E IL_1alpha      IL_3     IL_4
   6   Impaired 6.103215 1.282549 2.671032 3.637051 -8.180721 -3.863233 1.208960
   10  Impaired 4.593226 1.269463 3.476091 3.637051 -7.369791 -4.017384 1.808289
   11  Impaired 6.919778 1.274133 2.154845 4.749337 -7.849364 -4.509860 1.568616
   12  Impaired 3.218759 1.286356 3.593860 3.867347 -8.047190 -3.575551 1.916923
   13  Impaired 4.102821 1.274133 2.876338 5.731246 -7.849364 -4.509860 1.808289
   16  Impaired 4.360856 1.278484 2.776394 5.170380 -7.662778 -4.017384 1.547563
            IL_5       IL_6 IL_6_Receptor     IL_7     IL_8
   6  -0.4004776  0.1856864   -0.51727788 2.776394 1.708270
   10  0.1823216 -1.5342758    0.09668586 2.154845 1.701858
   11  0.1823216 -1.0965412    0.35404039 2.924466 1.719944
   12  0.3364722 -0.3987186    0.09668586 2.924466 1.675557
   13  0.0000000  0.4223589   -0.53219115 1.564217 1.691393
   16  0.2623643  0.4223589    0.18739989 1.269636 1.705116
   The testing dataset is similar with 13 columns. Number of rows vary.
  
  
   training1 <- training[,grepl("^IL|^diagnosis",names(training))]

   test1 <- testing[,grepl("^IL|^diagnosis",names(testing))]

   modelFit <- train(training1$diagnosis ~ training1$IL_11 + training1$IL_13 +
     training1$IL_16 + training1$IL_17E + training1$IL_1alpha + training1$IL_3 +
     training1$IL_4 + training1$IL_5 + training1$IL_6 + training1$IL_6_Receptor
     + training1$IL_7 + training1$IL_8,method="glm",data=training1)

   confusionMatrix(test1$diagnosis,predict(modelFit, test1))

   I get this error when I run the above command to get the confusion matrix.

   *'newdata' had 82 rows but variables found have 251 rows '*

   I thought this was simple. I train a model using the training dataset and
   predict using the test dataset and get the accuracy.
  
   Am I missing the obvious here ?
  
   Thanks,
  
   Mohan
  
Re: [R] Use of library(X) in the code of library X.

2014-06-06 Thread Max Kuhn

That is legacy code but there was a good reason back then.

caret is written to use parallel processing via the foreach package.
There were some cases where the worker processes did not load the
required packages (even when I used foreach's .packages argument) so
I would do it explicitly. I don't recall which parallel backend had
the issue.

The more important lesson is that if you want to understand some R
code written by others you'll learn more bad habits than good ones if
you examine my packages…

Max

On Fri, Jun 6, 2014 at 2:42 PM, Duncan Murdoch murdoch.dun...@gmail.com wrote:
 On 06/06/2014 10:26 AM, Bart Kastermans wrote:

 To improve my R skills I try to understand some R code written by others.
 Mostly I am looking at the code of packages I use.  Today I looked at the
 code for the caret package

 http://cran.r-project.org/src/contrib/caret_6.0-30.tar.gz

 in particular at the file R/adaptive.R

 This file starts with:

 adaptiveWorkflow <- function(x, y, wts, info, method, ppOpts, ctrl, lev,
   metric, maximize, testing = FALSE, ...) {
   library(caret)
   loadNamespace("caret")

  From ?library and googling I can’t figure out what this code would do.

 Why would you call library(caret) in the caret package?


 I don't know that package, and since adaptiveWorkflow is not documented at
 the user level, I can't tell exactly what the author had in mind.  However,
 code like that could be present for debugging purposes (and is
 unintentionally present in the CRAN copy), or could be intentional.  The
 library(caret) call has the effect of ensuring that the package is on the
 search list.  (It might have been loaded invisibly by another package.)
 This is generally considered to be bad form nowadays; packages should
 function properly without being on the search list.

 I can't think of a situation where loadNamespace() would do anything --- it
 would have been called by library().
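
The difference between a package being loaded and being on the search list can be seen with a package that ships with R. This sketch uses `tools` purely as a stand-in for caret. The variable names are made up for illustration.

```r
# `tools` ships with R and is normally not attached in a fresh session
loadNamespace("tools")
loaded_only <- isNamespaceLoaded("tools")
attached_before <- "package:tools" %in% search()

# library() loads the namespace (if needed) and also attaches the package
library(tools)
attached_after <- "package:tools" %in% search()
```

In a fresh session `attached_before` would typically be FALSE while `loaded_only` is TRUE. After library() the package is on the search list, so a later loadNamespace() call has nothing left to do.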

 Duncan Murdoch



Re: [R] cforest sampling methods

2014-03-19 Thread Max Kuhn

You might look at the 'bag' function in the caret package. It will not
do the subsampling of variables at each split but you can bag a tree
and down-sample the data at each iteration. The help page has an
example of bagging ctree (although you might want to play with the tree
depth a little).
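
The down-sampling step itself is simple to sketch in base R. The code below is not caret's `bag` implementation. It only shows one way to draw a class-balanced bootstrap sample that a tree could then be fit to at each bagging iteration. The outcome vector and the helper name `balanced_sample` are hypothetical.

```r
set.seed(42)
# Hypothetical outcome with 1% positives
y <- factor(c(rep("pos", 100), rep("neg", 9900)))

# Draw `n_per_class` row indices from every class, with replacement
balanced_sample <- function(y, n_per_class) {
  unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), n_per_class, replace = TRUE)]
  }), use.names = FALSE)
}

idx <- balanced_sample(y, n_per_class = 100)
class_counts <- table(y[idx])
```

Calling `balanced_sample()` once per tree and fitting each tree to `train[idx, ]` would play the role that sampsize = c(100, 100) plays in randomForest.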

Max

On Wed, Mar 19, 2014 at 3:32 PM, Maggie Makar maggieyma...@gmail.com wrote:
 Hi all,

 I've been using the randomForest package and I'm trying to make the switch
 over to party. My problem is that I have an extremely unbalanced outcome
 (only 1% of the data has a positive outcome) which makes resampling methods
 necessary.

 randomForest has a very useful argument, sampsize, which allows me to
 use a balanced subsample to build each tree in my forest. Let's say the
 number of positive cases is 100; my forest would look something like this:

 rf <- randomForest(y~. ,data=train, ntree=800,replace=TRUE,sampsize = c(100,
 100))

 so I use 100 cases and 100 controls to build each individual tree. Can I do
 the same for cforests? I know I can always upsample but I'd rather not.

 I've tried playing around with the weights argument but I'm either not
 getting it right or it's just the wrong thing to use.

 Any advice on how to adapt cforests to datasets with imbalanced outcomes is
 greatly appreciated...



 Thanks!


Re: [R] how is the model resample performance calculated by caret?

2014-02-28 Thread Max Kuhn

On Fri, Feb 28, 2014 at 1:13 AM, zhenjiang zech xu
zhenjiang...@gmail.com wrote:
 Dear all,

 I did a 5-repeat of 10-fold cross validation using partial least square
 regression model provided by caret package. Can anyone tell me how are the
 values in plsTune$resample calculated? Is that predicted on each hold-out
 set using the model which is trained on the rest data with the optimized
 parameter tuned from previous cross validation?

Yes, those values are the performance estimates across each hold-out
using the final model. There is an option in trainControl() that will
have it return the resamples from all models too.

 So in the following
 example, firstly, 5-repeat of 10-fold cross validation gives 2 for ncomp as
 the best, and then using ncomp of 2 and the training data to build a model
 and then predict the hold-out data with the model to give a RMSE and
 RSQUARE - is what I am thinking true?

It is.
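
To connect the per-resample values to the summary table, the resampled RMSE and R-squared can be averaged by hand. The sketch below uses only the first five rows of plsTune$resample from the original post, so its numbers will not match the printed table. With all 50 rows the mean RMSE should correspond to the ncomp = 2 row. The trainControl() option mentioned above is presumably returnResamp = "all".

```r
# First five rows of plsTune$resample as quoted in the post
resamples <- data.frame(
  RMSE = c(13.61569, 16.02091, 12.59985, 13.20069, 12.43419),
  Rsquared = c(0.6349700, 0.5808985, 0.6008357, 0.6296245, 0.6560434),
  Resample = c("Fold06.Rep4", "Fold05.Rep1", "Fold03.Rep5",
               "Fold02.Rep3", "Fold04.Rep2")
)

# Mean and standard deviation across hold-out sets, one column per metric
summary_stats <- sapply(resamples[c("RMSE", "Rsquared")],
                        function(v) c(mean = mean(v), sd = sd(v)))
```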

Max



 plsTune
 524 samples
 615 predictors

 Pre-processing: centered, scaled
 Resampling: Cross-Validation (10 fold, repeated 5 times)

 Summary of sample sizes: 472, 472, 471, 471, 471, 471, ...

 Resampling results across tuning parameters:

   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
   1      16.8  0.434     1.47     0.0616
   2      14.3  0.612     2.21     0.0768
   3      13.5  0.704     6.33     0.145
   4      14.6  0.706     9.29     0.163
   5      15.2  0.703     10.9     0.172
   6      16.5  0.69      13.4     0.181
   7      18.4  0.672     17.8     0.194
   8      20    0.651     20.4     0.199
   9      20.9  0.634     20.9     0.199
   10     22.1  0.613     22.1     0.197
   11     23.3  0.599     23.8     0.198
   12     24    0.588     24.7     0.198
   13     24.9  0.572     25.2     0.197
   14     25.8  0.557     26.2     0.194
   15     26.2  0.544     25.8     0.191
   16     26.6  0.532     25.5     0.187

 RMSE was used to select the optimal model using  the one SE rule.
 The final value used for the model was ncomp = 2.

 plsTune$resample
    ncomp RMSE     Rsquared  Resample
 1  2 13.61569 0.6349700 Fold06.Rep4
 2  2 16.02091 0.5808985 Fold05.Rep1
 3  2 12.59985 0.6008357 Fold03.Rep5
 4  2 13.20069 0.6296245 Fold02.Rep3
 5  2 12.43419 0.6560434 Fold04.Rep2
 6  2 15.36510 0.5954177 Fold04.Rep5
 7  2 12.70028 0.6894489 Fold03.Rep2
 8  2 13.34882 0.6468300 Fold09.Rep3
 9  2 14.80217 0.5575010 Fold08.Rep3
 10 2 19.03705 0.4907630 Fold05.Rep4
 11 2 14.26704 0.6579390 Fold10.Rep2
 12 2 13.79060 0.5806663 Fold05.Rep3
 13 2 14.83641 0.5918039 Fold05.Rep2
 14 2 12.48721 0.7011439 Fold01.Rep3
 15 2 14.98765 0.5866102 Fold07.Rep4
 16 2 10.88100 0.7597167 Fold06.Rep1
 17 2 13.60705 0.6321377 Fold08.Rep5
 18 2 13.42618 0.6136031 Fold08.Rep4
 19 2 13.26066 0.6784586 Fold07.Rep1
 20 2 13.20623 0.6812341 Fold03.Rep3
 21 2 18.54275 0.4404729 Fold08.Rep2
 22 2 11.80312 0.7177681 Fold05.Rep5
 23 2 18.56271 0.4661072 Fold03.Rep1
 24 2 13.54879 0.5850439 Fold10.Rep3
 25 2 14.10859 0.5994811 Fold06.Rep5
 26 2 13.68329 0.6701091 Fold01.Rep5
 27 2 16.12123 0.5401200 Fold10.Rep1
 28 2 12.92250 0.6917220 Fold06.Rep3
 29 2 12.94366 0.6400066 Fold06.Rep2
 30 2 12.39889 0.6790578 Fold01.Rep2
 31 2 13.48499 0.6759649 Fold01.Rep1
 32 2 12.52938 0.6728476 Fold03.Rep4
 33 2 16.43352 0.5795160 Fold09.Rep5
 34 2 12.53991 0.6550694 Fold09.Rep4
 35 2 12.78708 0.6304606 Fold08.Rep1
 36 2 13.97559 0.6655688 Fold04.Rep3
 37 2 15.31642 0.5124997 Fold09.Rep2
 38 2 15.24194 0.5324943 Fold09.Rep1
 39 2 12.90107 0.6318960 Fold04.Rep1
 40 2 13.59574 0.6277869 Fold01.Rep4
 41 2 19.73633 0.4154821 Fold07.Rep5
 42 2 12.03759 0.6537381 Fold02.Rep5
 43 2 15.47139 0.5597097 Fold02.Rep4
 44 2 22.55060 0.3816672 Fold07.Rep3
 45 2 14.57875 0.6269560 Fold07.Rep2
 46 2 13.02385 0.6395148 Fold02.Rep2
 47 2 13.81020 0.6116137 Fold02.Rep1
 48 2 13.46100 0.6200828 Fold04.Rep4
 49 2 13.95487 0.6709253 Fold10.Rep5
 50 2 12.65981 0.6606435 Fold10.Rep4

 Best,
 Zhenjiang


Re: [R] boxcox alternative

2014-02-24 Thread Max Kuhn

Michael,

On Mon, Feb 24, 2014 at 5:51 AM, Michael Haenlein
haenl...@escpeurope.eu wrote:

 Dear all,

 I am working with a set of variables that are very non-normally
 distributed. To improve the performance of my model, I'm currently applying
 a boxcox transformation to them. While this improves things, the
 performance is still not great.


Are these predictors that you are transforming?

 So my question: Are there any alternatives to boxcox in R? I would need a
 model that estimates the best transformation automatically without input
 from the user since my approach should be flexible enough to deal with any
 kind of distribution. boxcox allows me to do this by picking the lambda
 that leads to the best fit but I wonder whether there are other options
 out there.


If they are predictors, caret has a function called 'preProcess' that
might interest you. See:

   http://caret.r-forge.r-project.org/preprocess.html#trans
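
For a sense of what estimating a transformation automatically involves, here is a base-R sketch of a Box-Cox lambda estimate for a single positive variable, found by maximising the profile log-likelihood. This is not how preProcess is implemented. The function names `boxcox_loglik` and `estimate_lambda` and the search interval are assumptions made for illustration.

```r
# Profile log-likelihood of the Box-Cox parameter for a positive vector `y`
boxcox_loglik <- function(lambda, y) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Pick the lambda that maximises the profile log-likelihood
estimate_lambda <- function(y, interval = c(-2, 2)) {
  optimize(boxcox_loglik, interval = interval, y = y, maximum = TRUE)$maximum
}

set.seed(10)
y <- rlnorm(500)              # log-normal data, so lambda near 0 is expected
lambda_hat <- estimate_lambda(y)
```

Applying `estimate_lambda()` column by column gives one lambda per predictor without user input. Variables with zeros or negative values need a different family, such as the Yeo-Johnson transformation.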

Max


Re: [R] Predictor Importance in Random Forests and bootstrap

2014-01-28 Thread Max Kuhn

I think that the fundamental problem is that you are using the default
value of ntree (500). You should always use at least 1500 and more if n or
p are large.

Also, this link will give you more up-to-date information on that package
and feature selection:

http://caret.r-forge.r-project.org/featureSelection.html

Max


On Tue, Jan 28, 2014 at 5:32 PM, Dimitri Liakhovitski 
dimitri.liakhovit...@gmail.com wrote:

 Here is a great response I got from SO:

 There is an important difference between the two importance measures:
 MeanDecreaseAccuracy is calculated using out of bag (OOB) data,
 MeanDecreaseGini is not. For each tree MeanDecreaseAccuracy is calculated
 on observations not used to form that particular tree. In contrast,
 MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are.
 It is calculated using the same data used to fit trees.

 When you bootstrap data, you are creating multiple copies of the same
 observations. Therefore the same observation can be split into two copies,
 one to form a tree, and one treated as OOB and used to calculate accuracy
 measures. Therefore, data that randomForest thinks is OOB for
 MeanDecreaseAccuracy is not necessarily truly OOB in your bootstrap sample,
 making the estimate of MeanDecreaseAccuracy overly optimistic in the
 bootstrap iterations. Gini index is immune to this, because it is not
 relying on evaluating importance on observations different from those used
 to fit the data.
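
 The duplicate-observation argument can be checked with a short simulation in
 base R. This is a sketch with made-up names: it bootstraps row IDs once (the
 user-level bootstrap), then draws one more bootstrap (as a single tree would)
 and counts how many of that tree's "out-of-bag" rows are copies of
 observations that are also in-bag.

```r
set.seed(123)
N <- 1000
outer_ids <- sample.int(N, N, replace = TRUE)   # user-level bootstrap of the rows

# Inside randomForest, each tree draws its own bootstrap from those rows
in_bag_rows <- unique(sample.int(N, N, replace = TRUE))
oob_rows <- setdiff(seq_len(N), in_bag_rows)

# Share of "out-of-bag" rows whose original observation is also in-bag
twin_share <- mean(outer_ids[oob_rows] %in% outer_ids[in_bag_rows])
```

 A `twin_share` well above zero means a sizeable part of the supposedly
 out-of-bag data was effectively seen during fitting, which is the source of
 the optimism described above.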

 I suspect what you are trying to do is use the bootstrap to generate
 inference (p-values/confidence intervals) indicating which variables are
 important in the sense that they are actually predictive of your outcome.
 The bootstrap is not appropriate in this context, because Random Forests
 expects that OOB data is truly OOB and this is important for building the
 forest in the first place. In general, bootstrap is not universally
 applicable, and is only useful in cases where it can be shown that the
 parameter you're estimating has nice asymptotic properties and is not
 sensitive to ties in the data. A procedure like Random Forest which
 relies on the availability of OOB data is necessarily sensitive to ties.

 You may want to look at the caret package in R, which uses random forest
 (or one of a set of many other algorithms) inside a cross-validation loop
 to determine which variables are consistently important. See:




 http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf


 On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski 
 dimitri.liakhovit...@gmail.com wrote:

  Thank you, Bert. I'll definitely ask there.
  In the meantime I just wanted to ensure that my R code (my function for
  bootstrap and the bootstrap run) is correct and my abnormal bootstrap
  results are not a function of my erroneous code.
  Thank you!
 
 
 
  On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter gunter.ber...@gene.com wrote:
 
  I **think** this kind of methodological issue might be better at SO
  (stats.stackexchange.com).  It's not really about R programming, which
  is the main focus of this list. And yes, I know they do intersect.
  Nevertheless...
 
  Cheers,
  Bert
 
  Bert Gunter
  Genentech Nonclinical Biostatistics
  (650) 467-7374
 
  Data is not information. Information is not knowledge. And knowledge
  is certainly not wisdom.
  H. Gilbert Welch
 
 
 
 
  On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski
  dimitri.liakhovit...@gmail.com wrote:
   Hello!
   Below, I:
   1. Create a data set with a bunch of factors. All of them are predictors
   and 'y' is the dependent variable.
   2. Run a classification Random Forests run with predictor importance. I
   look at 2 measures of importance - MeanDecreaseAccuracy and
   MeanDecreaseGini.
   3. Run 2 bootstrap runs for the 2 Random Forests measures of importance
   mentioned above.

   Question: Could anyone please explain why I am getting such a huge positive
   bias across the board (for all predictors) for MeanDecreaseAccuracy?
  
   Thanks a lot!
   Dimitri
  
  
   #
   # Creating a data set:
   #

   N<-1000
   myset1<-c(1,2,3,4,5)
   probs1a<-c(.05,.10,.15,.40,.30)
   probs1b<-c(.05,.15,.10,.30,.40)
   probs1c<-c(.05,.05,.10,.15,.65)
   myset2<-c(1,2,3,4,5,6,7)
   probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
   probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
   probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
   myset.y<-c(1,2)
   probs.y<-c(.65,.30)

   set.seed(1)
   y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
   set.seed(2)
   a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
   set.seed(3)
   b<-as.factor(sample(myset1, N, replace = TRUE,probs1b))
   set.seed(4)
   c<-as.factor(sample(myset1, N, replace = TRUE,probs1c))
   set.seed(5)
   d<-as.factor(sample(myset2, N, replace = TRUE,probs2a))
   set.seed(6)
   e<-as.factor(sample(myset2, N, replace = TRUE,probs2b))

Re: [R] R crashes with memory errors on a 256GB machine (and system shoes only 60GB usage)

2014-01-02 Thread Max Kuhn

Describing the problem would help a lot more. For example, if you were
using some of the parallel processing options in R, this can make extra
copies of objects and drive memory usage up very quickly.

Max


On Thu, Jan 2, 2014 at 3:35 PM, Ben Bolker bbol...@gmail.com wrote:

 Xebar Saram zeltakc at gmail.com writes:

 
  Hi All,

  I have a terrible issue I can't seem to debug which is halting my work
  completely. I have R 3.02 installed on a linux machine (arch linux-latest)
  which I built specifically for running high memory use models. The system
  is a 16 core, 256 GB RAM machine. It worked well at the start but in
  recent days I keep getting errors and crashes regarding memory use, such as
  "cannot create vector size of XXX", "not enough memory", etc.

  When looking at top (linux system monitor) I see I barely scrape 60 GB
  of RAM (out of 256 GB).

  I really don't know how to debug this and my whole work is halted due to
  this, so any help would be greatly appreciated.

   I'm very sympathetic, but it will be almost impossible to debug
 this sort of a problem remotely, without a reproducible example.
 The only guess that I can make, if you *really* are running *exactly*
 the same code as you previously ran successfully, is that you might
 have some very large objects hidden away in a saved workspace in a
 .RData file that's being loaded automatically ...

   I would check whether gc(), memory.profile(), etc. give sensible results
 in a clean R session (R --vanilla).
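
 One way to look for large objects hiding in a workspace is to list object
 sizes, largest first. The helper below is a sketch (`object_sizes` is a
 made-up name) and is demonstrated on a scratch environment. Calling it with
 no arguments would inspect the global environment, including anything
 restored from a .RData file.

```r
# List objects in an environment, largest first, with sizes in megabytes
object_sizes <- function(env = globalenv()) {
  nms <- ls(env, all.names = TRUE)
  mb <- vapply(nms, function(nm) {
    as.numeric(object.size(get(nm, envir = env))) / 1024^2
  }, numeric(1))
  sort(mb, decreasing = TRUE)
}

# Demo on a scratch environment holding one large and one small object
scratch <- new.env()
assign("big", numeric(1e6), envir = scratch)
assign("small", 1:10, envir = scratch)
sizes <- object_sizes(scratch)

mem_summary <- gc()  # matrix summarising the memory currently used by R
```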

   Ben Bolker


Re: [R] Variable importance - ANN

2013-12-04 Thread Max Kuhn

If you are using the nnet package, the caret package has a variable
importance method based on Gevrey, M., Dimopoulos, I., & Lek, S. (2003).
Review and comparison of methods to study the contribution of variables in
artificial neural network models. Ecological Modelling, 160(3), 249-264. It
is based on the estimated weights.
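
As a rough illustration of a weights-based importance measure, here is a base-R sketch of one common formulation (a Garson-style partitioning of the connection weights, as I understand it from that paper). caret's own implementation may differ in details. The weight values, the single-output layout and the function name `weight_importance` are all hypothetical.

```r
# Hypothetical weights for a 3-input, 2-hidden-unit, 1-output network
# (bias terms omitted); rows = inputs, columns = hidden units
w_in <- matrix(c( 0.8, -0.2,
                 -1.5,  0.4,
                  0.0,  0.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("x1", "x2", "x3"), c("h1", "h2")))
w_out <- c(h1 = 1.2, h2 = -0.7)   # hidden-to-output weights

weight_importance <- function(w_in, w_out) {
  # Absolute input-hidden weight times absolute hidden-output weight
  p <- abs(w_in) * matrix(abs(w_out), nrow(w_in), ncol(w_in), byrow = TRUE)
  # Share of each input within every hidden unit
  q <- sweep(p, 2, colSums(p), "/")
  # Sum over hidden units, then rescale to percentages
  s <- rowSums(q)
  100 * s / sum(s)
}

imp <- weight_importance(w_in, w_out)
```

An input whose weights are all zero, such as `x3` here, receives zero importance, and the values sum to 100.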

Max


On Wed, Dec 4, 2013 at 6:41 AM, Giulia Di Lauro giulia.dila...@gmail.com wrote:

 Hi everybody,
 I created a neural network for a regression analysis with package ANN, but
 now I need to know which is the significance of each predictor variable in
 explaining the dependent variable. I thought to analyze the weight, but I
 don't know how to do it.

 Thanks in advance,
 Giulia Di Lauro.


Re: [R] Inconsistent results between caret+kernlab versions

2013-11-17 Thread Max Kuhn

Andrew,

 What I still don't quite understand is which accuracy values from train() I 
 should trust: those using classProbs=T or classProbs=F?

It depends on whether you need the class probabilities and class
predictions to match (which they would if classProbs = TRUE).

Another option is to use a model where this discrepancy does not exist.

  train often crashes with 'memory map' errors!)?

I've never seen that. You should describe it more.

Max


Re: [R] Inconsistent results between caret+kernlab versions

2013-11-15 Thread Max Kuhn

Or not!

The issue is with kernlab.

Background: SVM models do not naturally produce class probabilities. A
secondary model (via Platt) is fit to the raw model output and a
logistic function is used to translate the raw SVM output to
probability-like numbers (i.e. sum to one, between 0 and 1). In
ksvm(), you need to use the option prob.model = TRUE to get that
second model.

I discovered some time ago that there can be a discrepancy in the
predicted classes that naturally come from the SVM model and those
derived by using the class associated with the largest class
probability. This is most likely due to natural error in the secondary
probability model and should not be unexpected.

That is the case for your data. If you use the same tuning parameters
as those suggested by train() and go straight to ksvm():

> newSVM <- ksvm(x = as.matrix(df[,-1]),
+                y = df[,1],
+                kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
+                C = svm.m1$bestTune$.C,
+                prob.model = TRUE)

> predict(newSVM, df[43,-1])
[1] O32078
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
> predict(newSVM, df[43,-1], type = "probabilities")
         O27479     O31403    O32057    O32059     O32060    O32078
[1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
         O32089     O32663     O32668     O32676
[1,] 0.04890477 0.05210836 0.09838892 0.07284396

Note that, based on the probability model, the class with the largest
probability is O32057 (p = 0.24) while the basic SVM model predicts
O32078 (p = 0.16).

Somebody (maybe me) saw this discrepancy and that led me to follow this rule:

if(prob.model == TRUE) use the class with the maximum probability
   else use the class prediction from ksvm().
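
The rule can be sketched in base R using the numbers printed above. The helper `pick_class()` is a hypothetical name for illustration, not a caret function.

```r
# Class probabilities for sample 43, copied from the ksvm() output above.
probs <- c(O27479 = 0.08791826, O31403 = 0.05911645, O32057 = 0.2424997,
           O32059 = 0.1036943,  O32060 = 0.06968587, O32078 = 0.1648394,
           O32089 = 0.04890477, O32663 = 0.05210836, O32668 = 0.09838892,
           O32676 = 0.07284396)

# Class predicted directly by the SVM for the same sample.
raw_class <- "O32078"

# Hypothetical helper: use the most probable class when a probability
# model is available, otherwise fall back to the raw class prediction.
pick_class <- function(probs, raw_class, prob_model) {
  if (prob_model) names(probs)[which.max(probs)] else raw_class
}

pick_class(probs, raw_class, prob_model = TRUE)   # "O32057"
pick_class(probs, raw_class, prob_model = FALSE)  # "O32078"
```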

Therefore:

 predict(svm.m1, df[43,-1])
[1] O32057
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676

That change occurred between the two caret versions that you tested with.

(On a side note, this can also occur with ksvm() and rpart() if
cost-sensitive training is used because the class designation takes
into account the costs but the class probability predictions do not. I
alerted both package maintainers to the issue some time ago.)

HTH,

Max

On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn mxk...@gmail.com wrote:
 I've looked into this a bit and the issue seems to be with caret. I've
 been looking at the svn check-ins and nothing stands out to me as the
 issue so far. The final models that are generated are the same and
 I'll try to figure out the difference.

 Two small notes:

 1) you should set the seed to ensure reproducibility.
 2) you really shouldn't use character strings with all numbers as
 factor levels with caret when you want class probabilities. It should
 give you a warning about this.

 Max

 On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby andrewdi...@mac.com wrote:

 I'm using caret to assess classifier performance (and it's great!). However, 
 I've found that my results differ between R2.* and R3.* - reported 
 accuracies are reduced dramatically. I suspect that a code change to kernlab 
 ksvm may be responsible (see version 5.16-24 here: 
 http://cran.r-project.org/web/packages/caret/news.html). I get very 
 different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + 
 kernlab_0.9-19 (see below).

 Can anyone please shed any light on this?

 Thanks very much!


 ### To replicate:

 require(repmis)  # For downloading from https
 df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', 
 sep=',')
 require(caret)
 svm.m1 <- 
 train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tunelength=5,trControl=trainControl(method='repeatedcv',
  number=10, repeats=10, classProbs=TRUE))
 svm.m1
 sessionInfo()

 ### Results - R2.15.2

 svm.m1
 1241 samples
7 predictors
   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, 
 ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’

 No pre-processing
 Resampling: Cross-Validation (10 fold, repeated 10 times)

 Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...

 Resampling results across tuning parameters:

   C     Accuracy  Kappa  Accuracy SD  Kappa SD
   0.25  0.684     0.63   0.0353       0.0416
   0.5   0.729     0.685  0.0379       0.0445
   1     0.756     0.716  0.0357       0.0418

 Tuning parameter ‘sigma’ was held constant at a value of 0.247
 Kappa was used to select the optimal model using  the largest value.
 The final values used for the model were C = 1 and sigma = 0.247.
 sessionInfo()
 R version 2.15.2 (2012-10-26)
 Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

 locale:
 [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
  [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-17  repmis_0.2.4
 caret_5.15-61   reshape2_1.2.2  plyr_1.8lattice_0.20-10 
 foreach_1.4.0   cluster_1.14.3

 loaded

Re: [R] C50 Node Assignment

2013-11-09 Thread Max Kuhn

There is a sub-object called 'rules' that has the output of C5.0 for this model:

> library(C50)
> mod <- C5.0(Species ~ ., data = iris, rules = TRUE)
> cat(mod$rules)
id=See5/C5.0 2.07 GPL Edition 2013-11-09
entries=1
rules=4 default=setosa
conds=1 cover=50 ok=50 lift=2.94231 class=setosa
type=2 att=Petal.Length cut=1.9 result=
conds=3 cover=48 ok=47 lift=2.88 class=versicolor
type=2 att=Petal.Length cut=1.9 result=
type=2 att=Petal.Length cut=4.901 result=
type=2 att=Petal.Width cut=1.7 result=
conds=1 cover=46 ok=45 lift=2.875 class=virginica
type=2 att=Petal.Width cut=1.7 result=
conds=1 cover=46 ok=44 lift=2.8125 class=virginica
type=2 att=Petal.Length cut=4.901 result=

You would either have to parse this or parse the summary results:

> summary(mod)

Call:
C5.0.formula(formula = Species ~ ., data = iris, rules = TRUE)

<snip>
Rules:

Rule 1: (50, lift 2.9)
    Petal.Length <= 1.9
    ->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    ->  class versicolor  [0.960]
<snip>
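
A minimal sketch of the first option (parsing the rules text) in base R. It assumes that each rule in mod$rules starts with a header line made of key="value" pairs beginning with conds=; the sample string below is hand-written in that assumed layout rather than taken from a live model, and `get_field()` is a hypothetical helper.

```r
# Hand-written sample in the assumed key="value" layout of mod$rules.
rules_txt <- paste(
  'rules="2" default="setosa"',
  'conds="1" cover="50" ok="50" lift="2.94231" class="setosa"',
  'type="2" att="Petal.Length" cut="1.9" result="<"',
  'conds="1" cover="46" ok="45" lift="2.875" class="virginica"',
  'type="2" att="Petal.Width" cut="1.7" result=">"',
  sep = "\n")

# One header line per rule: keep only the lines that start with conds=.
lines   <- strsplit(rules_txt, "\n", fixed = TRUE)[[1]]
headers <- grep("^conds=", lines, value = TRUE)

# Hypothetical helper: pull the value of one key out of each header line.
get_field <- function(x, key) {
  sub(paste0('(^|.* )', key, '="([^"]*)".*'), "\\2", x)
}

rule_table <- data.frame(
  rule  = seq_along(headers),
  conds = as.integer(get_field(headers, "conds")),
  cover = as.numeric(get_field(headers, "cover")),
  class = get_field(headers, "class"),
  stringsAsFactors = FALSE)
rule_table
```

This only recovers the rule headers; matching each sample to a rule would still require evaluating the condition lines that follow each header.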

Max

On Sat, Nov 9, 2013 at 1:11 PM, Carl Witthoft c...@witthoft.com wrote:

 Just to clarify:  I'm guessing the OP is referring to the CRAN package C50
 here.   A quick skim suggests the rules are a list element of a C5.0-class
 object, so maybe that's where to start?


 David Winsemius wrote
 In my role as a moderator I am attempting to bypass the automatic mail
 filters that are blocking this posting. Please reply to the list and to:
 =
 Kevin Shaney <kevin.shaney@...>

 C50 Node Assignment

 I am using C50 to classify individuals into 5 groups / categories (factor
 variable).  The tree / set of rules has 10 rules for classification.  I am
 trying to extract the RULE for which each individual qualifies (a number
 between 1 and 10), and cannot figure out how to do so.  I can extract the
 predicted group and predicted group probability, but not the RULE to which
 an individual qualifies.  Please let me know if you can help!

 Kevin
 =


 --
 David Winsemius
 Alameda, CA, USA






 --
 View this message in context: 
 http://r.789695.n4.nabble.com/C50-Node-Assignment-tp4680071p4680127.html
 Sent from the R help mailing list archive at Nabble.com.




-- 

Max


Re: [R] Cross validation in R

2013-07-02 Thread Max Kuhn

 How do i make a loop so that the process could be repeated several time,
 producing randomly ROC curve and under ROC values?


Using the caret package

http://caret.r-forge.r-project.org/
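
To show what such a loop does, here is a base-R sketch of repeated 10-fold cross-validation that collects one area under the ROC curve per held-out fold. It is not caret code: the data are simulated, the model is a plain glm(), and `auc()` is a small helper written for the example (the rank-based formula). caret automates the same idea.

```r
set.seed(1)

# Simulated two-class data (hypothetical; replace with your own).
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

# Area under the ROC curve from the ranks of the predicted scores.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

repeats <- 5   # how many times the whole cross-validation is repeated
k <- 10        # number of folds
aucs <- numeric(0)

for (r in seq_len(repeats)) {
  # New random fold assignment for each repeat.
  fold <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    fit <- glm(y ~ x, data = dat[fold != f, ], family = binomial)
    p   <- predict(fit, newdata = dat[fold == f, ], type = "response")
    aucs <- c(aucs, auc(p, dat$y[fold == f]))
  }
}

mean(aucs)  # average held-out AUC across all folds and repeats
```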

--

Max


Re: [R] Error running caret's gbm train function with new version of caret

2013-05-06 Thread Max Kuhn

Katrina,

I made some changes to accommodate gbm's new feature for 3+ categories,
then had to harmonize how gbm and caret work together.

I have a new version of caret that is not released yet (maybe within a
month), but you should get it from:

   install.packages("caret", repos="http://R-Forge.R-project.org")

You may also need to upgrade gbm. That package page is:

   https://code.google.com/p/gradientboostedmodels/downloads/list

Let me know if you have any issues.

Max

On Sat, May 4, 2013 at 5:33 PM, Katrina Bennett kebenn...@alaska.edu wrote:
 I am running caret for model exploration. I developed my code a number of
 months ago and I've been running it with no issues. Recently, I updated my
 version of caret however, and now I am getting a new error. I'm wondering
 if this is due to the new release.

 The error I am getting is when I am running GBM.

 print(paste("calculating GBM for", i))
 #gbm runs over and over again
 set.seed(1)
 trainModelGBM <- train(trainClass3, trainAsym, "gbm", metric="RMSE",
 tuneLength = 5, trControl = con)

 The error I am getting is at the end of the run once all the iterations
 have been processed:
 Error in { :
   task 1 failed - arguments imply differing number of rows: 5, 121

 trainClass3 and trainAsym have 311 values in them. I'm using 5 variables in
 my matrix. I'm not sure where the 117 is coming from.

 I found solutions online that suggested that updating the versions of glmnet
 and Matrix, and doing something with cv.folds, would work. None of these
 solutions have worked for me.

 Here is my R session info.

 R version 2.15.1 (2012-06-22)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 caret version 5.15-61

 Thank you,

 Katrina




-- 

Max


Re: [R] C50 package in R

2013-04-26 Thread Max Kuhn

There isn't much out there. Quinlan didn't open source the code until about
a year ago.

I've been through the code line by line and we have a fairly descriptive
summary of the model in our book (that's almost out):

  http://appliedpredictivemodeling.com/

I will say that the pruning is mostly the same as described in Quinlan's
C4.5 book. The big differences in C4.5 and C5.0 are boosting and winnowing.
The former is very different mechanically than gradient boosting machines
and is more similar to the re-weighting approach of the original adaboost
algorithm (but is still pretty different).

I've submitted a talk on C5.0 for this year's UseR! conference. If there is
enough time I will be able to go through some of the technical details.

Two other related notes:

- the J48 implementation in Weka lacks one or two of C4.5's features, which
makes the results substantially different from what C4.5 would have
produced. The differences are significant enough that Quinlan asked us to
refer to the results of that function as J48 and not C4.5. Using C5.0 with
a single tree is much more similar to C4.5 than J48 is.

- the differences between model trees and Cubist are also substantial and
largely undocumented.

HTH,

Max




On Thu, Apr 25, 2013 at 9:40 AM, Indrajit Sen Gupta 
indrajit...@rediffmail.com wrote:

 Hi All,



 I am trying to use the C50 package to build classification trees in R.
 Unfortunately there is not enough documentation around its use. Can anyone
 explain to me - how to prune the decision trees?



 Regards,

 Indrajit






-- 

Max


Re: [R] odfWeave: Some questions about potential formatting options

2013-04-17 Thread Max Kuhn

Paul,

#1: I've never tried but you might be able to escape the required tags in
your text (e.g. in html you could write out the <b> in your text).

#3: Which output? Is this in text?

#2: It may be possible and maybe easy to implement. So if you want to dig
into it, have at it. For me, I'm completely buried in
the foreseeable future and won't be able to pay much attention to it. To be
honest, odfWeave has been fairly neglected by me and lately I've had
thoughts of orphaning the package :-/

Thanks,

Max



On Tue, Apr 16, 2013 at 1:15 PM, Paul Miller pjmiller...@yahoo.com wrote:

 Hi Milan and Max,

 Thanks to each of you for your reply to my post. Thus far, I've managed to
 find answers to some of the questions I asked initially.

 I am now able to control the justification of the leftmost column in my
 tables, as well as to add borders to the top and bottom. I also downloaded
 Milan's revised version of odfWeave at the link below, and found that it
 does a nice job of controlling column widths.

 http://nalimilan.perso.neuf.fr/transfert/odfWeave.tar.gz

 There are some other things I'm still struggling with though.

 1. Is it possible to get odfTableCaption and odfFigureCaption to make the
 titles they produce bold? I understand it might be possible to accomplish
 this by changing something in the styles but am not sure what. If someone
 can give me a hint, I can likely do the rest.

 2. Is there any way to get odfFigureCaption to put titles at the top of
 the figure instead of the bottom? I've noticed that odfTableCaption is able
 to do this but apparently not odfFigureCaption.

 3. Is it possible to add special characters to the output? Below is a
 sample Kaplan-Meier analysis. There's a footnote in there that reads Note:
 X2(1) = xx.xx, p = .. Is there any way to make the X a lowercase Chi
 and to superscript the 2? I did quite a bit of digging on this topic. It
 sounds like it might be difficult, especially if one is using Windows as I
 am.

 Thanks,

 Paul

 ##
 #### Get data ####
 ##

 #### Load packages ####

 require(survival)
 require(MASS)

 #### Sample analysis ####

 attach(gehan)
 gehan.surv <- survfit(Surv(time, cens) ~ treat, data= gehan, conf.type =
 "log-log")
 print(gehan.surv)

 survTable <- summary(gehan.surv)$table
 survTable <- data.frame(Treatment = rownames(survTable), survTable,
 row.names=NULL)
 survTable <- subset(survTable, select = -c(records, n.max))

 ##
 #### odfWeave ####
 ##

 #### Load odfWeave ####

 require(odfWeave)

 #### Modify StyleDefs ####

 currentDefs <- getStyleDefs()

 currentDefs$firstColumn$type <- "Table Column"
 currentDefs$firstColumn$columnWidth <- "5 cm"
 currentDefs$secondColumn$type <- "Table Column"
 currentDefs$secondColumn$columnWidth <- "3 cm"

 currentDefs$ArialCenteredBold$fontSize <- "10pt"
 currentDefs$ArialNormal$fontSize <- "10pt"
 currentDefs$ArialCentered$fontSize <- "10pt"
 currentDefs$ArialHighlight$fontSize <- "10pt"

 currentDefs$ArialLeftBold <- currentDefs$ArialCenteredBold
 currentDefs$ArialLeftBold$textAlign <- "left"

 currentDefs$cgroupBorder <- currentDefs$lowerBorder
 currentDefs$cgroupBorder$topBorder <- "0.0007in solid #00"

 setStyleDefs(currentDefs)

 #### Modify ImageDefs ####

 imageDefs <- getImageDefs()
 imageDefs$dispWidth <- 5.5
 imageDefs$dispHeight <- 5.5
 setImageDefs(imageDefs)

 #### Modify Styles ####

 currentStyles <- getStyles()
 currentStyles$figureFrame <- "frameWithBorders"
 setStyles(currentStyles)

 #### Set odt table styles ####

 tableStyles <- tableStyles(survTable, useRowNames = FALSE, header = "")
 tableStyles$headerCell[1,] <- "cgroupBorder"
 tableStyles$header[,1] <- "ArialLeftBold"
 tableStyles$text[,1] <- "ArialNormal"
 tableStyles$cell[2,] <- "lowerBorder"

 #### Weave odt source file ####

 fp <- "N:/Studies/HCRPC1211/Report/odfWeaveTest/"
 inFile <- paste(fp, "testWeaveIn.odt", sep="")
 outFile <- paste(fp, "testWeaveOut.odt", sep="")
 odfWeave(inFile, outFile)

 ##
 #### Contents of .odt source file ####
 ##

 Here is a sample Kaplan-Meier table.

 <<testKMTable, echo=FALSE, results = xml>>=
 odfTableCaption("A Sample Kaplan-Meier Analysis Table")
 odfTable(survTable, useRowNames = FALSE, digits = 3,
 colnames = c("Treatment", "Number", "Events", "Median", "95% LCL",
 "95% UCL"),
 colStyles = c("firstColumn", "secondColumn", "secondColumn",
 "secondColumn", "secondColumn", "secondColumn"),
 styles = tableStyles)
 odfCat("Note: X2(1) = xx.xx, p = .")
 @

 Here is a sample Kaplan-Meier graph.

 <<testKMFig, echo=FALSE, fig = TRUE>>=
 odfFigureCaption("A Sample Kaplan-Meier Analysis Graph", label = "Figure")
 plot(gehan.surv, xlab = "Time", ylab= "Survivorship")
 @





-- 

Max


Re: [R] Parallelizing GBM

2013-03-24 Thread Max Kuhn

See this:

   https://code.google.com/p/gradientboostedmodels/issues/detail?id=3

and this:


https://code.google.com/p/gradientboostedmodels/source/browse/?name=parallel


Max


On Sun, Mar 24, 2013 at 7:31 AM, Lorenzo Isella lorenzo.ise...@gmail.comwrote:

 Dear All,
 I am far from being a guru about parallel programming.
 Most of the time, I rely on randomForest for data mining large datasets.
 I would like to give a try also to the gradient boosted methods in GBM,
 but I have a need for parallelization.
 I normally rely on gbm.fit for speed reasons, and I usually call it this
 way



 gbm_model <- gbm.fit(trainRF,prices_train,
 offset = NULL,
 misc = NULL,
 distribution = "multinomial",
 w = NULL,
 var.monotone = NULL,
 n.trees = 50,
 interaction.depth = 5,
 n.minobsinnode = 10,
 shrinkage = 0.001,
 bag.fraction = 0.5,
 nTrain = (n_train/2),
 keep.data = FALSE,
 verbose = TRUE,
 var.names = NULL,
 response.name = NULL)


 Does anybody know an easy way to parallelize the model (in this case it
 means simply having 4 cores on the same machine working on the problem)?
 Any suggestion is welcome.
 Cheers

 Lorenzo





-- 

Max


Re: [R] CARET and NNET fail to train a model when the input is high dimensional

2013-03-06 Thread Max Kuhn

James,

I did a fresh install from CRAN to get caret_5.15-61 and ran your code with
method.name = "nnet" and grid.len = 3.

I don't get an error, although there were issues:

   In nominalTrainWorkflow(dat = trainData, info = trainInfo,  ... :
 There were missing values in resampled performance measures.

The results had:

Resampling results across tuning parameters:

  size  decay  ROC    Sens   Spec   ROC SD   Sens SD  Spec SD
  1     0      0.521  0.52   0.521  0.0148   0.0312   0.00901
  1     1e-04  0.513  0.528  0.498  0.00616  0.00386  0.00552
  1     0.1    0.515  0.522  0.514  0.0169   0.0284   0.0426
  3     0      NaN    NaN    NaN    NA       NA       NA
  3     1e-04  NaN    NaN    NaN    NA       NA       NA
  3     0.1    NaN    NaN    NaN    NA       NA       NA
  5     0      NaN    NaN    NaN    NA       NA       NA
  5     1e-04  NaN    NaN    NaN    NA       NA       NA
  5     0.1    NaN    NaN    NaN    NA       NA       NA

To test more, I ran:

> test <- nnet(trX, trY, size = 3, decay = 0)
   Error in nnet.default(trX, trY, size = 3, decay = 0) :
 too many (2107) weights

So, you need to pass in MaxNWts to nnet() with a value that lets you fit
the model. Off the top of my head, you could use something like:

   MaxNWts  = length(levels(trY))*(max(my.grid$.size) * (nCol + 1) +
max(my.grid$.size) + 1)
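
For reference, the 2107 in the error can be reproduced by counting the weights of a single-hidden-layer network without skip-layer connections. The helper name `nnet_n_weights()` is made up for this sketch; it is not part of nnet or caret.

```r
# Hypothetical helper: weights in a single-hidden-layer network with
# n_inputs predictors, `size` hidden units and n_outputs output units
# (each unit also has a bias term; no skip-layer connections).
nnet_n_weights <- function(n_inputs, size, n_outputs = 1) {
  size * (n_inputs + 1) + n_outputs * (size + 1)
}

# 700 predictor columns, 3 hidden units, one output unit.
nnet_n_weights(700, 3)  # 2107
```

Any MaxNWts at least this large for the biggest size in the tuning grid should let the fit go ahead; the formula above is a more generous bound.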

Also, this is one of the methods for getting help (the other is to just email
me). I try to keep up on stack exchange too.

Max



On Tue, Mar 5, 2013 at 9:47 PM, James Jong ribonucle...@gmail.com wrote:

 The following code fails to train a nnet model in a random dataset using
 caret:

 nR <- 700
 nCol <- 2000
   myCtrl <- trainControl(method="cv", number=3, preProcOptions=NULL,
 classProbs = TRUE, summaryFunction = twoClassSummary)
   trX <- data.frame(replicate(nR, rnorm(nCol)))
   trY <- runif(1)*trX[,1]*trX[,2]^2+runif(1)*trX[,3]/trX[,4]
   trY <- as.factor(ifelse(sign(trY)>0,'X1','X0'))
   my.grid <- createGrid(method.name, grid.len, data=trX)
   my.model <- train(trX,trY,method=method.name
 ,trace=FALSE,trControl=myCtrl,tuneGrid=my.grid,
 metric="ROC")
   print("Done")

 The error I get is:
 task 2 failed - arguments imply differing number of rows: 1334, 666

 However, everything works if I reduce nR to, say 20.

 Any thoughts on what may be causing this? Is there a place where I could
 report this bug other than this mailing list?

 Here is my session info:
  sessionInfo()
 R version 2.15.2 (2012-10-26)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
 [1] C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
 [1] nnet_7.3-5  pROC_1.5.4  caret_5.15-052  foreach_1.4.0
 [5] cluster_1.14.3  plyr_1.8reshape2_1.2.2  lattice_0.20-13

 loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2 iterators_1.0.6
 [5] stringr_0.6.2   tools_2.15.2

 Thanks,

 James





-- 

Max


Re: [R] caret pls model statistics

2013-03-03 Thread Max Kuhn

That is the most common formula, but not the only one. See

  Kvålseth, T. (1985). Cautionary note about R^2. American Statistician,
39(4), 279-285.

Traditionally, the symbol 'R' is used for the Pearson correlation
coefficient and one way to calculate R^2 is... R^2.
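
A small base-R sketch of why the choice of formula matters. The numbers are made up: the predictions are perfectly correlated with the observed values but badly scaled, so the two definitions disagree.

```r
# Made-up observed values and predictions (predictions are exactly 2 * obs).
obs  <- c(1, 2, 3, 4, 5)
pred <- c(2, 4, 6, 8, 10)

# Definition 1: squared Pearson correlation.
r2_cor <- cor(obs, pred)^2

# Definition 2: 1 - (residual sum of squares) / (total sum of squares).
r2_sse <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

r2_cor  # 1
r2_sse  # -4.5
```

The squared correlation ignores bias and scale in the predictions, while the sum-of-squares version penalizes them and can go negative.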

Max


On Sun, Mar 3, 2013 at 3:16 PM, Charles Determan Jr deter...@umn.eduwrote:

 I was under the impression that in PLS analysis, R2 was calculated by 1-
 (Residual sum of squares) / (Sum of squares).  Is this still what you are
 referring to?  I am aware of the linear R2 which is how well two variables
 are correlated but the prior equation seems different to me.  Could you
 explain if this is the same concept?

 Charles


 On Sun, Mar 3, 2013 at 12:46 PM, Max Kuhn mxk...@gmail.com wrote:

  Is there some literature that you make that statement?

 No, but there isn't literature on changing a lightbulb with a duck either.

  Are these papers incorrect in using these statistics?

 Definitely, if they convert 3+ categories to integers (but there are
 specialized R^2 metrics for binary classification models). Otherwise, they
 are just using an ill-suited score.

  How would you explain such an R^2 value to someone? R^2 is
 a function of correlation between the two random variables. For two
 classes, one of them is binary. What does it mean?

 Historically, models rooted in computer science (eg neural networks) used
 RMSE or SSE to fit models with binary outcomes and that *can* work
 well.

 However, I don't think that communicating R^2 is effective. Other metrics
 (e.g. accuracy, Kappa, area under the ROC curve, etc) are designed to
 measure the ability of a model to classify and work well. With 3+
 categories, I tend to use Kappa.
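
 As a rough illustration of the Kappa metric, here is a base-R sketch on a made-up 2 x 2 confusion matrix (the counts are invented for the example):

```r
# Made-up confusion matrix: rows are predicted classes, columns observed.
conf_mat <- matrix(c(40, 5, 10, 45), nrow = 2,
                   dimnames = list(predicted = c("A", "B"),
                                   observed  = c("A", "B")))

n <- sum(conf_mat)

# Observed agreement: the proportion on the diagonal (plain accuracy).
p_obs <- sum(diag(conf_mat)) / n

# Agreement expected by chance, from the row and column totals.
p_exp <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled so that 1 is perfect.
kappa <- (p_obs - p_exp) / (1 - p_exp)

p_obs  # 0.85
kappa  # 0.7
```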

 Max




 On Sun, Mar 3, 2013 at 10:53 AM, Charles Determan Jr deter...@umn.eduwrote:

 Thank you for your response Max.  Is there some literature that you make
 that statement?  I am confused as I have seen many publications that
 contain R^2 and Q^2 following PLSDA analysis.  The analysis usually is to
 discriminate groups (ie. classification).  Are these papers incorrect in
 using these statistics?

 Regards,
 Charles


 On Sat, Mar 2, 2013 at 10:39 PM, Max Kuhn mxk...@gmail.com wrote:

 Charles,

 You should not be treating the classes as numeric (is virginica really
 three times setosa?). Q^2 and/or R^2 are not appropriate for 
 classification.

 Max


 On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr 
 deter...@umn.eduwrote:

 I have discovered one of my errors.  The timematrix was unnecessary and an
 unfortunate habit I brought from another package.  The following provides
 the same R2 values as it should, however, I still don't know how to
 retrieve Q2 values.  Any insight would again be appreciated:

 library(caret)
 library(pls)

 data(iris)

 #needed to convert to numeric in order to do regression
 #I don't fully understand this but if I left as a factor I would get an
 #error following the summary function
 iris$Species=as.numeric(iris$Species)
 inTrain1=createDataPartition(y=iris$Species,
 p=.75,
 list=FALSE)

 training1=iris[inTrain1,]
 testing1=iris[-inTrain1,]

 ctrl1=trainControl(method="cv",
 number=10)

 plsFit2=train(Species~.,
 data=training1,
 method="pls",
 trControl=ctrl1,
 metric="Rsquared",
 preProc=c("scale"))

 data(iris)
 training1=iris[inTrain1,]
 datvars=training1[,1:4]
 dat.sc=scale(datvars)

 pls.dat=plsr(as.numeric(training1$Species)~dat.sc,
 ncomp=3, method="oscorespls", data=training1)

 x=crossval(pls.dat, segments=10)

 summary(x)
 summary(plsFit2)

 Regards,
 Charles

 On Sat, Mar 2, 2013 at 3:55 PM, Charles Determan Jr deter...@umn.edu
 wrote:

  Greetings,
 
  I have been exploring the use of the caret package to conduct some
 plsda
  modeling.  Previously, I have come across methods that result in a
 R2 and
  Q2 for the model.  Using the 'iris' data set, I wanted to see if I
 could
  accomplish this with the caret package.  I use the following code:
 
  library(caret)
  data(iris)
 
  #needed to convert to numeric in order to do regression
  #I don't fully understand this but if I left as a factor I would get an
  #error following the summary function
  iris$Species=as.numeric(iris$Species)
  inTrain1=createDataPartition(y=iris$Species,
  p=.75,
  list=FALSE)
 
  training1=iris[inTrain1,]
  testing1=iris[-inTrain1,]
 
  ctrl1=trainControl(method="cv",
  number=10)
 
  plsFit2=train(Species~.,
  data=training1,
  method="pls",
  trControl=ctrl1,
  metric="Rsquared",
  preProc=c("scale"))
 
  data(iris)
  training1=iris[inTrain1,]
  datvars=training1[,1:4]
  dat.sc=scale(datvars)
 
  n=nrow(dat.sc)
  dat.indices=seq(1,n)
 
  timematrix=with(training1,
  classvec2classmat(Species[dat.indices]))
 
  pls.dat=plsr(timematrix ~ dat.sc,
  ncomp=3, method="oscorespls", data=training1)
 
  x=crossval(pls.dat, segments=10)
 
  summary(x)
  summary(plsFit2)
 
  I see two different R2 values and I cannot figure out

Re: [R] caret pls model statistics

2013-03-02 Thread Max Kuhn

Charles,

You should not be treating the classes as numeric (is virginica really
three times setosa?). Q^2 and/or R^2 are not appropriate for classification.
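
To see why the numeric coding is arbitrary, here is a short base-R sketch; the reordered factor `sp2` is made up for the illustration.

```r
# as.numeric() on a factor returns the position of each level, not a
# measurement. For iris the levels are in alphabetical order.
sp <- iris$Species
levels(sp)                    # "setosa" "versicolor" "virginica"
table(as.numeric(sp), sp)     # setosa = 1, versicolor = 2, virginica = 3

# Reordering the levels changes the "numbers" without changing the data.
sp2 <- factor(sp, levels = c("virginica", "setosa", "versicolor"))
table(as.numeric(sp2), sp2)   # virginica = 1, setosa = 2, versicolor = 3
```

A regression on these codes would give different answers under the two orderings, which is why the outcome should stay a factor for classification.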

Max


On Sat, Mar 2, 2013 at 5:21 PM, Charles Determan Jr deter...@umn.eduwrote:

 I have discovered one of my errors.  The timematrix was unnecessary and an
 unfortunate habit I brought from another package.  The following provides
 the same R2 values as it should, however, I still don't know how to
 retrieve Q2 values.  Any insight would again be appreciated:

 library(caret)
 library(pls)

 data(iris)

 #needed to convert to numeric in order to do regression
 #I don't fully understand this but if I left as a factor I would get an
 #error following the summary function
 iris$Species=as.numeric(iris$Species)
 inTrain1=createDataPartition(y=iris$Species,
 p=.75,
 list=FALSE)

 training1=iris[inTrain1,]
 testing1=iris[-inTrain1,]

 ctrl1=trainControl(method="cv",
 number=10)

 plsFit2=train(Species~.,
 data=training1,
 method="pls",
 trControl=ctrl1,
 metric="Rsquared",
 preProc=c("scale"))

 data(iris)
 training1=iris[inTrain1,]
 datvars=training1[,1:4]
 dat.sc=scale(datvars)

 pls.dat=plsr(as.numeric(training1$Species)~dat.sc,
 ncomp=3, method="oscorespls", data=training1)

 x=crossval(pls.dat, segments=10)

 summary(x)
 summary(plsFit2)

 Regards,
 Charles

 On Sat, Mar 2, 2013 at 3:55 PM, Charles Determan Jr deter...@umn.edu
 wrote:

  Greetings,
 
  I have been exploring the use of the caret package to conduct some plsda
  modeling.  Previously, I have come across methods that result in a R2 and
  Q2 for the model.  Using the 'iris' data set, I wanted to see if I could
  accomplish this with the caret package.  I use the following code:
 
  library(caret)
  data(iris)
 
  #needed to convert to numeric in order to do regression
  #I don't fully understand this but if I left as a factor I would get an
  #error following the summary function
  iris$Species=as.numeric(iris$Species)
  inTrain1=createDataPartition(y=iris$Species,
  p=.75,
  list=FALSE)
 
  training1=iris[inTrain1,]
  testing1=iris[-inTrain1,]
 
  ctrl1=trainControl(method="cv",
  number=10)
 
  plsFit2=train(Species~.,
  data=training1,
  method="pls",
  trControl=ctrl1,
  metric="Rsquared",
  preProc=c("scale"))
 
  data(iris)
  training1=iris[inTrain1,]
  datvars=training1[,1:4]
  dat.sc=scale(datvars)
 
  n=nrow(dat.sc)
  dat.indices=seq(1,n)
 
  timematrix=with(training1,
  classvec2classmat(Species[dat.indices]))
 
  pls.dat=plsr(timematrix ~ dat.sc,
  ncomp=3, method="oscorespls", data=training1)
 
  x=crossval(pls.dat, segments=10)
 
  summary(x)
  summary(plsFit2)
 
  I see two different R2 values and I cannot figure out how to get the Q2
  value.  Any insight as to what my errors may be would be appreciated.
 
  Regards,
 
  --
  Charles
 



 --
 Charles Determan
 Integrated Biosciences PhD Student
 University of Minnesota





-- 

Max


Re: [R] odfWeave: Trouble Getting the Package to Work

2013-02-18 Thread Max Kuhn

That's not a reproducible example. There is no sessionInfo() and you
omitted code (where did 'fp' come from?).

It works fine for me (see sessionInfo below) using the code in ?odfWeave.

As for the file paths: you can point to different paths for the files
(although don't change the working directory in the odt file). The
documentation for workDir says: "a path to a directory where the source file
will be unpacked and processed. If it does not exist, it will be created.
If it exists, it should be empty, since all its contents will be included
in the generated file. The default value should be sufficient."

Max

 sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid  stats graphics  grDevices utils datasets  methods
base

other attached packages:
[1] MASS_7.3-22 odfWeave_0.8.2  XML_3.95-0.1lattice_0.20-10

loaded via a namespace (and not attached):
[1] tools_2.15.2




On Mon, Feb 18, 2013 at 8:52 AM, Paul Miller pjmiller...@yahoo.com wrote:

 Hello All,

 Have recently started learning Sweave and Knitr. Am now trying to learn
 odfWeave as well. Things went pretty smoothly with Sweave and Knitr but I'm
 having some trouble with odfWeave.

 My understanding was that odfWeave should work in pretty much the same way
 as Sweave. With odfWeave, you set up an input .odt file in a folder, run
 that file through the odfWeave function, and then the function produces an
 output .odt file in the same folder.

 So I decided to try that using a file called simple.odt that comes with
 the odfWeave package. Unfortunately, things didn't work out quite as I had
 hoped. Below is the result of my attempt to odfWeave that file via Emacs.

 For some reason, odfWeave is setting the wd to a location on the C drive
 when my input file is on the N drive. I tried altering this by setting the
 location of workDir to my folder on the N drive. odfWeave threw up an
 error saying that this folder already exists. So perhaps the files are
 supposed to be processed in a location other than the one where the input
 file resides.

 The other thing is that odfWeave is finding an unexpected . There is
 text in the simple.odt input file that looks like
 paste(levels(iris$Species), collapse =  but it has no . So presumably
 something is wrong in the xml markup that is being produced.

 If anyone can help me understand what is going wrong here, that would be
 greatly appreciated.

 Thanks,

 Paul

 > library(odfWeave)
 Loading required package: lattice
 Loading required package: XML
 > inFile  <- paste(fp, "simple.odt", sep="")
 > outFile <- paste(fp, "output.odt", sep="")
 > odfWeave(inFile, outFile)
   Copying  N:/Studies/HCRPC1211/Documentation/R Documentation/odfWeave
 Documentation/Examples/Example 1/simple.odt
   Setting wd to
 C:\Users\pmiller\AppData\Local\Temp\3\RtmpMlDMHV/odfWeave18071055703
   Unzipping ODF file using unzip -o simple.odt
 Archive:  simple.odt
  extracting: mimetype
   inflating: meta.xml
   inflating: settings.xml
   inflating: content.xml
  extracting: Thumbnails/thumbnail.png
   inflating: layout-cache
   inflating: manifest.rdf
creating: Configurations2/popupmenu/
creating: Configurations2/images/Bitmaps/
creating: Configurations2/toolpanel/
creating: Configurations2/statusbar/
creating: Configurations2/toolbar/
creating: Configurations2/progressbar/
creating: Configurations2/menubar/
creating: Configurations2/floater/
   inflating: Configurations2/accelerator/current.xml
   inflating: styles.xml
   inflating: META-INF/manifest.xml
   Removing  simple.odt
   Creating a Pictures directory
   Pre-processing the contents
   Sweaving  content.Rnw
   Writing to file content_1.xml
   Processing code chunks ...
 Error in parse(text = cmd) : text:1:40: unexpected ''
 1: paste(levels(iris$Species), collapse = 
   ^
 
 [[alternative HTML version deleted]]


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.





Re: [R] CARET: Any way to access other tuning parameters?

2013-02-13 Thread Max Kuhn

James,

You really need to read the documentation. Almost every question that you
have has been addressed in the existing material. For this one, there is a
section on custom models here:

   http://caret.r-forge.r-project.org/training.html

Max


On Wed, Feb 13, 2013 at 9:58 AM, James Jong ribonucle...@gmail.com wrote:

 The documentation for caret::train shows a list of parameters that one can
  tune for each method classification/regression method. For example, for
 the method randomForest one can tune mtry in the call to train. But the
  function call to train random forests in the original package has many
 other parameters, e.g. sampsize, maxnodes, etc.

 Is there **any** way to access these parameters using train in caret? (Is
 the function caret::createGrid limited to the list of parameters specified
 in the caret documentation, it's not super clear if the list of parameter
 is for all the caret APIs).

 Thanks,

 James,


Re: [R] CARET: Any way to access other tuning parameters?

2013-02-13 Thread Max Kuhn

@Max - Thanks a lot for your help. I have already been using that website
 as a reference, and it's incredibly helpful. I have also been experimenting
 with tuneGrid already. My question was specifically if tuneGrid (or caret
 in general) supports passing method parameters to the method functions from
 each package other than those listed in the CARET documentation (e.g. I
 would like to specify sampsize and nodesize for randomForest, and not just
 mtry).


Yes. A custom method is how you do that.


 Thanks,

 James







Re: [R] pROC and ROCR give different values for AUC

2012-12-19 Thread Max Kuhn

A reproducible example sent to the package maintainer(s)
might yield results.

Max


On Wed, Dec 19, 2012 at 7:47 AM, Ivana Cace i.c...@ati-a.nl wrote:

 Packages pROC and ROCR both calculate/approximate the Area Under (Receiver
 Operator) Curve. However the results are different.

 I am computing a new variable as a predictor for a label. The new variable
 is a (non-linear) function of a set of input values, and I'm checking how
 different parameter settings contribute to prediction. All my settings are
 predictive, but some are better.

 The AUC I got with pROC was much lower than expected, so I tried ROCR.
 Here are some comparisons:

 AUC from pROC    AUC from ROCR
 0.49465          0.79311
 0.49465          0.79349
 0.49701          0.79446
 0.49701          0.79764

 When I draw the ROC (with pROC) I get the curve I expect. But why is the
 AUC according to pROC so different?

 Ivana





Re: [R] Help with this error kernlab class probability calculations failed; returning NAs

2012-11-29 Thread Max Kuhn

You didn't provide the results of sessionInfo().

Upgrade to the version just released on cran and see if you still have the
issue.

Max


On Thu, Nov 29, 2012 at 6:55 PM, Brian Feeny bfe...@mac.com wrote:

 I have never been able to get class probabilities to work and I am
 relatively new to using these tools, and I am looking for some insight as
 to what may be wrong.

 I am using caret with kernlab/ksvm.  I will simplify my problem to a basic
 data set which produces the same problem.  I have read the caret vignettes
 as well as documentation for ?train.  I appreciate any direction you can
 give.  I realize this is a very small dataset, the actual data is much
 larger, I am just using 10 rows as an example:

 trainset <- data.frame(
   outcome=factor(c(0,1,0,1,0,1,1,1,1,0)),
   age=c(10, 23, 5, 28, 81, 48, 82, 23, 11, 9),
   amount=c(10.11, 22.23, 494.2, 2.0, 29.2, 39.2, 39.2, 39.0, 11.1, 12.2)
 )

 > str(trainset)
 'data.frame':   7 obs. of  3 variables:
  $ outcome: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1
  $ age    : num  23 5 28 48 82 11 9
  $ amount : num  22.2 494.2 2 39.2 39.2 ...

 > colSums(is.na(trainset))
 outcome     age  amount
       0       0       0


 ## SAMPLING AND FORMULA
 dataset <- trainset
 index <- 1:nrow(dataset)
 testindex <- sample(index, trunc(length(index)*30/100))
 trainset <- dataset[-testindex,]
 testset <- dataset[testindex,-1]


 ## TUNE caret / kernlab
 set.seed(1)
 MyTrainControl=trainControl(
   method = "repeatedcv",
   number=10,
   repeats=5,
   returnResamp = "all",
   classProbs = TRUE
 )


 ## MODEL
 rbfSVM <- train(outcome~., data = trainset,
                 method="svmRadial",
                 preProc = c("scale"),
                 tuneLength = 10,
                 trControl=MyTrainControl,
                 fit = FALSE
 )

 There were 50 or more warnings (use warnings() to see the first 50)
 > warnings()
 Warning messages:
 1: In train.default(x, y, weights = w, ...) :
   At least one of the class levels are not valid R variables names; This
 may cause errors if class probabilities are generated because the variables
 names will be converted to: X0, X1
 2:  In caret:::predictionFunction(method = method, modelFit = mod$fit,
  ... :
   kernlab class prediction calculations failed; returning NAs


Re: [R] Help with this error kernlab class probability calculations failed; returning NAs

2012-11-29 Thread Max Kuhn

Your output has:

At least one of the class levels are not valid R variables names; This may
cause errors if class probabilities are generated because the variables
names will be converted to: X0, X1

Try changing the factor levels to avoid leading numbers and try again.
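
As a rough illustration of that suggestion, the sketch below uses base R only.
The outcome vector copies the quoted example, and the replacement labels
"no" and "yes" are arbitrary choices made for this note, not anything caret
requires.

```r
# Outcome coded as 0/1, as in the quoted example.
outcome <- factor(c(0, 1, 0, 1, 0, 1, 1, 1, 1, 0))

# "0" and "1" are not syntactically valid R names; make.names() shows
# what they would be converted to.
converted <- make.names(levels(outcome))
print(converted)

# Relabel the levels with valid names before calling train().
outcome2 <- factor(outcome, levels = c("0", "1"), labels = c("no", "yes"))
print(levels(outcome2))
```

After relabelling, the levels are already valid names, so no conversion is
needed when class probabilities are requested.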

Max




On Thu, Nov 29, 2012 at 10:18 PM, Brian Feeny bfe...@mac.com wrote:



 Yes I am still getting this error, here is my sessionInfo:

 > sessionInfo()
 R version 2.15.2 (2012-10-26)
 Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

 locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] e1071_1.6-1     class_7.3-5     kernlab_0.9-14  caret_5.15-045  foreach_1.4.0   cluster_1.14.3
 [7] reshape_0.8.4   plyr_1.7.1      lattice_0.20-10

 loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.2 grid_2.15.2     iterators_1.0.6 tools_2.15.2


 Is there an example that uses classProbs? I could try to run it to see
 whether it works on my system.

 Brian


Re: [R] caret train and trainControl

2012-11-23 Thread Max Kuhn

Brian,

This is all outlined in the package documentation. The final model is fit
automatically. For example, using 'verboseIter' provides details. From
?train

> knnFit1 <- train(TrainData, TrainClasses,
+                  method = "knn",
+                  preProcess = c("center", "scale"),
+                  tuneLength = 10,
+                  trControl = trainControl(method = "cv", verboseIter = TRUE))
+ Fold01: k= 5
- Fold01: k= 5
+ Fold01: k= 7
- Fold01: k= 7
+ Fold01: k= 9
- Fold01: k= 9
+ Fold01: k=11
- Fold01: k=11
<snip>
+ Fold10: k=17
- Fold10: k=17
+ Fold10: k=19
- Fold10: k=19
+ Fold10: k=21
- Fold10: k=21
+ Fold10: k=23
- Fold10: k=23
Aggregating results
Selecting tuning parameters
Fitting model on full training set


Max


On Fri, Nov 23, 2012 at 5:52 PM, Brian Feeny bfe...@mac.com wrote:


 I am used to packages like e1071 where you have a tune step and then pass
 your tunings to train.

 It seems with caret, tuning and training are both handled by train.

 I am using train and trainControl to find my hyper parameters like so:

 MyTrainControl=trainControl(
   method = "cv",
   number=5,
   returnResamp = "all",
   classProbs = TRUE
 )

 rbfSVM <- train(label~., data = trainset,
                 method="svmRadial",
                 tuneGrid = expand.grid(.sigma=c(0.0118),.C=c(8,16,32,64,128)),
                 trControl=MyTrainControl,
                 fit = FALSE
 )

 Once this returns my ideal parameters, in this case Cost of 64, do I
 simply just re-run the whole process again, passing a grid only containing
 the specific parameters? like so?


 rbfSVM <- train(label~., data = trainset,
                 method="svmRadial",
                 tuneGrid = expand.grid(.sigma=0.0118,.C=64),
                 trControl=MyTrainControl,
                 fit = FALSE
 )

 This is what I have been doing but I am new to caret and want to make sure
 I am doing this correctly.

 Brian


Re: [R] Decision Tree: Am I Missing Anything?

2012-09-22 Thread Max Kuhn

Vik,

On Fri, Sep 21, 2012 at 12:42 PM, Vik Rubenfeld v...@mindspring.com wrote:
 Max, I installed C50. I have a question about the syntax. Per the C50 manual:

 ## Default S3 method:
 C5.0(x, y, trials = 1, rules= FALSE,
 weights = NULL,
 control = C5.0Control(),
 costs = NULL, ...)

 ## S3 method for class ’formula’
 C5.0(formula, data, weights, subset,
 na.action = na.pass, ...)

 I believe I need the method for class 'formula'. But I don't yet see in the 
 manual how to tell C50 that I want to use that method. If I run:

 respLevel = read.csv("Resp Level Data.csv")
 respLevelTree = C5.0(BRAND_NAME ~ PRI + PROM + REVW + MODE + FORM + FAMI +
 DRRE + FREC + SPED, data = respLevel)

 ...I get an error message:

 Error in gsub(":", ".", x, fixed = TRUE) :
   input string 18 is invalid in this locale

You're not doing it wrong.

Can you send me the results of sessionInfo()? I think there are a few
issues with the function on Windows, so a reproducible example would
help solve the issue.


Re: [R] Caret: Use timingSamps leads to error

2012-07-12 Thread Max Kuhn

I can reproduce the errors. I'll take a look.

Thanks,

Max

On Thu, Jul 12, 2012 at 5:24 AM, Dominik Bruhn domi...@dbruhn.de wrote:
 I want to use the caret package and found out about the timingSamps
 option to obtain the time which is needed to predict results. But, as
 soon as I set a value for this option, the whole model generation fails.
 Check this example:

 -
 library(caret)

 tc=trainControl(method='LGOCV', timingSamps=10)
 tcWithout=trainControl(method='LGOCV')

 x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tcWithout)

 x=train(Volume~Girth+Height, method="lm", data=trees, trControl=tc)
 Error in eval(expr, envir, enclos) : object 'Girth' not found
 Timing stopped at: 0 0 0.003
 

 As you can see, the model generation works without the timingSamps
 option but fails if it is specified.

 What am I doing wrong?

 My sessioninfo:
 --
 R version 2.15.0 (2012-03-30)
 Platform: x86_64-pc-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_GB.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=C LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 other attached packages:
 [1] MASS_7.3-18caret_5.15-023 foreach_1.4.0  cluster_1.14.2
 reshape_0.8.4
 [6] plyr_1.7.1 lattice_0.20-6

 loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
 [5] tools_2.15.0
 -

 Thanks!
 --
 Dominik Bruhn
 mailto: domi...@dbruhn.de





Re: [R] caret() train based on cross validation - split dataset to keep sites together?

2012-05-30 Thread Max Kuhn

Tyrell,

If you want to have the folds contain data from only one site at a
time, you can develop a set of row indices and pass these to the index
argument in trainControl. For example

   index = list(site1 = c(1, 6, 8, 12), site2 = c(120, 152, 176, 178),
site3 = c(754, 789, 981))

The first fold would fit a model on the site 1 rows given in the first
list element and predict everything else, and so on.

I'm not sure if this is what you need, but there you go.
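
A list like that does not have to be typed by hand. The sketch below builds it
from a grouping column using base R. The data frame dat, its site column and
the two list names are made up for illustration, and the trainControl call is
only shown as a comment since it needs caret.

```r
# Toy data: 'site' is a hypothetical grouping column.
dat <- data.frame(site = rep(c("a", "b", "c"), times = c(4, 3, 5)),
                  y    = seq_len(12))

# Row numbers grouped by site.
rows_by_site <- split(seq_len(nrow(dat)), dat$site)

# One resample per site, fitting on the rows from that site only
# (the layout of the example above).
index_one_site <- rows_by_site

# Alternative: fit on every other site and hold this site out entirely.
index_leave_site_out <- lapply(rows_by_site, function(held_out) {
  setdiff(seq_len(nrow(dat)), held_out)
})

# Not run (requires caret):
# ctrl <- trainControl(method = "cv", index = index_leave_site_out)
```

Either way, rows from one site never end up on both sides of a split. Which
of the two lists is appropriate depends on how much training data each
resample should see.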

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber jtdewe...@gmail.com wrote:
 Hello all,

 I have searched and have not yet identified a solution so now I am sending
 this message. In short, I need to split my data into training, validation,
 and testing subsets that keep all observations from the same sites together
 – preferably as part of a cross validation procedure. Now for the longer
 version. And I must confess that although my R skills are improving, they
 are not so highly developed.

 I am using 10 fold cross validation with 3 repeats in the train function of
 the caret() package to identify an optimal nnet (neural network) model to
 predict daily river water temperature at unsampled sites. I am also
 withholding data from 10% of sites to have a better understanding of
 generalization error. However, the focus on predictions at other sites is
 turning out to be not easily facilitated – as far as I can see.  My data
 structure (example at bottom of email) consists of columns identifying the
 site, the date, the water temperature on that day for the site (response
 variable), and many predictors.  There are over 220,000 individual
 observations at ~1,000 sites, and each site has a minimum of 30
 observations.  It is important to keep sites separate because selecting a
 model based on predictions at an already sampled site is likely
 overly-optimistic.

 Is there a way to split data for (or preferably during) cross validation
 procedure to:

 1.) Selects a separate validation dataset from 10% of sites
 2.) Splits remaining training data into cross validation subsets and most
 importantly, keeping all observations from a site together
 3.) Secondarily, constrain partitions to be similar - ideally based on
 distributions of all variables

 It seems that some combination of the sample.split function of the caTools()
 package and the createDataPartition function of caret() might do this, but I
 am at a loss for how to code that.

 If this is not possible, I would be content to skip the cross validation
 procedure and create three similar splits of my data that keep all
 observations from a site together – one for training, one for testing, and
 one for validation.  The alternative goal here would be to split the data
 where 80% of sites are training, 10% of sites are for testing (model
 selection), and 10% of sites for validation.

 Thank you and please let me know if there are any remaining questions.  This
 is my first post as well, so if I left anything out that would be good to
 know as well.

 Tyrell Deweber



 R version 2.13.1 (2011-07-08)
 Copyright (C) 2011 The R Foundation for Statistical Computing
 ISBN 3-900051-07-0
 Platform: x86_64-redhat-linux-gnu (64-bit)

 Comid   tempymd      watmntemp   airtemp   predictorb   …
 15433   1980-05-01   11.4        22.1      …
 15433   1980-05-02   11.6        23.6      …
 15433   1980-05-03   11.2        28.5
 15687   1980-06-01   13.5        26.5
 15687   1980-06-02   14.2        26.9
 15687   1980-06-03   13.8        28.9
 18994   1980-04-05   8.4         16.4
 18994   1980-04-06   8.3         12.6
 90342   1980-07-13   18.9        22.3
 90342   1980-07-14   19.3        28.4


 EXAMPLE SCRIPT FOR MODEL FITTING


 fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3)

 tuning <- read.table("temptunegrid.txt",head=T,sep=",")
 tuning


 # # Model with 100 iterations
 registerDoMC(4)
 tempmod100its <- train(watmntemp~tempa + tempb + tempc + tempd + tempe +
        netarea + netbuffor + strmslope +
        netsoilprm + netslope + gwndx + mnaspect + urb + ag + forest +
        buffor + tempa7day + tempb7day +
        tempc7day + tempd7day + tempe7day +  tempa30day + tempb30day +
        tempc30day + tempd30day + tempe30day,
        data = temp.train, method = "nnet", linout=T, maxit = 100,
        MaxNWts = 10, metric = "RMSE", trControl = fitControl,
        tuneGrid = tuning, trace = T)


Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-17 Thread Max Kuhn

Dominik,

There are a number of formulations of this statistic (see the
Kvålseth[*] reference below).

I tend to think of R^2 as the proportion of variance explained by the
model[**]. With the traditional formula, it is possible to get
negative proportions (if there are extreme outliers in the
predictions, the negative proportion can be very large). I used the
squared-correlation formulation because it is always between 0 and 1.
It is called R^2 after all!

Here is an example:

> set.seed(1)
> simObserved <- rnorm(100)
> simPredicted <- simObserved + rnorm(100)*.1

> cor(simObserved, simPredicted)^2
[1] 0.9887525
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
      RMSE   Rsquared
0.09538273 0.98860908

> simPredicted[1]
[1] -0.6884905
> simPredicted[1] <- 10

> cor(simObserved, simPredicted)^2
[1] 0.3669257
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
     RMSE  Rsquared
 1.066900 -0.425169

It is somewhat extreme, but it does happen.

Max


* Kvålseth, T. (1985). Cautionary note about R^2. The American
Statistician, 39(4), 279–285.
** This is a very controversial statement when non-linear models are
used. I'd rather use RMSE, but many scientists I work with still think
in terms of R^2 regardless of the model. The randomForest function
also computes this statistic, but calls it "% Var explained" instead
of explicitly labeling it as R^2. This statistic has generated
heated debates and I hope that I will not have to wear a scarlet R in
Nashville in a few weeks.


On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn domi...@dbruhn.de wrote:
 Hy Max,
 thanks again for the answer.

 I checked the caret implementation and you were right. If the
 predictions from the model are constant (i.e. sd(pred)==0) then the
 implementation returns NA for the rSquare (in postResample). This is
 mainly because the caret implementation uses `cor` (from the
 stats package), which would throw an error for values with sd(pred)==0.

 Do you know why this is implemented in this way? I wrote my own
 summaryFunction which calculates rSquare by hand and it works fine. It
 nevertheless does NOT(!) generate the same values as the original
 implementation. It seems that the calculation of Rsquared is not
 consistent. I took mine from Wikipedia [1].

 Here is my code:
 ---
 customSummary <- function (data, lev = NULL, model = NULL) {
         # Calculate rSquare as 1 - SSE/SST
         ssTot <- sum((data$obs-mean(data$obs))^2)
         ssErr <- sum((data$obs-data$pred)^2)
         rSquare <- 1-(ssErr/ssTot)

         # Calculate MSE
         mse <- mean((data$pred - data$obs)^2)

         # Aggregate into the named vector that train() expects
         out <- c(sqrt(mse), rSquare)
         names(out) <- c("RMSE", "Rsquared")

         return(out)
 }
 ---

 [1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions

 Thanks!
 Dominik




 On 17/05/12 04:10, Max Kuhn wrote:
 Dominik,

 See this line:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37

 The variance of the predictions is zero. caret uses the formula for
 R^2 by calculating the correlation between the observed data and the
 predictions which uses sd(pred) which is zero. I believe that the same
 would occur with other formulas for R^2.

 Max

 On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn domi...@dbruhn.de wrote:
 Thanks Max for your answer.

 First, I do not understand your post. Why is it a problem if two of
 predictions match? From the formula for calculating R^2 I can see that
 there will be a DivByZero iff the total sum of squares is 0. This is
 only true if the predictions of all the predicted points from the
 test-set are equal to the mean of the test-set. Why should this happen?

 Anyway, I wrote the following code to check what you tried to tell:

 --
 library(caret)
 data(trees)
 formula=Volume~Girth+Height

 customSummary <- function (data, lev = NULL, model = NULL) {
    print(summary(data$pred))
    return(defaultSummary(data, lev, model))
 }

 tc=trainControl(method='cv', summaryFunction=customSummary)
 train(formula, data=trees,  method='rpart', trControl=tc)
 --

 This outputs:
 ---
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.45   18.45   18.45   30.12   35.95   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.69   22.69   22.69   32.94   38.06   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
 [cut many values like this]
 Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
 method = method,  :
  There were missing values in resampled performance measures.
 -

 As I didn't understand your post, I don't know if this confirms your
 assumption.

 Thanks anyway,
 Dominik


 On 16/05/12 17:30, Max Kuhn wrote:
 More information is needed to be sure, but it is most likely that some
 of the resampled rpart models produce the same prediction for the
 hold-out samples (likely the result of no viable split being found).

 Almost

Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-16 Thread Max Kuhn

More information is needed to be sure, but it is most likely that some
of the resampled rpart models produce the same prediction for the
hold-out samples (likely the result of no viable split being found).

Almost every incarnation of R^2 requires the variance of the
prediction. This particular failure mode would result in a divide by
zero.
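
The failure mode can be reproduced without caret. This is a base R sketch
with made-up data: obs is simulated and pred is an arbitrary constant. It
shows the squared-correlation version of R^2 turning into NA when every
hold-out prediction is the same, and for contrast the 1 - SSE/SST version
quoted elsewhere in this thread, which only divides by the variation in the
observed values.

```r
set.seed(1)
obs  <- rnorm(20)
pred <- rep(30, 20)   # every hold-out prediction is identical

# The standard deviation of the predictions is zero.
sd_pred <- sd(pred)

# Squared-correlation R^2: cor() returns NA (with a warning) because
# it has to divide by sd(pred).
r2_cor <- suppressWarnings(cor(obs, pred)^2)

# 1 - SSE/SST divides by the spread of obs instead, so it stays finite
# here, although it is strongly negative for predictions this poor.
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

print(c(sd_pred = sd_pred, r2_cor = r2_cor, r2_trad = r2_trad))
```

An NA like this in any single resample is enough to trigger the "missing
values in resampled performance measures" warning.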

Try using your own summary function (see ?trainControl) and put a
print(summary(data$pred)) in there to verify my claim.

Max


 On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn domi...@dbruhn.de wrote:
 Hy,
 I got the following problem when trying to build an rpart model and using
 everything but LOOCV. Originally, I wanted to use k-fold partitioning,
 but every partitioning except LOOCV throws the following warning:

 
 Warning message: In nominalTrainWorkflow(dat = trainData, info =
 trainInfo, method = method, : There were missing values in resampled
 performance measures.
 -

 Below are some simplified testcases which reproduce the warning on my
 system.

 Question: What does this error mean? How can I avoid it?

 System-Information:
 -
 sessionInfo()
 R version 2.15.0 (2012-03-30)
 Platform: x86_64-pc-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] rpart_3.1-52   caret_5.15-023 foreach_1.4.0  cluster_1.14.2 reshape_0.8.4
 [6] plyr_1.7.1     lattice_0.20-6

 loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0     iterators_1.0.6
 [5] tools_2.15.0
 ---


 Simplified Testcase I: Throws warning
 ---
 library(caret)
 data(trees)
 formula=Volume~Girth+Height
 train(formula, data=trees,  method='rpart')
 ---

 Simplified Testcase II: Every other CV-method also throws the warning,
 for example using 'cv':
 ---
 library(caret)
 data(trees)
 formula=Volume~Girth+Height
 tc=trainControl(method='cv')
 train(formula, data=trees,  method='rpart', trControl=tc)
 ---

 Simplified Testcase III: The only CV-method which is working is 'LOOCV':
 ---
 library(caret)
 data(trees)
 formula=Volume~Girth+Height
 tc=trainControl(method='LOOCV')
 train(formula, data=trees,  method='rpart', trControl=tc)
 ---


 Thanks!
 --
 Dominik Bruhn
 mailto: domi...@dbruhn.de




 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




 --

 Max



-- 

Max

Re: [R] caret: Error when using rpart and CV != LOOCV

2012-05-16 Thread Max Kuhn

Dominik,

See this line:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37

The variance of the predictions is zero. caret uses the formula for
R^2 by calculating the correlation between the observed data and the
predictions which uses sd(pred) which is zero. I believe that the same
would occur with other formulas for R^2.
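
A minimal base R sketch of that failure mode, outside of caret. The observed values are hypothetical numbers chosen for illustration; only the constant prediction 30.37 comes from the output above.

```r
# Hypothetical hold-out values (illustration only)
obs  <- c(10.3, 16.4, 18.8, 24.9, 31.7)
# A tree with no splits predicts one constant value for every sample
pred <- rep(30.37, length(obs))

sd(pred)                               # zero: the predictions do not vary
r    <- suppressWarnings(cor(obs, pred))  # cor() needs a non-zero sd(pred)
rsq  <- r^2                            # NA propagates into the R^2 estimate
rmse <- sqrt(mean((obs - pred)^2))     # RMSE is still well defined

c(RMSE = rmse, Rsquared = rsq)
```

A single resample like this is enough to put a missing value into the resampled performance measures, which is what the warning reports.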

Max

On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn domi...@dbruhn.de wrote:
 Thanks Max for your answer.

 First, I do not understand your post. Why is it a problem if two of
 the predictions match? From the formula for calculating R^2 I can see that
 there will be a DivByZero iff the total sum of squares is 0. This is
 only true if the predictions of all the predicted points from the
 test-set are equal to the mean of the test-set. Why should this happen?

 Anyway, I wrote the following code to check what you tried to tell:

 --
 library(caret)
 data(trees)
 formula=Volume~Girth+Height

 customSummary <- function (data, lev = NULL, model = NULL) {
    print(summary(data$pred))
    return(defaultSummary(data, lev, model))
 }

 tc=trainControl(method='cv', summaryFunction=customSummary)
 train(formula, data=trees,  method='rpart', trControl=tc)
 --

 This outputs:
 ---
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.45   18.45   18.45   30.12   35.95   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.69   22.69   22.69   32.94   38.06   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
 [cut many values like this]
 Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
 method = method,  :
  There were missing values in resampled performance measures.
 -

 As I didn't understand your post, I don't know if this confirms your
 assumption.

 Thanks anyway,
 Dominik


Re: [R] caret package: custom summary function in trainControl doesn't work with oob?

2012-04-13 Thread Max Kuhn

Matt,

 I've been using a custom summary function to optimise regression model
 methods using the caret package. This has worked smoothly. I've been using
 the default bootstrapping resampling method. For bagging models
 (specifically randomForest in this case) caret can, in theory, use the
 out-of-bag (oob) error estimate from the model instead of resampling, which
 (in theory) is largely redundant for such models. Since they take a while
 to build in the first place, it really slows things down when estimating
 performance using bootstrap.

 I can successfully run either using the oob 'resampling method' with the
 default RMSE optimisation, or run using bootstrap and my custom
 summaryFunction as the thing to optimise, but they don't work together. If
 I try and use oob and supply a summaryFunction caret throws an error saying
 it can't find the relevant metric.

 Now, if caret is simply polling the randomForest object for the stored oob
 error I can understand this limitation

That is exactly what it does. See caret:::rfStats (not a public function)

train() was written to be fairly general and this level of control
would be very difficult to implement, especially since each model that
does some type of bagging uses different internal structures etc.

 but in the case of randomForest
 (and probably other bagging methods?) the training function can be asked to
 return information about the individual tree predictions and whether data
 points were oob in each case. With this information you can reconstruct an
 oob 'error' using whatever function you choose to target for optimisation.
 As far as I can tell, caret is not doing this and I can't see anywhere that
 it can be coerced to do so.

It will not be able to do this. I'm not sure that you can either.
randomForest() will return the individual trees and
predict.randomForest() can return the per-tree results but I don't
know if it saves the indices that tell you which bootstrap samples
contained which training set points. Perhaps Andy would know.

 Have I missed something? Can anyone suggest how this could be achieved? It
 wouldn't be *that* hard to code up something that essentially operates in
 the same way as caret.train but can handle this feature for bagging models,
 but if it is already there and I've missed something please let me know.

Well, everything is easy for the person not doing it =]

If you save the proximity measures, you might gain the sampling
indices. With these, you would use predict.randomForest(...,
predict.all=TRUE) to get the individual predictions.
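
A rough sketch of how such a reconstruction could look. It assumes the keep.inbag argument of randomForest(), which is not mentioned in this thread, as the source of the in-bag membership; the trees data, ntree = 200 and the MAE metric are illustrative choices only.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  data(trees)
  set.seed(1)

  # keep.inbag = TRUE (assumption: available in the installed version) stores
  # an n x ntree matrix recording how often each row was drawn for each tree
  rf <- randomForest(Volume ~ Girth + Height, data = trees,
                     ntree = 200, keep.inbag = TRUE)

  # Per-tree predictions for every training row (n x ntree matrix)
  perTree <- predict(rf, newdata = trees, predict.all = TRUE)$individual

  # A row is out-of-bag for a tree when it was never drawn for that tree
  oob <- rf$inbag == 0
  perTree[!oob] <- NA
  oobPred <- rowMeans(perTree, na.rm = TRUE)

  # Any metric can now be computed on the OOB predictions, e.g. MAE
  oobMAE <- mean(abs(trees$Volume - oobPred))
  oobMAE
}
```

For regression, the reconstructed oobPred should agree with rf$predicted, which gives a quick sanity check before plugging in a custom metric.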

Max

[R] nonparametric densities for bounded distributions

2012-03-09 Thread Max Kuhn

Can anyone recommend a good nonparametric density approach for data bounded
(say between 0 and 1)?

For example, using the basic Gaussian density approach doesn't generate a
very realistic shape (nor should it):

 set.seed(1)
 dat <- rbeta(100, 1, 2)
 plot(density(dat))

(note the area outside of 0/1)

The data I have may be bimodal or have other odd properties (e.g. point
mass at zero). I've tried transforming via the logit, estimating the
density, then plotting the curve in the original units, but this seems to do
poorly in the tails (and I have data at exactly zero and one).
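
For reference, a minimal sketch of the logit approach described above, assuming the default Gaussian kernel and bandwidth of density(). It only works when no value is exactly 0 or 1, since the logit is infinite there.

```r
set.seed(1)
dat <- rbeta(100, 1, 2)

logit <- function(p) log(p / (1 - p))

# Estimate the density of the transformed data on the whole real line
dz <- density(logit(dat))

# Map the evaluation grid back to (0, 1) and apply the change of variables:
# f_X(x) = f_Z(logit(x)) / (x * (1 - x))
x  <- plogis(dz$x)
fx <- dz$y / (x * (1 - x))

plot(x, fx, type = "l", xlim = c(0, 1), xlab = "x", ylab = "density")
```

The division by x * (1 - x) is what inflates small estimation errors near the boundaries, which is consistent with the poor tail behaviour mentioned above.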

Thanks,

Max

Re: [R] Custom caret metric based on prob-predictions/rankings

2012-02-10 Thread Max Kuhn

I think you need to read the man pages and the four vignettes. A lot
of your questions have answers there.

If you don't specify the resampling indices, the ones generated for
you are saved in the train object:

> data(iris)
> TrainData <- iris[,1:4]
> TrainClasses <- iris[,5]

> knnFit1 <- train(TrainData, TrainClasses,
+                  method = "knn",
+                  preProcess = c("center", "scale"),
+                  tuneLength = 10,
+                  trControl = trainControl(method = "cv"))
Loading required package: class

Attaching package: ‘class’

The following object(s) are masked from ‘package:reshape’:

    condense

Warning message:
executing %dopar% sequentially: no parallel backend registered
> str(knnFit1$control$index)
List of 10
 $ Fold01: int [1:135] 1 2 3 4 5 6 7 9 10 11 ...
 $ Fold02: int [1:135] 1 2 3 4 5 6 8 9 10 12 ...
 $ Fold03: int [1:135] 1 3 4 5 6 7 8 9 10 11 ...
 $ Fold04: int [1:135] 1 2 3 5 6 7 8 9 10 11 ...
 $ Fold05: int [1:135] 1 2 3 4 6 7 8 9 11 12 ...
 $ Fold06: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
 $ Fold07: int [1:135] 1 2 3 4 5 7 8 9 10 11 ...
 $ Fold08: int [1:135] 2 3 4 5 6 7 8 9 10 11 ...
 $ Fold09: int [1:135] 1 2 3 4 5 6 7 8 9 10 ...
 $ Fold10: int [1:135] 1 2 4 5 6 7 8 10 11 12 ...

There is also a savePredictions argument that gives you the hold-out results.
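
Since each element of the index list holds the rows used for training, the hold-out rows of a resample are simply the complement. A small base R sketch, where the index list is a hypothetical stand-in for knnFit1$control$index (3 folds over 9 rows):

```r
# Hypothetical stand-in for knnFit1$control$index: each element lists
# the rows used for training in one resample
n <- 9
index <- list(Fold1 = 4:9, Fold2 = c(1:3, 7:9), Fold3 = 1:6)

# Hold-out rows are the rows not used for training in that resample
holdout <- lapply(index, function(trainRows) setdiff(seq_len(n), trainRows))
str(holdout)
```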

I'm not sure which weights you are referring to.

On Fri, Feb 10, 2012 at 4:38 AM, Yang Zhang yanghates...@gmail.com wrote:
 Actually, is there any way to get at additional information beyond the
 classProbs?  In particular, is there any way to find out the
 associated weights, or otherwise the row indices into the original
 model matrix corresponding to the tested instances?

 On Thu, Feb 9, 2012 at 4:37 PM, Yang Zhang yanghates...@gmail.com wrote:
 Oops, found trainControl's classProbs right after I sent!

 On Thu, Feb 9, 2012 at 4:30 PM, Yang Zhang yanghates...@gmail.com wrote:
 I'm dealing with classification problems, and I'm trying to specify a
 custom scoring metric (recall@p, ROC, etc.) that depends on not just
 the class output but the probability estimates, so that caret::train
 can choose the optimal tuning parameters based on this metric.

 However, when I supply a trainControl summaryFunction, the data given
 to it contains only class predictions, so the only metrics possible
 are things like accuracy, kappa, etc.

 Is there any way to do this that I'm overlooking?  If not, could I put
 this in as a feature request?  Thanks!

 --
 Yang Zhang
 http://yz.mit.edu/



 --
 Yang Zhang
 http://yz.mit.edu/



 --
 Yang Zhang
 http://yz.mit.edu/



-- 

Max

Re: [R] Choosing glmnet lambda values via caret

2012-02-09 Thread Max Kuhn

You can adjust the candidate set of tuning parameters via the tuneGrid
argument in train() and the process by which the optimal choice is
made via the selectionFunction argument in trainControl(). Check
out the package vignettes.

The latest version also has an update.train() function that lets the
user manually specify the tuning parameters after the call to train().
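
A hedged sketch of what that could look like. The candidate values, the trees data and the "oneSE" rule are illustrative assumptions; the column names alpha and lambda follow recent caret releases, while older releases expected a leading dot (.alpha, .lambda).

```r
# Candidate tuning values (assumed for illustration, not a recommendation)
glmnetGrid <- expand.grid(alpha  = c(0.1, 0.5, 1),
                          lambda = 10^seq(-3, 0, length.out = 10))
nrow(glmnetGrid)

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("glmnet", quietly = TRUE)) {
  library(caret)
  set.seed(1)
  fit <- train(Volume ~ Girth + Height, data = trees,
               method = "glmnet",
               tuneGrid = glmnetGrid,
               # pick the simplest model within one SE of the best one
               trControl = trainControl(method = "cv",
                                        selectionFunction = "oneSE"))
  fit$bestTune
}
```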

On Thu, Feb 9, 2012 at 7:00 PM, Yang Zhang yanghates...@gmail.com wrote:
 Usually when using raw glmnet I let the implementation choose the
 lambdas.  However when training via caret::train the lambda values are
 predetermined.  Is there any way to have caret defer the lambda
 choices to caret::train and thus choose the optimal lambda
 dynamically?

 --
 Yang Zhang
 http://yz.mit.edu/



-- 

Max

[R] lattice key in blank panel

2011-12-15 Thread Max Kuhn

Somewhere I've seen an example of an xyplot() where the key was placed
in a location of a missing panel. For example, if there were 3
conditioning levels, the panel grid would look like:

  3 4
  1 2

In this (possibly imaginary) example, there were scatter plots in
locations 1:3 and location 4 had no conditioning bar at the top, only
the key.

I can find examples of putting the legend outside of the panel
locations (e.g to the right of locations 2 and 4 above), but that's
not really what I'd like to do.

Thanks,

Max

[R] palettes for the color-blind

2011-11-02 Thread Max Kuhn

Everyone,

I'm working with scatter plots with different colored symbols (via
lattice). I'm currently using these colors for points and lines:

col1 <- c(rgb(1, 0, 0), rgb(0, 0, 1),
          rgb(0, 1, 0),
          rgb(0.55482458, 0.40350876, 0.0416),
          rgb(0, 0, 0))
plot(seq(along = col1), pch = 16, col = col1, cex = 1.5)

I'm also using these with transparency (alpha between .5-.8 depending
on the number of points).

I'd like to make sure that these colors are interpretable by the
color-blind. Doing a little looking around, this might be a good palette:

col2 <- c(rgb(0, 0.4470588, 0.6980392),
          rgb(0.8352941, 0.3686275, 0),
          rgb(0.800, 0.4745098, 0.6549020),
          rgb(0.1686275, 0.6235294, 0.4705882),
          rgb(0.9019608, 0.6235294, 0.000))

plot(seq(along = col2), pch = 16, col = col2, cex = 1.5)

but to be honest, I'd like to use something a little more vibrant.

First, can anyone verify that the colors in col2 are
differentiable to someone who is color-blind?

Second, are there any other specific palettes that can be recommended?
How do the RColorBrewer palettes rate in this respect?

Thanks,

Max

Re: [R] palettes for the color-blind

2011-11-02 Thread Max Kuhn

Yes, I was aware of the different types and their respective prevalences.

The dichromat package helped me find what I needed.

Thanks,

Max

On Wed, Nov 2, 2011 at 6:38 PM, Thomas Lumley tlum...@uw.edu wrote:
 On Thu, Nov 3, 2011 at 11:04 AM, Carl Witthoft c...@witthoft.com wrote:

 Before you pick out a palette: you are aware that there are several
 different types of color-blindness, aren't you?

 Yes, but to first approximation there are only two, and they have
 broadly similar, though not identical impact on choice of color
 palettes.  The dichromat package knows about them, and so does
 Professor Brewer.

 More people will be unable to read your graphs due to some kind of
 gross visual impairment (cataracts, uncorrected focusing problems,
 macular degeneration, etc) than will have tritanopia or monochromacy.

   -thomas

 --
 Thomas Lumley
 Professor of Biostatistics
 University of Auckland




-- 

Max

Re: [R] help with parallel processing code

2011-10-31 Thread Max Kuhn

I'm not sure what you mean by full code or the iteration. This uses
foreach to parallelize the loops over different tuning parameters and
resampled data sets.

The only way I could see to split up the parallelism is if you are
fitting different models to the same data. In that case, you could
launch separate jobs for each model. If the data is large and quickly
read from disk, that might be better than storing it in memory and
sequentially running models in the same script. We have decent sized
machines here, so we launch different jobs per model and then
parallelize each (even if it is using 2-3 cores it helps).

Thanks,

Max

On Fri, Oct 28, 2011 at 10:49 AM, 1Rnwb sbpuro...@gmail.com wrote:
 The part of the question that has dawned on me now is: should I try to
 parallelize the full code or only the iteration part? If it is the full code,
 then I am at the complete mercy of the R help community, or I give up on this
 and let the computation run the serial way, which has been running since last
 Saturday.
 Sharad




-- 

Max

Re: [R] Contrasts with an interaction. How does one specify the dummy variables for the interaction

2011-10-31 Thread Max Kuhn

This is failing because it is a saturated model and the contrast
package tries to do a t-test (instead of a z test). I can add code to
do this, but it will take a few days.
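
To illustrate the point with base R only: the interaction model below has zero residual degrees of freedom, so a t reference distribution is unavailable, while a Wald z-test for a linear combination of coefficients can still be computed by hand. This is a sketch of the idea, not the contrast package's implementation; the contrast chosen (row 2 versus row 1 at column 1) is an assumption made for illustration.

```r
mydata <- data.frame(row    = gl(2, 2, 4),
                     column = gl(2, 1, 4),
                     counts = c(50, 50, 30, 70))

fit <- glm(counts ~ row * column, family = poisson(link = "log"),
           data = mydata)
df.residual(fit)   # zero: 4 cells, 4 parameters

# Coefficient order: (Intercept), row2, column2, row2:column2.
# Row 2 vs row 1 at column 1 is the row2 coefficient alone.
L   <- c(0, 1, 0, 0)
est <- sum(L * coef(fit))
se  <- sqrt(drop(t(L) %*% vcov(fit) %*% L))

z <- est / se
p <- 2 * pnorm(-abs(z))   # normal reference distribution instead of t
c(estimate = est, se = se, z = z, p = p)
```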

Max

On Fri, Oct 28, 2011 at 2:16 PM, John Sorkin
jsor...@grecc.umaryland.edu wrote:
 Forgive my resending this post. To date I have received only one response
 (thank you Bert Gunter), and I still do not have an answer to my question.
 Respectfully,
 John


 Windows XP
 R 2.12.1
 contrast package.


 I am trying to understand how to create contrasts for a model that contains
 an interaction. I can get contrasts to work for a model without interaction,
 but not after adding the interaction. Please see code below. The last two
 contrast statements show the problem. I would appreciate someone letting me
 know what is wrong with the syntax of my contrast statements.
 Thank you,
 John


 library(contrast)

 # Create 2x2 contingency table.
 counts=c(50,50,30,70)
 row    <- gl(2,2,4)
 column <- gl(2,1,4)
 mydata <- data.frame(row,column,counts)
 print(mydata)

 # Show levels of 2x2 table
 levels(mydata$row)
 levels(mydata$column)


 # Models, no interaction, and interaction
 fitglm0 <- glm(counts ~ row + column,              family=poisson(link=log))
 fitglm  <- glm(counts ~ row + column + row*column, family=poisson(link=log))

 # Contrasts for model without interaction works fine!
 anova(fitglm0)
 summary(fitglm0)
 con0 <- contrast(fitglm0,list(row=1,column=1))
 print(con0,X=TRUE)

 # Contrast for model with interaction does not work.
 anova(fitglm)
 summary(fitglm)
 con <- contrast(fitglm,list(row=1,column=1))
 print(con,X=TRUE)

 # Nor does this work.
 con <- contrast(fitglm,list(row=1,column=1,row:column=c(0,0)))
 print(con,X=TRUE)




 John David Sorkin M.D., Ph.D.
 Chief, Biostatistics and Informatics
 University of Maryland School of Medicine Division of Gerontology
 Baltimore VA Medical Center
 10 North Greene Street
 GRECC (BT/18/GR)
 Baltimore, MD 21201-1524
 (Phone) 410-605-7119
 (Fax) 410-605-7913 (Please call phone number above prior to faxing)

Re: [R] help with parallel processing code

2011-10-27 Thread Max Kuhn

I have had issues with some parallel backends not finding functions
within a namespace for packages listed in the .packages argument or
explicitly loaded in the body of the foreach loop. This has occurred
with MPI but not with multicore. I can get around this to some extent
by calling the functions using the namespace (eg foo:::bar) but this
is pretty kludgy.

 sessionInfo()
R version 2.13.2 (2011-09-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] doMPI_0.1-5     Rmpi_0.5-9      doMC_1.2.3      multicore_0.1-7
[5] foreach_1.3.2   codetools_0.2-8 iterators_1.0.5

Max

On Thu, Oct 27, 2011 at 4:30 PM, 1Rnwb sbpuro...@gmail.com wrote:
 If I understand correctly you mean to write the line as below:

 foreach(icount(itr),.combine=combine,.options.smp=smpopts,.packages='MASS')%dopar%




-- 

Max

Re: [R] difference between createPartition and createfold functions

2011-10-03 Thread Max Kuhn

No, it is an argument to createFolds. Type ?createFolds to see the
appropriate syntax:

  returnTrain: a logical. When true, the values returned are the sample
  positions corresponding to the data used during training. This argument
  only works in conjunction with list = TRUE

On Mon, Oct 3, 2011 at 11:10 AM,  bby2...@columbia.edu wrote:
 Hi Max,

 Thanks for the note. In your last paragraph, did you mean in
 createDataPartition? I'm a little vague about what returnTrain option does.

 Bonnie

-- 

Max

Re: [R] difference between createPartition and createfold functions

2011-10-02 Thread Max Kuhn

Basically, createDataPartition is used when you need to make one or
more simple two-way splits of your data. For example, if you want to
make a training and test set and keep your classes balanced, this is
what you could use. It can also make multiple splits of this kind
(leave-group-out CV, aka Monte Carlo CV, aka repeated training/test
splits).

createFolds is exclusively for k-fold CV. Their usage is similar when
you use the returnTrain = TRUE option in createFolds.
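
A short sketch contrasting the two functions. The iris data, the 80/20 split and k = 10 are illustrative assumptions; both functions take the outcome vector, not the data frame.

```r
if (requireNamespace("caret", quietly = TRUE)) {
  library(caret)
  set.seed(1)
  y <- iris$Species

  # One stratified 80/20 split: the result holds the training-row indices
  inTrain  <- createDataPartition(y, p = 0.8, list = FALSE)
  training <- iris[inTrain, ]
  testing  <- iris[-inTrain, ]

  # k-fold CV: by default each list element holds the held-out rows of a fold
  heldOut <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

  # With returnTrain = TRUE each element holds the training rows instead
  trainRows <- createFolds(y, k = 10, list = TRUE, returnTrain = TRUE)

  table(training$Species)   # class balance is preserved by the split
  lengths(trainRows)        # rows used for training in each fold
}
```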

Max

On Sun, Oct 2, 2011 at 4:00 PM, Steve Lianoglou
mailinglist.honey...@gmail.com wrote:
 Hi,

 On Sun, Oct 2, 2011 at 3:54 PM,  bby2...@columbia.edu wrote:
 Hi Steve,

 Thanks for the note. I did try the example and the result didn't make sense
 to me. For splitting a vector, what you describe is a big difference btw
 them. For splitting a dataframe, I now wonder if these 2 functions are the
 wrong choices. They seem to split the columns, at least in the few things I
 tried.

 Sorry, I'm a bit confused now as to what you are after.

 You don't pass in a data.frame into any of the
 createFolds/DataPartition functions from the caret package.

 You pass in a *vector* of labels, and these functions tell you which
 indices into the vector to use as examples to hold out (or keep,
 depending on the value you pass in for the `returnTrain` argument)
 between each fold/partition of your learning scenario (e.g. cross
 validation with createFolds).

 You would then use these indices to keep (remove) the rows of a
 data.frame, if that is how you are storing your examples.

 Does that make sense?

 -steve

 --
 Steve Lianoglou
 Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
 Contact Info: http://cbio.mskcc.org/~lianos/contact




-- 

Max

Re: [R] odfWeave: Combining multiple output statements in a function

2011-09-16 Thread Max Kuhn

formatting.odf, page 7. The results are in formattingOut.odt

On Thu, Sep 15, 2011 at 2:44 PM, Jan van der Laan rh...@eoos.dds.nl wrote:
 Max,

 Thank you for your answer. I have had another look at the examples (I
 already had before mailing the list), but could not find the example you
 mention. Could you perhaps tell me which example I should have a look at?

 Regards,
 Jan



-- 

Max

Re: [R] odfWeave: Combining multiple output statements in a function

2011-09-15 Thread Max Kuhn

There are examples in the package directory that explain this.

On Thu, Sep 15, 2011 at 8:16 AM, Jan van der Laan rh...@eoos.dds.nl wrote:

 What is the correct way to combine multiple calls to odfCat, odfItemize,
 odfTable etc. inside a function?

 As an example, let's say I have a function that needs to write two paragraphs
 of text and a list to the resulting odf-document (the real function has much
 more complex logic, but I don't think that's relevant). My first guess would
 be:

 exampleOutput <- function() {
   odfCat("This is the first paragraph")
   odfCat("This is the second paragraph")
   odfItemize(letters[1:5])
 }

 However, calling this function in my odf-document only generates the final
 list, as only the output of the odfItemize function is returned by
 exampleOutput. How do I combine the three results into one to be returned by
 exampleOutput?

 I tried to wrap the calls to the odf* functions into a print statement:

 exampleOutput2 <- function() {
   print(odfCat("This is the first paragraph"))
   print(odfCat("This is the second paragraph"))
   print(odfItemize(letters[1:5]))
 }

 In another document this seemed to work, but in my current document strange
 odf-output is generated.

 Regards,

 Jan





-- 

Max


Re: [R] Trying to extract probabilities in CARET (caret) package with a glmStepAIC model

2011-08-28 Thread Max Kuhn

Can you provide a reproducible example and the results of
sessionInfo()? What are the levels of your classes?
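
A quick check that may be relevant to the question about class levels (an assumption, since the poster's data are not shown): with classProbs = TRUE, caret creates one probability column per class level, so levels that are not syntactically valid R names, such as "0" and "1", are a plausible source of column-selection errors. The sketch below uses a made-up factor `categ` to show how the levels could be inspected and relabelled in base R.

```r
# Hypothetical two-class outcome coded as "0"/"1" (not the poster's data).
categ <- factor(c("0", "1", "1", "0", "1"))

# Levels that differ from make.names() are not valid R variable names.
lv <- levels(categ)
bad <- lv != make.names(lv)
print(data.frame(level = lv, valid_name = !bad))

# One possible repair: relabel the levels with descriptive, valid names.
categ_fixed <- factor(categ, levels = c("0", "1"),
                      labels = c("control", "case"))
print(levels(categ_fixed))
```

The labels "control" and "case" are placeholders; any names that survive make.names() unchanged would do.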

On Sat, Aug 27, 2011 at 10:43 PM, Jon Toledo tintin...@hotmail.com wrote:

 Dear developers,
 I have just started working with caret and all the nice features it offers.
 But I just encountered a problem:
 I am working with a dataset that includes 4 predictor variables in Descr and a
 two-category outcome in Categ (coded as a factor).
 Everything was working fine: I got the results, confusion matrix, etc.
 BUT to obtain the AUC and predicted probabilities I had to add
 classProbs = TRUE in the trainControl. Since then, every time I run train I
 get this message:
 undefined columns selected

 I copy the syntax:
 fitControl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                            returnResamp = "all", verboseIter = FALSE)
 glmFit <- train(Descr, Categ, method = "glmStepAIC", tuneLength = 4,
                 trControl = fitControl)
 Thank you.
 Best regards,

 Jon Toledo, MD

 Postdoctoral fellow
 University of Pennsylvania School of Medicine
 Center for Neurodegenerative Disease Research
 3600 Spruce Street
 3rd Floor Maloney Building
 Philadelphia, Pa 19104





-- 

Max


Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]

2011-06-01 Thread Max Kuhn

David,

The ROC curve should really be computed with some sort of numeric data
(as opposed to classes). It varies the cutoff to get a continuum of
sensitivity and specificity values.  Using the classes as 1's and 2's
implies that the second class is twice the value of the first, which
doesn't really make sense.

Try getting the class probabilities for predicted1 and predicted2 and
use those instead.
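
To illustrate why hard class labels make a poor ROC input, here is a minimal base-R sketch. The helper `auc_rank` is hypothetical (it is not part of caret); it computes the area under the curve from ranks, counting ties as one half. Fed the 1/2 labels rebuilt from the two tables quoted below, each "curve" has only a single interior cutoff; passing predicted class probabilities as `score` instead would trace out a full curve.

```r
# Rank-based (Mann-Whitney) AUC; tied scores count as one half.
# score:    numeric vector, larger values should indicate `positive`
# truth:    factor (or character) of true classes
# positive: the class treated as the event
auc_rank <- function(score, truth, positive) {
  pos   <- truth == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)                       # average ranks for ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hard predictions rebuilt from the table for predicted1:
# 27 hard -> 1, 11 hard -> 2, 99 soft -> 2.
truth1 <- c(rep("hard", 27), rep("hard", 11), rep("soft", 99))
pred1  <- c(rep(1, 27),      rep(2, 11),      rep(2, 99))

# And for predicted2: 27 hard -> 1, 2 soft -> 1, 11 hard -> 2, 97 soft -> 2.
truth2 <- c(rep("hard", 27), rep("soft", 2), rep("hard", 11), rep("soft", 97))
pred2  <- c(rep(1, 27),      rep(1, 2),      rep(2, 11),      rep(2, 97))

auc1 <- auc_rank(pred1, truth1, positive = "soft")
auc2 <- auc_rank(pred2, truth2, positive = "soft")
print(c(predicted1 = auc1, predicted2 = auc2))
```

On these counts the second value agrees with the 0.8451621 reported in the original post, and the first comes out at about 0.855.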

Thanks,

Max


On Wed, Jun 1, 2011 at 9:24 PM, jin...@ga.gov.au wrote:

 Please note that predicted1 and predicted2 are two sets of predictions
 rather than predictors. As you can see, the predictions have only two levels:
 1 is for hard and 2 for soft. I need to assess which one is more accurate.
 Hope this is clear now. Thanks.
 Jin

 -Original Message-
 From: David Winsemius [mailto:dwinsem...@comcast.net]
 Sent: Thursday, 2 June 2011 10:55 AM
 To: Li Jin
 Cc: R-help@r-project.org
 Subject: Re: [R] aucRoc in caret package [SEC=UNCLASSIFIED]

 Using AUC for discrete predictor variables with only two levels
 doesn't seem very sensible. What are you planning to do with this
 measure?

 --
 David.

 On Jun 1, 2011, at 8:47 PM, jin...@ga.gov.au jin...@ga.gov.au wrote:

  Hi all,
  I used the following code and data to get auc values for two sets of
  predictions:
             library(caret)
  table(predicted1, trainy)
    trainy
     hard soft
   1   27    0
   2   11   99
  aucRoc(roc(predicted1, trainy))
  [1] 0.5
 
 
  table(predicted2, trainy)
    trainy
     hard soft
   1   27    2
   2   11   97
  aucRoc(roc(predicted2, trainy))
  [1] 0.8451621
 
  predicted1:
  1 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2
  2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
  2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2
  2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 
  predicted2:
  1 1 2 1 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2
  2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
  2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 2 2
  2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 
  trainy:
  hard hard hard soft soft hard hard hard hard soft soft soft soft
  soft soft hard soft soft soft soft soft soft hard soft soft soft
  soft soft soft soft soft soft hard soft soft soft soft soft hard
  soft soft soft soft hard hard soft soft soft hard soft hard soft
  soft soft soft soft hard soft soft soft soft soft soft soft soft
  hard soft soft soft soft soft hard soft soft soft soft soft soft
  soft hard soft soft soft hard hard hard hard hard soft soft hard
  hard hard soft hard soft soft soft hard hard soft soft soft soft
  soft hard hard hard hard hard hard hard soft soft soft soft soft
  soft soft soft soft soft soft soft soft soft soft soft hard soft
  soft soft soft soft soft soft soft
  Levels: hard soft
 
  Sys.info()
    sysname   Windows
    release   XP
    version   build 2600, Service Pack 3
    nodename  PC-60772
    machine   x86
 
  I would expect predicted1 to be more accurate than predicted2, but
  the auc values show the opposite. I was wondering whether this is a
  bug or I have done something wrong.  Thanks for your help in advance!
 
  Cheers,
 
  Jin
  
  Jin Li, PhD
  Spatial Modeller/Computational Statistician
  Marine & Coastal Environment
  Geoscience Australia
  GPO Box 378, Canberra, ACT 2601, Australia
 
  Ph: 61 (02) 6249 9899; email: jin...@ga.gov.au
  ___
 
 
 

 David Winsemius, MD
 West Hartford, CT




--

Max


Re: [R] issue with odfWeave running on Windows XP; question about installing packages under Linux

2011-05-18 Thread Max Kuhn

Sorry for the delayed response.

An upgrade of the XML package has broken odfWeave; see this thread:

   https://stat.ethz.ch/pipermail/r-help/2011-May/278063.html

That may be your issue. We're working on the problem now. I'll post to
R-Packages when we have a working update. If you like, I can send you
the eventual fixes to test.

Thanks,

Max


On Tue, May 17, 2011 at 3:35 PM,  rmail...@justemail.net wrote:
 I also have a problem using odfWeave on Windows XP with R > R2.11.1. odfWeave
 fails, giving mysterious error messages. (Not quite the same as yours, but
 similar. I sent the info to Max Kuhn privately, but did not get a response
 after two tries.) My odfWeave reporting system worked fine prior to R2.12 and
 then the same code that ran fine under R2.11.1 stopped working. Using the
 very same machine and running the very same code under R2.11.1 it still runs
 fine today. So, something is not quite right with odfWeave on Windows XP for
 R > R2.11.1, and I don't know what it is. My solution is to keep R2.11.1
 around until it can be resolved.

 Eric



 - Original message -
 From: Cormac Long clong...@googlemail.com
 To: r-help@r-project.org
 Date: Fri, 13 May 2011 10:45:06 +0100
 Subject: [R] issue with odfWeave running on Windows XP; question about 
 installing packages under Linux

 Good morning R community,

 I have two questions (and a comment):
 1)
 A problem with odfWeave. I have an odf document
 with a table that spans multiple pages. Each cell in the table is
 populated using \Sexpr{R stuff}. This worked fine on my
 own machine (Windows 7 box using any R2.x.y, for x>=11) and
 on a colleague's machine (Windows XP box running R2.11.1).
 However, on a third machine (Windows XP box running R2.12.0
 or R2.13.0), odfWeave fails with the following error:
    Error in parse(text = cmd) : <text>:1:36: unexpected '>'
    1: GLOBAL_CONTAI<text:soft-page-break/>
 A poke around in the unzipped odt file reveals the culprit:
    \Sexpr{GLOBAL_CONTAI<text:soft-page-break/>NER$repDat$Dec[i]}
 which should read
    \Sexpr{GLOBAL_CONTAINER$repDat$Dec[i]}

 The page break coincides with where the table overruns from
 one page to the next.

 Now, if this was a constant error across all machines, that
 would be annoying, but ok. My questions are:
   a) Can anyone suggest why this has happened on only one
      machine, and not on the others?
   b) Is there any way of handling such silent xml modifications
      (apart from odfTable, which I have only just bumped into, or
      extremely judicious choice of table construction, which is
      tedious and unreliable)?

 2)
 When installing some packages on Linux (notably RODBC and XML),
 you need to ensure that your Linux distro has extra header files installed.
 This is a particular issue in Ubuntu. The question is: is there any way
 that a package can check for necessary external header files and issue
 suitable warnings? For example, if you try to install RODBC on Ubuntu
 without first installing unixodbc-dev, the installation will fail with the
 error:
    configure: error: ODBC headers sql.h and sqlext.h not found
 which is useful, but not particularly suggestive of requiring unixodbc-dev


 A further comment on odfWeave: odfWeave uses system calls to
 zip and unzip when processing the odt documents. Would it not
 be a good idea for the odfWeave package to check for the presence
 of zip and unzip utilities and report accordingly when trying to install?
 By default, Windows XP boxes do not have these utilities installed
 (installing Rtools does away with this problem).


 Many thanks in advance,
 Dr. Cormac Long.





-- 

Max


Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?

2011-05-13 Thread Max Kuhn

XiaoLiu,

I can't see the options in bootControl you used here. Your error is
consistent with leaving classProbs and summaryFunction unspecified.
Please double check that you set them with classProbs = TRUE and
summaryFunction = twoClassSummary before you ran.
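
For reference, a minimal sketch of the control object described above, assuming the caret package is installed. The name `bootControl` mirrors the original post; the bootstrap settings are illustrative rather than the poster's actual values.

```r
library(caret)

# Both options are needed before metric = "ROC" can be used in train():
# classProbs = TRUE asks for class probabilities, and twoClassSummary
# computes ROC, Sens and Spec from them.
bootControl <- trainControl(method = "boot",
                            number = 25,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)

# The result is a plain list, so the settings are easy to double check.
print(bootControl$classProbs)
print(is.function(bootControl$summaryFunction))
```

It would then be passed to train() as `trControl = bootControl` together with `metric = "ROC"`.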

Max

On Thu, May 12, 2011 at 7:04 PM, Jing Liu quiet_jing0...@hotmail.com wrote:

 Dear all,

 I am using the caret package for predictor selection with a randomForest
 model. The following is the train function:

 rfFit <- train(x = trainRatios, y = trainClass, method = "rf", importance = TRUE,
                do.trace = 100, keep.inbag = TRUE,
                tuneGrid = grid, trControl = bootControl, scale = TRUE, metric = "ROC")

 I wanted to use ROC as the metric for variable selection. I know that this
 works with the logit model by making sure that classProbs = TRUE and
 summaryFunction = twoClassSummary in the trainControl function. However, if I
 do the same with randomForest, I get a warning saying that

 In train.default(x = trainPred, y = trainDep, method = "rf",  :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

 I wonder if the ROC metric can be used for randomForest? Have I missed something?
 I would be very grateful if anyone can help!

 Best regards,
 XiaoLiu







-- 

Max


Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?

2011-05-13 Thread Max Kuhn

Frank,

It depends on how you define "optimal". While I'm not a big fan of
using the area under the ROC curve to characterize performance, there are a
lot of times when likelihood measures are clearly sub-optimal in
performance. Using resampled accuracy (or Kappa) instead of deviance
(out-of-bag or not) is likely to produce more accurate models (not
shocking, right?).

The best example is determining the number of boosting iterations.
From Friedman (2001): ``[...] degrading the likelihood by overfitting
actually improves misclassification error rates. Although perhaps
counterintuitive, this is not a contradiction; likelihood and error
rate measure different aspects of fit quality.''
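
A small numeric illustration of that last point, using hand-picked probabilities rather than output from any fitted model: the set of predictions with the better likelihood is not necessarily the one with the lower error rate. The helper names `log_loss` and `err_rate` are made up for this sketch.

```r
# Toy, hand-picked numbers (not from any real model).
y <- c(1, 1, 0, 0)                      # true classes

p_a <- c(0.60, 0.60, 0.40, 0.40)        # cautious, always on the right side
p_b <- c(0.99, 0.45, 0.01, 0.01)        # confident, with one miss

# Mean negative log-likelihood, i.e. binomial deviance / (2n).
log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# Misclassification rate at a 0.5 cutoff.
err_rate <- function(y, p) mean((p > 0.5) != (y == 1))

res <- data.frame(model    = c("A", "B"),
                  log_loss = c(log_loss(y, p_a), log_loss(y, p_b)),
                  error    = c(err_rate(y, p_a), err_rate(y, p_b)))
print(res)
```

Here B wins on likelihood while A wins on misclassification error, so which one is "optimal" depends on the metric chosen.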

My argument here assumes that you are fitting a model for the purposes
of prediction rather than interpretation. This particular case
involves random forests, so I'm hoping that statistical inference is
not the goal.

Ref: Friedman. Greedy function approximation: a gradient boosting
machine. Annals of Statistics (2001) pp. 1189-1232

Thanks,

Max

On Fri, May 13, 2011 at 8:11 AM, Frank Harrell f.harr...@vanderbilt.edu wrote:
Using anything other than deviance (or likelihood) as the objective function
will result in a suboptimal model.
Frank

-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context:
http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3520043.html
Sent from the R help mailing list archive at Nabble.com.


Max

Re: [R] Bigining with a Program of SVR

2011-05-07 Thread Max Kuhn

As far as caret goes, you should read

   http://cran.r-project.org/web/packages/caret/vignettes/caretVarImp.pdf

and look at rfe() and sbf().
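
As a rough idea of what rfe() looks like in use, here is a sketch on simulated data, assuming the caret package is installed. The data, the choice of the linear-model helpers `lmFuncs`, and the resampling settings are all illustrative; an SVM-based selection would need different helper functions.

```r
library(caret)

# Synthetic regression data: only V1 and V2 carry signal.
set.seed(1)
x <- as.data.frame(matrix(rnorm(100 * 6), ncol = 6))
y <- 2 * x$V1 - x$V2 + rnorm(100, sd = 0.5)

# Recursive feature elimination with linear-model helper functions and
# 5-fold cross-validation; subset sizes 1 to 5 are compared to all 6.
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)

print(rfe_fit)
print(predictors(rfe_fit))   # names of the retained predictors
```

sbf() follows the same pattern with sbfControl(), but filters predictors one at a time before the model is fit.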


On Fri, May 6, 2011 at 2:53 PM, ypriverol yprive...@gmail.com wrote:
 Thanks Max. I'm now using the caret library with my data, but the models
 showed a correlation under 0.7. Maybe the problem is with the variables that
 I'm using to generate the model. For that reason I'm asking for some
 packages that allow me to reduce the number of features and to remove the
 worst ones. I recently read an article that combines a genetic algorithm
 with support vector regression to do that.

 Best Regards
 Yasset

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3503918.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Bigining with a Program of SVR

2011-05-04 Thread Max Kuhn

train() uses vectors, matrices and data frames as input. I really
think you need to read materials on basic R before proceeding. Go to
the R web page. There are introductory materials there.
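
For readers in the same position, a minimal base-R sketch of getting CSV data into the shapes train() accepts. The six rows are the ones quoted elsewhere in this thread, written as comma-separated text so the example is self-contained; the file name `mydata.csv` in the comment is hypothetical.

```r
# Values copied from the VDP / V1 / V2 listing in this thread.
csv_text <- "VDP,V1,V2
9.15,1234.5,10
9.15,2345.6,15
6.7,789.0,12
6.7,234.6,11
3.2,123.6,5
3.2,235.7,8"

dat <- read.csv(text = csv_text)   # for a file: read.csv("mydata.csv")
str(dat)

# Split into the two pieces train() expects.
x <- dat[, c("V1", "V2")]          # data frame of predictors
y <- dat$VDP                       # numeric response vector
```

No special "caret format" is involved: a data frame of predictors plus a response vector is enough.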

On Tue, May 3, 2011 at 11:19 AM, ypriverol yprive...@gmail.com wrote:
 I saw the format of the caret data some days ago. Is it possible to convert
 my csv data to the same format as the caret dataset? My idea is to
 first use the same scripts as the caret tutorial, and then remove
 problems related to data formats and incompatibilities.

 Thanks for your time

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3492746.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Bigining with a Program of SVR

2011-05-03 Thread Max Kuhn

See the examples at the end of:

   http://cran.r-project.org/web/packages/caret/vignettes/caretTrain.pdf

for a QSAR data set for modeling the log blood-brain barrier
concentration. SVMs are not used there but, if you use train(), the
syntax is very similar.
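
A hedged sketch of what a support vector regression fit through train() might look like, assuming caret and kernlab are installed. The data are simulated in the same shape as the poster's (a numeric response and numeric predictors); the kernel, tuning length and resampling scheme are illustrative choices, not recommendations from this thread.

```r
library(caret)   # method = "svmRadial" also needs the kernlab package

# Simulated stand-in data, not real measurements.
set.seed(42)
x <- data.frame(V1 = runif(80, 100, 2500), V2 = runif(80, 5, 15))
y <- 3 + 0.002 * x$V1 + 0.3 * x$V2 + rnorm(80, sd = 0.5)

# A numeric y makes train() fit a regression model; the radial-basis
# SVM is tuned over 3 cost values with 5-fold cross-validation.
svr_fit <- train(x, y,
                 method = "svmRadial",
                 preProcess = c("center", "scale"),
                 tuneLength = 3,
                 trControl = trainControl(method = "cv", number = 5))

print(svr_fit)
print(predict(svr_fit, newdata = x[1:3, ]))
```

The printed summary reports resampled RMSE and R-squared for each candidate cost value.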

On Tue, May 3, 2011 at 9:38 AM, ypriverol yprive...@gmail.com wrote:
 Well, first of all, thanks for your answer. I need some example that works with
 Support Vector Regression. This is the format of my data:
  VDP    V1       V2
  9.15   1234.5   10
  9.15   2345.6   15
  6.7     789.0   12
  6.7     234.6   11
  3.2     123.6    5
  3.2     235.7    8

 VDP is the experimental value of the property that I want to predict as
 accurately as possible with the model. The other variables V1, V2 ... are the
 properties used to build the model. I need some examples that introduce me to
 this field. I read some examples from e1071 but all of them are for
 classification problems.

 thanks for your help in advance

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3492487.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] caret - prevent resampling when no parameters to find

2011-05-02 Thread Max Kuhn

Yeah, that didn't work. Use

   fitControl <- trainControl(index = list(seq(along = mdrrClass)))

See ?trainControl to understand what this does in detail.
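
To make the effect of that setting concrete, the sketch below builds the control object and inspects it, assuming caret is installed. The list name `all` is added here only for readability; how train() treats such an index may differ between caret versions.

```r
library(caret)
data(mdrr)   # provides mdrrDescr and mdrrClass

# A single "resample" that contains every row exactly once, so the model
# is fit once on all the data instead of on 25 bootstrap samples.
fitControl <- trainControl(index = list(all = seq(along = mdrrClass)))

print(length(fitControl$index))        # number of resamples
print(length(fitControl$index[[1]]))   # rows used in that resample
```

Newer caret releases also document trainControl(method = "none") for fitting a single model without resampling; check ?trainControl for the installed version.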

Max


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn

It isn't building the same model since each fit is created from
different data sets.

The resampling is sort of the point of the function, but if you really
want to avoid it, supply your own index in trainControl that has every
index (eg, index = seq(along = mdrrClass)). In this case, the
performance it gives is the apparent error rate.

Max

On Sun, May 1, 2011 at 5:57 PM, pdb ph...@philbrierley.com wrote:
 I want to use caret to build a model with an algorithm that actually has no
 parameters to find.

 How do I stop it from repeatedly building the same model 25 times?


 library(caret)
 data(mdrr)
 LOGISTIC_model <- train(mdrrDescr,mdrrClass
                        ,method='glm'
                        ,family=binomial(link=logit)
                        )
 LOGISTIC_model

 528 samples
 342 predictors
  2 classes: 'Active', 'Inactive'

 Pre-processing: None
 Resampling: Bootstrap (25 reps)

 Summary of sample sizes: 528, 528, 528, 528, 528, 528, ...

 Resampling results

  Accuracy  Kappa   Accuracy SD  Kappa SD
  0.552     0.0999  0.0388       0.0776

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p3488761.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Bigining with a Program of SVR

2011-05-01 Thread Max Kuhn

When you say variable do you mean predictors or responses?

In either case, they do. You can generally tell by reading the help
files and looking at the examples.

Max

On Fri, Apr 29, 2011 at 3:47 PM, ypriverol yprive...@gmail.com wrote:
 Hi:
 I'm starting research on Support Vector Regression. I want to obtain a
 model to predict a property A from a set of properties B, C, D, ... This
 problem is very common, for example in QSAR models. I would like to know
 of some examples and packages that could help me with this. I know about
 caret and e1071, but I don't know if these packages can work with
 continuous variables.

 Thanks in advance

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Bigining-with-a-Program-of-SVR-tp3484476p3484476.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn

No, the sampling is done on rows. The definition of a bootstrap
(re)sample is one which is the same size as the original data but
taken with replacement. The Accuracy SD and Kappa SD columns give
you a sense of how the model performance varied across these bootstrap
data sets (i.e. they are not the same data set).

In the end, the original training set is used to fit the final model
that is used for prediction.
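
The definition above can be checked directly in base R. This sketch draws one bootstrap resample of row indices for a data set of 528 rows (the size of the mdrr example); the seed is arbitrary.

```r
set.seed(123)
n <- 528                                  # rows in the mdrr example

# One bootstrap resample: n row indices drawn with replacement.
idx <- sample(n, size = n, replace = TRUE)

length(idx)                               # same size as the original data
length(unique(idx))                       # but fewer distinct rows
mean(!(seq_len(n) %in% idx))              # share of rows left out
```

On average roughly 63% of the rows appear at least once in a resample, so each of the 25 fits sees a different data set even though every one reports 528 samples.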

Max

On Sun, May 1, 2011 at 6:41 PM, pdb ph...@philbrierley.com wrote:
 Hi Max,

 But in this example, it says the sample size is the same as the total number
 of samples, so unless the sampling is done by columns, wouldn't you get
 exactly the same model each time for logistic regression?

 ps - great package btw. I'm just beginning to explore its potential now.

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p341.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] caret - prevent resampling when no parameters to find

2011-05-01 Thread Max Kuhn

Not all modeling functions have both the formula and matrix
interfaces. For example, glm() and rpart() only have a formula method,
enet() has only the matrix interface, and ksvm() and others have both.
This was one reason I created the package (so we don't have to
remember all this).

train() lets you specify the model either way. When the actual model
is fit, it favors the matrix interface whenever possible (since it is
more efficient) and works out the details behind the scenes.

For your example, you can fit the model you want using train():

train(mdrrDescr,mdrrClass,method='glm')

If y is a factor, it automatically adds the 'family = binomial' option
when the model is fit (so you don't have to).
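
To show the two interfaces side by side outside caret, here is a base-R sketch using the built-in mtcars data (chosen only because it ships with R). glm() takes a formula, while glm.fit(), the function mentioned in the quoted reply below, takes a design matrix and a response.

```r
# Formula interface: am (0/1) is modeled from wt.
fit_formula <- glm(am ~ wt, data = mtcars, family = binomial)

# Matrix interface: the design matrix (with intercept) is built by hand.
x <- model.matrix(~ wt, data = mtcars)
fit_matrix <- glm.fit(x = x, y = mtcars$am, family = binomial())

# Both routes give the same coefficients.
print(coef(fit_formula))
print(fit_matrix$coefficients)
```

This is the bookkeeping that train() hides when it picks an interface for the chosen method.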

Max

On Sun, May 1, 2011 at 7:18 PM, pdb ph...@philbrierley.com wrote:
glm.fit - answered my own question by reading the manual!
--
View this message in context:
http://r.789695.n4.nabble.com/caret-prevent-resampling-when-no-parameters-to-find-tp3488761p3488923.html
Sent from the R help mailing list archive at Nabble.com.

Max

Re: [R] odfWeave Error unzipping file in Win 7

2011-03-21 Thread Max Kuhn

I don't think that this is the issue, but test it on a file without spaces.
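
A small base-R sketch of how that test could be set up, together with a check that zip and unzip are visible to R at all. The file name is the one from the quoted session; replacing spaces with underscores is just one possible naming choice.

```r
# Is an unzip program visible to R? An empty string means "not found".
print(Sys.which(c("zip", "unzip")))

# Build a space-free name for a copy of the template.
src  <- "Report input template.odt"
safe <- gsub("[[:space:]]+", "_", src)
print(safe)

# Copy only if the file is actually present in the working directory.
if (file.exists(src)) file.copy(src, safe, overwrite = TRUE)
```

Running odfWeave() on the copy would then show whether the spaces play any role in the unzip failure.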

On Mon, Mar 21, 2011 at 2:25 PM,  rmail...@justemail.net wrote:

 I have a very similar error that cropped up when I upgraded to R 2.12 and 
 persists at R 2.12.1. I am running R on Windows XP and OO is at version 3.2. 
 I did not make any changes to my R code or ODF code or configuration to 
 produce this error. Only upgraded R.

 Many Thanks,

 Eric

 R session:


 odfWeave ( 'Report input template.odt' , 'August 2011.odt')
  Copying  Report input template.odt
  Setting wd to  C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483
  Unzipping ODF file using unzip -o "Report input template.odt"
 Error in odfWeave("Report input template.odt", "August 2011.odt") :
  Error unzipping file

 


 When I start a shell and go to the temp directory in question and copy the 
 exact command that the error message says produced an error the command runs 
 fine. Here is that session:

 Microsoft Windows XP [Version 5.1.2600]
 (C) Copyright 1985-2001 Microsoft Corp.

 H:\>c:

 C:\>cd C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2/odfWeave2153483

 C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>dir
  Volume in drive C has no label.
  Volume Serial Number is 7464-62CA

  Directory of C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483

 03/21/2011  11:11 AM    <DIR>          .
 03/21/2011  11:11 AM    <DIR>          ..
 03/21/2011  11:11 AM            13,780 Report input template.odt
               1 File(s)         13,780 bytes
               2 Dir(s)   7,987,343,360 bytes free

 C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>unzip -o "Report input template.odt"
 Archive:  Report input template.odt
  extracting: mimetype
   creating: Configurations2/statusbar/
  inflating: Configurations2/accelerator/current.xml
   creating: Configurations2/floater/
   creating: Configurations2/popupmenu/
   creating: Configurations2/progressbar/
   creating: Configurations2/menubar/
   creating: Configurations2/toolbar/
   creating: Configurations2/images/Bitmaps/
  inflating: content.xml
  inflating: manifest.rdf
  inflating: styles.xml
  extracting: meta.xml
  inflating: Thumbnails/thumbnail.png
  inflating: settings.xml
  inflating: META-INF/manifest.xml

 C:\DOCUME~1\Koster01\LOCALS~1\Temp\Rtmp4uCcY2\odfWeave2153483>







 - Original message -
 From: psycho-ld battlecry...@web.de
 To: r-help@r-project.org
 Date: Sun, 23 Jan 2011 01:47:44 -0800 (PST)
 Subject: [R] odfWeave Error unzipping file in Win 7


 Hey guys,

 I'm just getting started with R (version 2.12.0) and odfWeave and keep
 stumbling from one problem to the next. The current one is the following:

 trying to use odfWeave:

 odfctrl <- odfWeaveControl(
              zipCmd = c("C:/Program Files/unz552dN/VBunzip.exe $$file$$ .",
               "C:/Program Files/unz552dN/VBunzip.exe $$file$$"))

 odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl)
  Copying  C:/testat.odt
  Setting wd to D:\Users\egf\AppData\Local\Temp\Rtmpmp4E1J/odfWeave23103351832
  Unzipping ODF file using C:/Program Files/unz552dN/VBunzip.exe "testat.odt"
 Error in odfWeave("C:/testat.odt", "C:/iris.odt", control = odfctrl) :
  Error unzipping file

 So I tried a few other unzipping programs like jar and 7-zip, but still the
 same problem occurs. I also tried to install zip and unzip, but then I get
 an error message that registration failed (Error 1904).

 If there are any more questions, just ask. It would be great if someone could
 help me.

 cheers
 psycho-ld

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/odfWeave-Error-unzipping-file-in-Win-7-tp3232359p3232359.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Specify feature weights in model prediction (CARET)

2011-03-16 Thread Max Kuhn

 Using the 'CARET' package, is it possible to specify weights for features
 used in model prediction?

For what model?

 And for the 'knn' implementation, is there a way
 to choose a distance metric (i.e. Mahalanobis distance)?


No, sorry.

Max


Re: [R] use caret to rank predictors by random forest model

2011-03-07 Thread Max Kuhn

It would help if you provided the code that you used for the caret functions.

The most likely issue is not using importance = TRUE in the call to train().

I believe that I've only implemented code for plotting the varImp
objects resulting from train() (e.g. there is plot.varImp.train but not
plot.varImp).

Max

On Mon, Mar 7, 2011 at 3:27 PM, Xiaoqi Cui x...@mtu.edu wrote:
 Hi,

 I'm using the caret package to rank predictors with a random forest model and
 draw a predictor importance plot. I used the commands below:

 rf.fit <- randomForest(x, y, ntree = 500, importance = TRUE)
 ## x is a matrix whose columns are predictors, y is a binary response vector
 ## Then I got the ranked predictors by ranking
 rf.fit$importance[, "MeanDecreaseAccuracy"]
 ## Then draw the importance plot
 varImpPlot(rf.fit)

 As you can see, all the functions I used come directly from the
 randomForest package rather than from caret, so I'm wondering whether
 caret has functions that can do the above ranking and plotting.

 In fact, I tried the functions train, varImp and plot from caret, but
 the random forest model built by train could not be passed correctly to
 varImp, which gave an error message like "subscripts out of bounds". The
 plot function doesn't work either.

 So I'm wondering if anybody has encountered the same problem before, and 
 could shed some light on this. I would really appreciate your help.

 Thanks,
 Xiaoqi





-- 

Max


[R] Course: R for Predictive Modeling: A Hands-On Introduction

2011-03-04 Thread Max Kuhn

R for Predictive Modeling: A Hands-On Introduction

Predictive Analytics World in San Francisco
Sunday March 13, 9am to 4:30pm

This one-day session provides a hands-on introduction to R, the
well-known open-source platform for data analysis. Real examples are
employed in order to methodically expose attendees to best practices
driving R and its rich set of predictive modeling packages, providing
hands-on experience and know-how. R is compared to other data analysis
platforms, and common pitfalls in using R are addressed.


Re: [R] ROC from R-SVM?

2011-02-22 Thread Max Kuhn

The objective functions for kernel methods are unrelated to the area
under the ROC curve. However, you can try to choose the cost and
kernel parameters to maximize the ROC AUC.

See the caret package, specifically the train function.

Max

On Mon, Feb 21, 2011 at 5:34 PM, Angel Russo angerusso1...@gmail.com wrote:
 Hi,

 Does anyone know how I can show an ROC curve for R-SVM? I understand that in
 R-SVM we are not optimizing over the SVM cost parameter. Any example ROC code
 for R-SVM or guidance would be really useful.

 Thanks, Angel.

        [[alternative HTML version deleted]]





-- 

Max


Re: [R] Random Forest Cross Validation

2011-02-20 Thread Max Kuhn

 I am using the randomForest package to do some prediction on GWAS data. I
 first split the data into training and testing sets (70% vs 30%), then
 used the training set to grow the trees (ntree=10). The OOB error on the
 training set looks good (10%). However, performance on the test set is
 poor, with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also
need to include that step in the cross-validation to get realistic
performance estimates (see Ambroise and McLachlan. Selection bias in
gene extraction on the basis of microarray gene-expression data.
Proceedings of the National Academy of Sciences (2002) vol. 99 (10)
pp. 6562-6566).

In the caret package, train() can be used to get cross-validation
estimates for RF and the sbf() function (for selection by filter) can
be used to include simple univariate filters in the CV procedure.
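
As a rough illustration of the selection-bias point (a base R sketch on simulated noise, not the caret implementation of sbf()), the code below runs the same univariate filter either once on all rows or again inside each fold. The helper names pick(), fit_predict() and cv_accuracy(), the nearest-centroid classifier and all sizes are assumptions made for the example. Since the predictors carry no signal, an honest estimate should sit near 50% accuracy.

```r
# Sketch: selection bias when a univariate filter is run outside cross-validation.
# All data are simulated noise, so a realistic accuracy estimate is near 0.5.
set.seed(42)
n <- 60; p <- 500; k <- 10            # samples, predictors, folds
x <- matrix(rnorm(n * p), nrow = n)
y <- factor(rep(c("a", "b"), each = n / 2))
folds <- sample(rep(seq_len(k), length.out = n))

# Keep the `top` predictors with the largest absolute Welch t statistic
pick <- function(x, y, top = 10) {
  a <- x[y == "a", , drop = FALSE]
  b <- x[y == "b", , drop = FALSE]
  tstat <- (colMeans(a) - colMeans(b)) /
    sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  order(abs(tstat), decreasing = TRUE)[seq_len(top)]
}

# Nearest-centroid classifier, used only to keep the example in base R
fit_predict <- function(xtr, ytr, xte) {
  ca <- colMeans(xtr[ytr == "a", , drop = FALSE])
  cb <- colMeans(xtr[ytr == "b", , drop = FALSE])
  da <- rowSums(sweep(xte, 2, ca)^2)
  db <- rowSums(sweep(xte, 2, cb)^2)
  factor(ifelse(da < db, "a", "b"), levels = levels(ytr))
}

cv_accuracy <- function(select_inside) {
  keep_all <- pick(x, y)              # filter fitted on ALL rows (the biased way)
  hits <- logical(n)
  for (f in seq_len(k)) {
    te <- folds == f
    keep <- if (select_inside) pick(x[!te, ], y[!te]) else keep_all
    pred <- fit_predict(x[!te, keep, drop = FALSE], y[!te],
                        x[te, keep, drop = FALSE])
    hits[te] <- pred == y[te]
  }
  mean(hits)
}

acc_outside <- cv_accuracy(select_inside = FALSE)  # filter outside CV: optimistic
acc_inside  <- cv_accuracy(select_inside = TRUE)   # filter inside CV: realistic
```

Comparing acc_outside with acc_inside shows how much of the apparent performance comes from letting the held-out rows take part in the feature selection.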

 Although some people said no cross-validation was necessary for RF, I still
 felt unsafe and thought a testing set is important. I felt really frustrated
 with the results.

CV is needed when you want an assessment of performance on a test set.
In this sense, RF is like any other method.

-- 

Max


Re: [R] caret::train() and ctree()

2011-02-16 Thread Max Kuhn

Andrew,

ctree only tunes over mincriterion and ctree2 tunes over maxdepth
(while fixing mincriterion = 0).

Seeing both listed as the function is being executed is a bug. I'll
set up checks to make sure that the columns specified in tuneGrid are
actually the tuning parameters that are used.
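
A check of that kind could look roughly like the sketch below. This is not caret's code: check_grid() and the tuned_by lookup table are hypothetical names, and the table only restates the two methods discussed in this thread.

```r
# Hypothetical helper: stop when a tuning grid names parameters a method does not tune.
# The lookup table restates the two methods from this thread only.
tuned_by <- list(ctree = "mincriterion", ctree2 = "maxdepth")

check_grid <- function(method, grid) {
  given <- sub("^\\.", "", names(grid))   # grid columns here carry a leading dot
  extra <- setdiff(given, tuned_by[[method]])
  if (length(extra) > 0) {
    stop("'", method, "' does not tune: ", paste(extra, collapse = ", "))
  }
  invisible(TRUE)
}

ok_grid  <- expand.grid(.mincriterion = c(0.95, 0.99))
bad_grid <- expand.grid(.mincriterion = c(0.95, 0.99), .maxdepth = c(2, 4))

check_grid("ctree", ok_grid)              # passes silently
msg <- tryCatch(check_grid("ctree", bad_grid), error = conditionMessage)
msg                                       # names the unsupported column
```

Failing early like this would avoid the situation in the quoted log, where the unsupported column is silently averaged in the summary.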

Max

On Wed, Feb 16, 2011 at 12:01 PM, Andrew Ziem az...@us.ci.org wrote:
 Just as earth can be trained simultaneously over degree and nprune, is there a
 way to train ctree simultaneously over mincriterion and maxdepth?

 Also, I notice there are separate methods ctree and ctree2, and if I try to
 tune both options with one method, the summary averages the option it
 doesn't support.  The full log is attached; notice in the lines
 below for method=ctree that maxdepth=c(2,4) is averaged to maxdepth=3.

 Fitting: maxdepth=2, mincriterion=0.95
 Fitting: maxdepth=4, mincriterion=0.95
 Fitting: maxdepth=2, mincriterion=0.99
 Fitting: maxdepth=4, mincriterion=0.99

  mincriterion  Accuracy  Kappa  maxdepth  Accuracy SD  Kappa SD  maxdepth SD
  0.95          0.939     0.867  3         0.0156       0.0337    1.01
  0.99          0.94      0.868  3         0.0157       0.0337    1.01

 I use R 2.12.1 and caret 4.78.

 Andrew








-- 

Max


Re: [R] Train error:: subscript out of bonds

2011-01-26 Thread Max Kuhn

Sort of. It lets you define a grid of candidate values to test and
the rule used to choose the best one. For some models, it is easy to
come up with default values that work well (e.g. RBF SVMs, PLS, KNN),
while others are more data dependent. In the latter case, the defaults
may not work well.
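
The idea of "a grid of candidates plus a selection rule" can be sketched in base R without caret. Everything below is an assumption made for illustration: the simulated data, the polynomial degree as the tuning parameter, the cv_rmse() helper and the "smallest cross-validated RMSE wins" rule.

```r
# Sketch of "grid of candidates + selection rule" in base R.
set.seed(1)
n <- 120
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

grid  <- expand.grid(degree = 1:6)            # candidate tuning values
folds <- sample(rep(1:10, length.out = n))    # 10-fold assignment

# Cross-validated RMSE for one candidate polynomial degree
cv_rmse <- function(degree) {
  err <- numeric(n)
  for (f in 1:10) {
    te  <- folds == f
    fit <- lm(y ~ poly(x, degree), data = dat[!te, ])
    err[te] <- dat$y[te] - predict(fit, newdata = dat[te, ])
  }
  sqrt(mean(err^2))
}

grid$RMSE <- vapply(grid$degree, cv_rmse, numeric(1))
best <- grid[which.min(grid$RMSE), ]          # selection rule: smallest RMSE
```

A default grid that happens to exclude the good region (say, degree 1 only) would still return a "best" row, which is the sense in which defaults may not work well for data-dependent parameters.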

Max

On Wed, Jan 26, 2011 at 5:45 AM, Neeti nikkiha...@gmail.com wrote:

 What I have understood about the caret train() method is that train() itself
 does the model selection and tunes the parameters (please correct me if I am
 wrong). That was my first motivation for selecting this package and method for
 fitting the model. I then pass the tuned parameters to the e1071 svm() method
 and compare the results.

 fit1 <- train(train1, as.factor(trainset[, ncol(trainset)]), "svmpoly",
               trControl = trainControl(method = "cv", number = 10,
                                        verboseIter = FALSE),
               tuneLength = 3)

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3237800.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Train error:: subscript out of bonds

2011-01-26 Thread Max Kuhn

No. Any valid seed should work. In this case, train() should only be
using it to determine which training set samples are in the CV or
bootstrap data sets.
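
To make that concrete, here is a small base R sketch of how a seed fixes the resampling partition. make_folds() is a hypothetical helper, not a caret function, and the sizes are arbitrary.

```r
# Hypothetical helper: assign each of n rows to one of k folds.
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

a <- make_folds(n = 24, k = 10, seed = 357)
b <- make_folds(n = 24, k = 10, seed = 357)

identical(a, b)   # the same seed reproduces the same partition
table(a)          # every fold receives 2 or 3 rows
```

Because the seed only decides which rows land in which resample, a failure that appears for one seed and not another points at the contents of a particular resample rather than at the seed value itself.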

Max

On Wed, Jan 26, 2011 at 9:56 AM, Neeti nikkiha...@gmail.com wrote:

 Thank you so much for your reply. In my case, it gives an error for some seed
 values; for example, setting the seed to 357 gives an error. Does train
 have some specific seed range?
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3238197.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] Train error:: subscript out of bonds

2011-01-25 Thread Max Kuhn

What version of caret and R? We'll also need a reproducible example.


On Mon, Jan 24, 2011 at 12:44 PM, Neeti nikkiha...@gmail.com wrote:

 Hi,
 I am trying to construct a svmpoly model using the caret package (please
 see code below). Using the same data, without changing any setting, I am
 just changing the seed value. Sometimes it constructs the model
 successfully, and sometimes I get an “Error in indexes[[j]] : subscript out
 of bounds”.
 For example, when I set the seed to 357, the following code produced results
 for only 8 iterations, and on the 9th iteration it stopped with a “subscript
 out of bounds” error. I don’t understand why.

 Any help would be great
 thanks
 ###
 for (i in 1:10)
  {
    fit1 <- NULL
    x <- NULL
    x <- which(number == i)
    trainset <- d[-x, ]
    testset <- d[x, ]
    train1 <- trainset[, -ncol(trainset)]
    train1 <- train1[, -1]
    test_t <- testset[, -ncol(testset)]
    species_test <- as.factor(testset[, ncol(testset)])
    test_t <- test_t[, -1]

    # CARET::TRAIN
    fit1 <- train(train1, as.factor(trainset[, ncol(trainset)]), "svmpoly",
                  trControl = trainControl(method = "cv", number = 10,
                                           verboseIter = FALSE),
                  tuneLength = 3)
    pred <- predict(fit1, test_t)
    t_train[[i]] <- table(predicted = pred, observed = testset[, ncol(testset)])
    tune_result[[i]] <- fit1$results
    tune_best <- fit1$bestTune
    scale1[i] <- tune_best[[3]]
    degree[i] <- tune_best[[2]]
    c1[i] <- tune_best[[1]]
  }


 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Train-error-subscript-out-of-bonds-tp3234510p3234510.html
 Sent from the R help mailing list archive at Nabble.com.





-- 

Max


Re: [R] circular reference lines in splom

2011-01-20 Thread Max Kuhn

This did the trick:

panel.circ3 <- function(...)
  {
    args <- list(...)
    circ1 <- ellipse(diag(rep(1, 2)), t = 1)
    panel.xyplot(circ1[,1], circ1[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    circ2 <- ellipse(diag(rep(1, 2)), t = 2)
    panel.xyplot(circ2[,1], circ2[,2],
                 type = "l",
                 lty = trellis.par.get("reference.line")$lty,
                 col = trellis.par.get("reference.line")$col,
                 lwd = trellis.par.get("reference.line")$lwd)
    panel.xyplot(args$x, args$y,
                 groups = args$groups,
                 subscripts = args$subscripts)
  }


splom(~dat, groups = grps,
      lower.panel = panel.circ3,
      upper.panel = panel.circ3)


Thanks,

Max

On Thu, Jan 20, 2011 at 11:13 AM, Peter Ehlers ehl...@ucalgary.ca wrote:
 On 2011-01-19 20:15, Max Kuhn wrote:

 Hello everyone,

 I'm stumped. I'd like to create a scatterplot matrix with circular
 reference lines. Here is an example in 2d:

 library(ellipse)

 set.seed(1)
 dat <- matrix(rnorm(300), ncol = 3)
 colnames(dat) <- c("X1", "X2", "X3")
 dat <- as.data.frame(dat)
 grps <- factor(rep(letters[1:4], 25))

 panel.circ <- function(x, y, ...)
   {
     circ1 <- ellipse(diag(rep(1, 2)), t = 1)
     panel.xyplot(circ1[,1], circ1[,2],
                  type = "l",
                  lty = 2)
     circ2 <- ellipse(diag(rep(1, 2)), t = 2)
     panel.xyplot(circ2[,1], circ2[,2],
                  type = "l",
                  lty = 2)
     panel.xyplot(x, y)
   }

 xyplot(X2 ~ X1, data = dat,
        panel = panel.circ,
        aspect = 1)

 I'd like to do the same with splom, but with groups.

 My latest attempt:

 panel.circ2 <- function(x, y, groups, ...)
   {
     circ1 <- ellipse(diag(rep(1, 2)), t = 1)
     panel.xyplot(circ1[,1], circ1[,2],
                  type = "l",
                  lty = 2)
     circ2 <- ellipse(diag(rep(1, 2)), t = 2)
     panel.xyplot(circ2[,1], circ2[,2],
                  type = "l",
                  lty = 2)
     panel.xyplot(x, y, type = "p", groups)
   }



 splom(~dat,
       panel = panel.superpose,
       panel.groups = panel.circ2)

 produces nothing but warnings:

 warnings()

 Warning messages:
 1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

 It does not appear to me that panel.circ2 is even being called.

 Thanks,

 Max

 I don't see a function panel.groups() in lattice.
 Does this do what you want or am I missing the point:

  splom(~dat|grps, panel = panel.circ2)

 Peter Ehlers




-- 

Max


[R] circular reference lines in splom

2011-01-19 Thread Max Kuhn

Hello everyone,

I'm stumped. I'd like to create a scatterplot matrix with circular
reference lines. Here is an example in 2d:

library(ellipse)

set.seed(1)
dat <- matrix(rnorm(300), ncol = 3)
colnames(dat) <- c("X1", "X2", "X3")
dat <- as.data.frame(dat)
grps <- factor(rep(letters[1:4], 25))

panel.circ <- function(x, y, ...)
  {
    circ1 <- ellipse(diag(rep(1, 2)), t = 1)
    panel.xyplot(circ1[,1], circ1[,2],
                 type = "l",
                 lty = 2)
    circ2 <- ellipse(diag(rep(1, 2)), t = 2)
    panel.xyplot(circ2[,1], circ2[,2],
                 type = "l",
                 lty = 2)
    panel.xyplot(x, y)
  }

xyplot(X2 ~ X1, data = dat,
   panel = panel.circ,
   aspect = 1)

I'd like to do the same with splom, but with groups.

My latest attempt:

panel.circ2 <- function(x, y, groups, ...)
  {
    circ1 <- ellipse(diag(rep(1, 2)), t = 1)
    panel.xyplot(circ1[,1], circ1[,2],
                 type = "l",
                 lty = 2)
    circ2 <- ellipse(diag(rep(1, 2)), t = 2)
    panel.xyplot(circ2[,1], circ2[,2],
                 type = "l",
                 lty = 2)
    panel.xyplot(x, y, type = "p", groups)
  }



splom(~dat,
  panel = panel.superpose,
  panel.groups = panel.circ2)

produces nothing but warnings:

 warnings()
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

It does not appear to me that panel.circ2 is even being called.

Thanks,

Max

 sessionInfo()
R version 2.11.1 Patched (2010-09-30 r53356)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] lattice_0.19-11 ellipse_0.3-5

loaded via a namespace (and not attached):
[1] grid_2.11.1  tools_2.11.1



-- 

Max


[R] less than full rank contrast methods

2010-12-06 Thread Max Kuhn

I'd like to make a less than full rank design using dummy variables
for factors. Here is some example data:

when <- data.frame(time = c("afternoon", "night", "afternoon",
                            "morning", "morning", "morning",
                            "morning", "afternoon", "afternoon"),
                   day = c("Monday", "Monday", "Monday",
                           "Wednesday", "Wednesday", "Friday",
                           "Saturday", "Saturday", "Friday"))

For a single factor, I can do this using

 head(model.matrix(~time -1, data = when))
  timeafternoon timemorning timenight
1             1           0         0
2             0           0         1
3             1           0         0
4             0           1         0
5             0           1         0
6             0           1         0

but this breaks down for multi-variable formulas such as time + day or
time + day + time:day.

I've looked for alternate contrast functions to do this and I haven't
figured out a way to coerce existing functions to get the desired
output. Hopefully I haven't missed anything obvious.
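
One approach that may be worth trying (a sketch, not a confirmed answer from this thread) is to hand model.matrix() an identity "contrast" for every factor through its contrasts.arg argument, using contrasts(f, contrasts = FALSE). The factors are created explicitly here because newer R versions no longer convert strings to factors in data.frame(); the name full_dummies is made up for the example.

```r
when <- data.frame(
  time = factor(c("afternoon", "night", "afternoon", "morning", "morning",
                  "morning", "morning", "afternoon", "afternoon")),
  day  = factor(c("Monday", "Monday", "Monday", "Wednesday", "Wednesday",
                  "Friday", "Saturday", "Saturday", "Friday"))
)

# contrasts(f, contrasts = FALSE) returns an identity matrix: one column per level
full_dummies <- lapply(when[c("time", "day")], contrasts, contrasts = FALSE)

X <- model.matrix(~ time + day, data = when, contrasts.arg = full_dummies)
colnames(X)    # intercept plus a dummy column for every level of both factors
qr(X)$rank     # smaller than ncol(X), i.e. less than full rank
```

Every level of both factors keeps its own indicator column, so the design is deliberately rank deficient.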

Thanks,

Max

 sessionInfo()
R version 2.11.1 Patched (2010-09-11 r52910)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base


Re: [R] Sporadic errors when training models using CARET

2010-11-23 Thread Max Kuhn

Kendric,

I've seen these too and traceback() usually goes back to ksvm(). This
doesn't mean that the error is there, but the results of traceback()
from you would be helpful.

thanks,

Max

On Mon, Nov 22, 2010 at 6:18 PM, Kendric Wang
kendr...@interchange.ubc.ca wrote:
 Hi. I am trying to construct a svmLinear model using the caret package
 (see code below). Using the same data, without changing any setting,
 sometimes it constructs the model successfully, and sometimes I get an index
 out of bounds error. Is this unexpected behaviour? I would appreciate any
 insights into this issue.


 Thanks.
 ~Kendric


 train.y
  [1] S S S S R R R R R R R R R R R R R R R R R R R R
 Levels: R S

 train.x
        m1      m2
 1   0.1756  0.6502
 2   0.1110 -0.2217
 3   0.0837 -0.1809
 4  -0.3703 -0.2476
 5   8.3825  2.8814
 6   5.6400 12.9922
 7   7.5537  7.4809
 8   3.5005  5.7844
 9  16.8541 16.6326
 10  9.1851  8.7814
 11  1.4405 11.0132
 12  9.8795  2.6182
 13  8.7151  4.5476
 14 -0.2092 -0.7601
 15  3.6876  2.5772
 16  8.3776  5.0882
 17  8.6567  7.2640
 18 20.9386 20.1107
 19 12.2903  4.7864
 20 10.5920  7.5204
 21 10.2679  9.5493
 22  6.2023 11.2333
 23 -5.0720 -4.8701
 24  6.6417 11.5139

 svmLinearGrid <- expand.grid(.C = 0.1)
 svmLinearFit <- train(train.x, train.y, method = "svmLinear",
                       tuneGrid = svmLinearGrid)
 Fitting: C=0.1
 Error in indexes[[j]] : subscript out of bounds

 svmLinearFit <- train(train.x, train.y, method = "svmLinear",
                       tuneGrid = svmLinearGrid)
 Fitting: C=0.1
 maximum number of iterations reached 0.0005031579 0.0005026807
 maximum number of iterations reached 0.0002505857 0.0002506714
 Error in indexes[[j]] : subscript out of bounds

 svmLinearFit <- train(train.x, train.y, method = "svmLinear",
                       tuneGrid = svmLinearGrid)
 Fitting: C=0.1
 maximum number of iterations reached 0.0003270061 0.0003269764
 maximum number of iterations reached 7.887867e-05 7.866367e-05
 maximum number of iterations reached 0.0004087571 0.0004087466
 Aggregating results
 Selecting tuning parameters
 Fitting model on full training set


 R version 2.11.1 (2010-05-31)
 x86_64-redhat-linux-gnu

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] splines   stats     graphics  grDevices utils     datasets  methods
 [8] base

 other attached packages:
  [1] kernlab_0.9-12  pamr_1.47       survival_2.35-8 cluster_1.12.3
  [5] e1071_1.5-24    class_7.3-2     caret_4.70      reshape_0.8.3
  [9] plyr_1.2.1      lattice_0.18-8

 loaded via a namespace (and not attached):
 [1] grid_2.11.1


 --
 MSc. Candidate
 CIHR/MSFHR Training Program in Bioinformatics
 University of British Columbia

        [[alternative HTML version deleted]]





-- 

Max


Re: [R] cross validation using e1071:SVM

2010-11-23 Thread Max Kuhn

Neeti,

I'm pretty sure that the error is related to the confusionMatrix call,
which is in the caret package, not e1071.

The error message is pretty clear: you need to pass in two factor
objects that have the same levels. You can check by running the
commands:

   str(pred_true1)
   str(species_test)

Also, caret can do the resampling for you instead of you writing the
loop yourself.

Max


Re: [R] odfWeave - Format error discovered in the file in sub-document content.xml at 2, 4047 (row, col)

2010-11-16 Thread Max Kuhn

Can you try it with version 7.16 on R-Forge? Use

 install.packages("odfWeave", repos = "http://R-Forge.R-project.org")

to get it.

Thanks,

Max

On Tue, Nov 16, 2010 at 8:26 AM, Søren Højsgaard
soren.hojsga...@agrsci.dk wrote:
 Dear Mike,

 Good point - thanks. The lines that caused the error mentioned above are 
 simply:

 <<>>=
 x <- 1:10
 x
 @

 I could add that the document 'simple.odt' (which comes with odfWeave) causes 
 the same error - but at row=109, col=1577

 sessionInfo()
 R version 2.12.0 (2010-10-15)
 Platform: x86_64-pc-mingw32/x64 (64-bit)

 locale:
 [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    
 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                    
 LC_TIME=Danish_Denmark.1252

 attached base packages:
 [1] grid      stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] MASS_7.3-8      odfWeave_0.7.14 XML_3.2-0.1     lattice_0.19-13

 loaded via a namespace (and not attached):
 [1] tools_2.12.0


 Regards
 Søren

 -----Original message-----
 From: Mike Marchywka [mailto:marchy...@hotmail.com]
 Sent: 16 November 2010 12:56
 To: Søren Højsgaard; r-h...@stat.math.ethz.ch
 Subject: RE: [R] odfWeave - Format error discovered in the file in sub-document
 content.xml at 2, 4047 (row, col)








 
 From: soren.hojsga...@agrsci.dk
 To: r-h...@stat.math.ethz.ch
 Date: Tue, 16 Nov 2010 11:32:06 +0100
 Subject: [R] odfWeave - Format error discovered in the file in sub-document 
 content.xml at 2, 4047 (row, col)


 When using odfWeave on an OpenOffice input document, I can not open the 
 output document. I get the message

 Format error discovered in the file in sub-document content.xml at 2,4047 
 (row,col)

 Can anyone help me on this? (Apologies if this has been discussed before; I 
 have not been able to find any info...)

 well, if it really means line 2 you could post the first few lines. Did you 
 expect a line
 with 4047 columns?




 Info:
 I am using R 2.12.0 on Windows 7 (64 bit). I have downloaded the XML package 
 from http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.12/ and I have 
 compiled odfWeave myself

 Best regards
 Søren

        [[alternative HTML version deleted]]








-- 

Max


Re: [R] to determine the variable importance in svm

2010-10-26 Thread Max Kuhn

 The caret package has answers to all your questions.

 1) How to obtain a variable (attribute) importance using
 e1071:SVM (or other
 svm methods)?

I haven't implemented a model-specific method for variable importance
for SVM models. I know of one package (svmpath) that will return the
regression coefficients (e.g. the \beta values of x'\beta) for two
class models. There are probably other methods for non-linear kernels,
but I haven't coded anything (any volunteers?).

When there is no variable importance method implemented for
classification models, caret calculates an ROC curve for each
predictor and returns the AUC. For 3+ classes, it returns the maximum
AUC on the one-vs-all ROC curves.
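
The per-predictor AUC idea can be sketched in base R as follows. This is not caret's code: auc() is a hypothetical helper built on the rank-sum (Mann-Whitney) identity, and folding values below 0.5 with max(a, 1 - a) is an assumption about how the direction of a predictor would be handled.

```r
# AUC from the rank-sum (Mann-Whitney) identity; ties get average ranks
auc <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Two-class version of iris: versicolor vs virginica
two <- droplevels(subset(iris, Species != "setosa"))
is_pos <- two$Species == "virginica"

# One AUC per predictor, each predictor used on its own as the score
imp <- vapply(two[, 1:4], function(col) {
  a <- auc(col, is_pos)
  max(a, 1 - a)        # assumption: the direction of the predictor is ignored
}, numeric(1))

sort(imp, decreasing = TRUE)
```

Each predictor is scored in isolation, so this kind of importance says nothing about interactions or about what the fitted SVM actually uses.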

Note also that caret uses ksvm in kernlab for no other reason than that it
has a bunch of available kernels and similar methods (rvm, etc).

 2) how to validate the results of svm?

If you use caret, you can look at:

  http://user2010.org/slides/Kuhn.pdf
  http://www.jstatsoft.org/v28/i05

and the four package vignettes.

Max


Re: [R] Random Forest AUC

2010-10-22 Thread Max Kuhn

Ravishankar,

 I used Random Forest with a couple of data sets I had to predict for binary
 response. In all the cases, the AUC of the training set is coming to be 1.
 Is this always the case with random forests? Can someone please clarify
 this?

This is pretty typical for this model.

 I have given a simple example, first using logistic regression and then
 using random forests to explain the problem. AUC of the random forest is
 coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so
the area under the ROC curve is likely to be less than one, but still
much higher than it really is (since you are re-predicting the same data).

For your example:

 performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
[1] 0.9972

but using simple 10-fold CV:

 library(caret)
 ctrl <- trainControl(method = "cv",
+                     classProbs = TRUE,
+                     summaryFunction = twoClassSummary)

 set.seed(1)
 cvEstimate <- train(Species ~ ., data = iris,
+                    method = "glm",
+                    metric = "ROC",
+                    trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
 cvEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
    metric = "ROC", trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86  0.0843   0.0632   0.126

and for random forest:

 set.seed(1)
 rfEstimate <- train(Species ~ .,
+                    data = iris,
+                    method = "rf",
+                    metric = "ROC",
+                    tuneGrid = data.frame(.mtry = 2),
+                    trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
 rfEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
    metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14     0.00632

Tuning parameter 'mtry' was held constant at a value of 2

-- 

Max


Re: [R] Understanding linear contrasts in Anova using R

2010-09-30 Thread Max Kuhn

These two resources might also help:

   http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
   http://cran.r-project.org/web/packages/contrast/vignettes/contrast.pdf

Max


On Thu, Sep 30, 2010 at 1:33 PM, Ista Zahn iz...@psych.rochester.edu wrote:
 Hi Professor Howell,
 I think the issue here is simply in the assumption that the regression
 coefficients will always be equal to the product of the means and the
 contrast codes. I tend to think of regression coefficients as the
 quotient of the covariance of x and y divided by the variance of x,
 and this definition agrees with the coefficients calculated by lm().
 See below for a long-winded example.

 On Wed, Sep 29, 2010 at 3:42 PM, David Howell david.how...@uvm.edu wrote:
 # I am trying to understand how R fits models for contrasts in a
 # simple one-way anova. This is an example, I am not stupid enough to want
 # to simultaneously apply all of these contrasts to real data. With a few
 # exceptions, the tests that I would compute by hand (or by other software)
 # will give the same t or F statistics. It is the contrast estimates that R
 # produces that I can't seem to understand.
 #
 # In searching for answers to this problem, I found a great PowerPoint slide
 # (I think by John Fox).
 # The slide pointed to the coefficients, said something like "these are
 # coeff. that no one could love", and then suggested looking at the means to
 # understand where they came from. I have stared and stared at his means and
 # then my means, but can't find a relationship.

 # The following code and output illustrates the problem.

 # Various examples of Anova using R

 dv <- c( 1.28,  1.35,  3.31,  3.06,  2.59,  3.25,  2.98,  1.53, -2.68,  2.64,
          1.26,  1.06,
         -1.18,  0.15,  1.36,  2.61,  0.66,  1.32,  0.73, -1.06,  0.24,  0.27,
          0.72,  2.28,
         -0.41, -1.25, -1.33, -0.47, -0.60, -1.72, -1.74, -0.77, -0.41, -1.20,
         -0.31, -0.74,
         -0.45,  0.54, -0.98,  1.68,  2.25, -0.19, -0.90,  0.78,  0.05,  2.69,
          0.15,  0.91,
          2.01,  0.40,  2.34, -1.80,  5.00,  2.27,  6.47,  2.94,  0.47,  3.22,
          0.01, -0.66)

 group <- factor(rep(1:5, each = 12))


 # Use treatment contrasts to compare each group to the first group.
 options(contrasts = c(contr.treatment,contr.poly))  # The default
 model2 - lm(dv ~ group)
 summary(model2)
  # Summary table is the same--as it should be
  # Intercept is Group 1 mean and other coeff. are deviations from that.
  # This is what I would expect.
  #summary(model1)
  #              Df Sum Sq Mean Sq F value    Pr(>F)
  #  group        4  62.46 15.6151  6.9005 0.0001415 ***
  #  Residuals   55 124.46  2.2629
  #Coefficients:
  #            Estimate Std. Error t value Pr(>|t|)
  #(Intercept)  1.80250    0.43425   4.151 0.000116 ***
  #group2      -1.12750    0.61412  -1.836 0.071772 .
  #group3      -2.71500    0.61412  -4.421 4.67e-05 ***
  #group4      -1.25833    0.61412  -2.049 0.045245 *
  #group5       0.08667    0.61412   0.141 0.888288


 # Use sum contrasts to compare each group against grand mean.
 options(contrasts = c("contr.sum", "contr.poly"))
 model3 <- lm(dv ~ group)
 summary(model3)

  # Again, this is as expected. Intercept is grand mean and others are
  # deviations from that.
  #Coefficients:
  #              Estimate Std. Error t value Pr(>|t|)
  #  (Intercept)   0.7997     0.1942   4.118 0.000130 ***
  #  group1        1.0028     0.3884   2.582 0.012519 *
  #  group2       -0.1247     0.3884  -0.321 0.749449
  #  group3       -1.7122     0.3884  -4.408 4.88e-05 ***
  #  group4       -0.2555     0.3884  -0.658 0.513399

 #SO FAR, SO GOOD

 # IF I wanted polynomial contrasts BY HAND I would use
 #    a(i) =  -2   -1   0   1   2   for linear contrast
 #            (or some linear function of this)
 #    Effect = Sum(a(i)M(i))    # where M = mean
 #    Effect(linear) = -2(1.805) -1(0.675) +0(-.912) +1(.544) +2(1.889) = 0.043
 #    SS(linear) = n*(Effect(linear)^2)/Sum((a(i)^2))  = 12(.043^2)/10 = .002
 #    F(linear) = SS(linear)/MS(error) = .002/2.263 = .001
 #    t(linear) = sqrt(.001) = .031

 # To do this in R I would use
 order.group <- ordered(group)
 model4 <- lm(dv ~ order.group)
 summary(model4)
 #  This gives:
 #Coefficients:
 #                  Estimate Std. Error t value Pr(>|t|)
 #    (Intercept)    0.79967    0.19420   4.118 0.000130 ***
 #    order.group.L  0.01344    0.43425   0.031 0.975422
 #    order.group.Q  2.13519    0.43425   4.917 8.32e-06 ***
 #    order.group.C  0.11015    0.43425   0.254 0.800703
 #    order.group^4 -0.79602    0.43425  -1.833 0.072202 .

 # The t value for linear is same as I got (as are others) but I don't
 # understand the estimates. The intercept is the grand mean, but I don't
 # see the relationship of other estimates to that or to the ones I get by
 # hand. My estimates are the sum of (coeff times means) i.e. 0 (intercept),
 # .0425, 7.989, .3483, -6.66
 # and these are not a linear (or other nice pretty) function of est. from R.

 # OK, let's break it down
 Means <- tapply(dv, order.group,

Re: [R] Creating publication-quality plots for use in Microsoft Word

2010-09-15 Thread Max Kuhn

You might want to check out the Reproducible Research task view:

   http://cran.r-project.org/web/views/ReproducibleResearch.html

There is a section on Microsoft formats, as well as other formats that
can be converted.

Max



On Wed, Sep 15, 2010 at 11:49 AM, Thomas Lumley
tlum...@u.washington.edu wrote:
 On Wed, 15 Sep 2010, dadrivr wrote:


 Thanks for your help, guys.  I'm looking to produce a high-quality plot
 (no
 jagged lines or other distortions) with a filetype that is accepted by
 Microsoft Word on a PC and that most journals will accept.  That's why I'd
 prefer to stick with JPEG, TIFF, PNG, or the like.  I'm not sure EPS would
 fly.

 One simple approach, which I use when I have to create graphics for MS
 Office while on a non-Windows platform is to use PNG and set the resolution
 and file size large enough.  At 300dpi or so the physics of ink on paper
 does all the antialiasing you need.

 Work out how big you want the graph to be, and use PNG with enough pixels to
 get at least 300dpi at that final size. You'll need to set the pointsize
 argument and it will help to set the resolution argument.
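
 A minimal sketch of the PNG approach described above, using the base
 png() device. The file name, the 5 x 4 inch final size and the pointsize
 of 10 are illustrative assumptions, not values from this thread:

```r
# Target size on the page (inches) and resolution; both are assumptions.
width_in  <- 5
height_in <- 4
dpi       <- 300

# Write to a temporary file so the sketch does not touch the working directory.
out_file <- file.path(tempdir(), "figure.png")

# units = "in" together with res gives width_in * dpi pixels across
# (1500 px here); pointsize controls the text size at that resolution.
png(out_file, width = width_in, height = height_in,
    units = "in", res = dpi, pointsize = 10)
plot(cars, main = "Stopping distance vs. speed")
dev.off()
```

 When the image is placed in Word at the same 5 x 4 inch size, it should
 print at roughly the requested 300 dpi.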

     -thomas

 Thomas Lumley
 Professor of Biostatistics
 University of Washington, Seattle

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 

Max


Re: [R] Reproducible research

2010-09-09 Thread Max Kuhn

A Reproducible Research CRAN task view was recently created:

   http://cran.r-project.org/web/views/ReproducibleResearch.html

I will be updating it with some of the information in this thread.

thanks,

Max



On Thu, Sep 9, 2010 at 11:41 AM, Matt Shotwell shotw...@musc.edu wrote:
 Well, the attachment was a dud. Try this:

 http://biostatmatt.com/R/markup_0.0.tar.gz

 -Matt

 On Thu, 2010-09-09 at 10:54 -0400, Matt Shotwell wrote:
 I have a little package I've been using to write template blog posts (in
 HTML) with embedded R code. It's quite small but very flexible and
 extensible, and aims to do something similar to Sweave and brew. In
 fact, the package is heavily influenced by the brew package, though
 implemented quite differently. It depends on the evaluate package,
 available in the CRAN. The tentatively titled 'markup' package is
 attached. After it's installed, see ?markup and the few examples in the
 inst/ directory, or just example(markup).

 -Matt

 On Thu, 2010-09-09 at 01:47 -0400, David Scott wrote:
  I am investigating some approaches to reproducible research. I need in
  the end to produce .html or .doc or .docx. I have used hwriter in the
  past but have had some problems with verbatim output from  R. Tables are
  also not particularly convenient.
 
  I am interested in R2HTML and R2wd in particular, and possibly odfWeave.
 
  Does anyone have sample documents using any of these approaches which
  they could let me have?
 
  David Scott
 
  _
 
  David Scott Department of Statistics
              The University of Auckland, PB 92019
              Auckland 1142,    NEW ZEALAND
  Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
  Email:      d.sc...@auckland.ac.nz,  Fax: +64 9 373 7018
 
  Director of Consulting, Department of Statistics
 


 --
 Matthew S. Shotwell
 Graduate Student
 Division of Biostatistics and Epidemiology
 Medical University of South Carolina





-- 

Max


Re: [R] createDataPartition

2010-09-09 Thread Max Kuhn

Trafim,

You'll get more answers if you adhere to the posting guide and tell us
your version information and other necessary details. For example, this
function is in the caret package (but nobody but me probably knows
that =]).

The first argument should be a vector of outcome values (not the
possible classes).

For the iris data, this means something like:

   createDataPartition(iris$Species)

if you were trying to predict the species. The function does
stratified splitting; the data are split into training and test sets
within each class, then the results are aggregated to get the entire
training set indicators. Setting a proportion per class won't do
anything.
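
To make the idea of stratified splitting concrete, here is a minimal
base R sketch of the general technique. It is not caret's implementation
of createDataPartition; the 70% proportion and the object names are
assumptions chosen for illustration:

```r
# Stratified split by hand: sample the same proportion within each class,
# then pool the sampled row numbers into one training-set index.
set.seed(1)
p <- 0.7

# Row numbers grouped by class (three groups of 50 for iris).
by_class <- split(seq_along(iris$Species), iris$Species)

# Sample within each class. Note that sample() treats a length-one vector
# as 1:n, so this sketch assumes every class has more than one row.
train_idx <- sort(unlist(
  lapply(by_class, function(idx) sample(idx, size = round(p * length(idx)))),
  use.names = FALSE
))

# Each class contributes the same share of its rows to the training set.
table(iris$Species[train_idx])
```

Because the proportion is applied inside every class, the class balance of
the training set follows the class balance of the outcome vector, which is
why a separate proportion per class has no effect.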

Look at the man page or the (4) package vignettes for examples.

Max

On Thu, Sep 9, 2010 at 7:52 AM, Trafim Vanishek rdapam...@gmail.com wrote:
 Dear all,

 does anyone know how to define the structure of the required samples using
 function createDataPartition, meaning proportions of different types of
 variable in the partition?
 Smth like this for iris data:

 createDataPartition(y = c(setosa = .5, virginica = .3, versicolor = .2),
 times = 10, p = .7, list = FALSE)

 Thanks a lot for your help.

 Regards,
 Trafim





-- 

Max


Re: [R] several odfWeave questions

2010-08-25 Thread Max Kuhn

Ben,

  1a. am I right in believing that odfWeave does not respect the
 'keep.source' option?  Am I missing something obvious?

I believe it does, since this gets passed directly to Sweave.

  1b. is there a way to set global options analogous to \SweaveOpts{}
 directives in Sweave? (I looked at odfWeaveControl, it doesn't seem to
 do it.)

Yes. There are examples of this in the 'examples' package directory.

  2. I tried to write a Makefile directive to process files from the
 command line:

 %.odt: %_in.odt
        $(RSCRIPT) -e "library(odfWeave); odfWeave(\"$*_in.odt\",\"$*.odt\");"

  This works, *but* the resulting output file gives a warning (The file
 'odftest2.odt' is corrupt and therefore cannot be opened.
 OpenOffice.org can try to repair the file ...).  Based on looking at
 the contents, it seems that a spurious/unnecessary 'Rplots.pdf' file is 
 getting
 created and zipped in with the rest of the archive; when I unzip, delete
 the Rplots.pdf file and re-zip, the ODT file opens without a warning.
 Obviously I could post-process but it would be nicer to find a
 workaround within R ...

Get the latest version from R-Forge. I haven't gotten this fix onto
CRAN yet (I've been on a caret streak lately).

  3. I find the requirement that all file paths be specified as absolute
 rather than relative paths somewhat annoying -- I understand the reason,
 but it goes against one practice that I try to encourage for
 reproducibility, which is *not* to use absolute file paths -- when
 moving a same set of data and analysis files across computers, it's hard
 to enforce them all ending up in the same absolute location, which then
 means that the recipient has to edit the ODT file.  It would be nice if
 there were hooks for read.table() and load() as there are for plotting
 and package/namespace loading -- then one could just copy them into the
 working directory on the fly.
   has anyone experienced this/thought of any workarounds?
  (I guess one solution is to zip any necessary source files into the archive 
 beforehand,
 as illustrated in the vignette.)

You can set the working directory with the (wait for it...) 'workDir'
argument. Using 'workDir = getwd()' will pack and unpack the files in
the current location and you wouldn't need to worry about setting the
path. I use the temp directory because I started overwriting files.

Max


Re: [R] odfWeave Issue.

2010-08-11 Thread Max Kuhn

 What does this mean?

It's impossible to tell. Read the posting guide and figure out all the
details that you left out. If we don't have more information, you
should have low expectations about the quality of any replies you might
get.

-- 

Max


Re: [R] UseR! 2010 - my impressions

2010-07-27 Thread Max Kuhn

Not to beat a dead horse...

I've found that I like the useR conferences more than most statistics
conferences. This isn't due to the difference in content, but the
difference in the audience and the environment.

For example, everyone is at useR because of their appreciation of R.
At most other conferences, there is a much wider focus of topics and
less group cohesion. Given this, I think that the environment is
more congenial. I've had many discussions with people that are in
completely different fields than myself (e.g. imaging, forestry,
physics, etc) that would be less likely to occur at other scientific
meetings.

Another difference between useR and the average (statistics)
conference is that the network effect is stronger. I believe that there is
a much higher likelihood that a random person is acquainted with a
different random attendee. This could be because we've used their
package, they run a local RUG or they are one of the principal people
who drive R (Uwe, Kurt, etc).

Anyway, well done.

Max


On Mon, Jul 26, 2010 at 11:49 AM, Tal Galili tal.gal...@gmail.com wrote:
 Dear Ravi - I echo everything you wrote, useR2010 was an amazing experience
 (for me, and for many others with whom I have spoken about it).
 Many thanks should go to the wonderful people who put their efforts into
 making this conference a reality (and Kate is certainly one of them).
 Thank you for expressing feelings I had using your own words.

 Best,
 Tal


 Contact
 Details:---
 Contact me: tal.gal...@gmail.com |  972-52-7275845
 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
 www.r-statistics.com (English)
 --




 On Sat, Jul 24, 2010 at 2:50 AM, Ravi Varadhan rvarad...@jhmi.edu wrote:

 Dear UseRs!,

 Everything about UseR! 2010 was terrific!  I really mean everything - the
 tutorials, invited talks, kaleidoscope sessions, focus sessions, breakfast,
 snacks, lunch, conference dinner, shuttle services, and the participants.
 The organization was fabulous.  NIST were gracious hosts, and provided top
 notch facilities.  The rousing speech by Antonio Possolo, who is the chief
 of Statistical Engineering Division at NIST, set the tempo for the entire
 conference.  Excellent invited lectures by Luke Tierney, Frank Harrell, Mark
 Handcock, Diethelm Wurtz, Uwe Ligges, and Fritz Leisch.  All the sessions
 that I attended had many interesting ideas and useful contributions.  During
 the whole time that I was there, I could not help but get the feeling that I
 am a part of something great.

 Before I end, let me add a few words about a special person.  This
 conference would not have been as great as it was without the tireless
 efforts of Kate Mullen.  The great thing about Kate is that she did so much
 without ever hogging the limelight.  Thank you, Kate and thank you NIST!

 I cannot wait for UseR!2011!

 Best,
 Ravi.

 

 Ravi Varadhan, Ph.D.
 Assistant Professor,
 Division of Geriatric Medicine and Gerontology
 School of Medicine
 Johns Hopkins University

 Ph. (410) 502-2619
 email: rvarad...@jhmi.edu







-- 

Max


Re: [R] Random Forest - Strata

2010-07-27 Thread Max Kuhn

The index indicates which samples should go into the training set.
However, you are using out of bag sampling, so it would use the whole
training set and return the OOB error (instead of the error estimates
that would be produced by resampling via the index).

Which do you want? OOB estimates or other estimates? Based on your
previous email, I figured you would have an index list with three sets
of sample indices for sites A+B, sites A+C and sites B+C. In this way
you would do three resamples: the first fits using data from sites A
and B, then predicts on C (and so on). The resampled error
estimates would then be based on the average of the three hold-out sets
(actually hold-out sites). OOB error doesn't sound like what you want.
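
A minimal base R sketch of how such an index list could be built. The
site vector and the element names are hypothetical; in practice the site
labels would come from the real data:

```r
# Hypothetical site labels, one per row of the training data.
site <- rep(c("A", "B", "C"), times = c(4, 3, 5))

# One resample per held-out site: each element holds the row numbers used
# for fitting, i.e. every row that is NOT from the held-out site.
sites <- sort(unique(site))
index <- lapply(sites, function(held_out) which(site != held_out))
names(index) <- paste0("holdout_", sites)

# Rows from sites A and B, used to fit the model that predicts site C.
index$holdout_C
```

A list like this is what would go to the index argument of
trainControl(), paired with a resampling method rather than "oob", so
that the error estimates come from the held-out sites.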

Max

On Tue, Jul 27, 2010 at 2:46 PM, Coll gbco...@gmail.com wrote:

 Thanks for all the help.

 I had tried using the index in caret to try to dictate which rows of the
 sample would be used in each of the tree building in RF. (e.g. use all data
 from sites A and B for training, hold out all data from site C for testing etc)

 However after running, when I cross-checked the index that goes to train
 function and the inbag in the resulting randomForest object, I found the
 two didn't match.

 Shown as below:

 data(iris)
 tmpIrisIndex <- createDataPartition(iris$Species, p=0.632, times = 10)
 head(tmpIrisIndex,3)
 [[1]]
  [1]   1   2   3   7  10  11  12  13  16  18  20  22  24  25  26  27  28  29
 31
 [20]  34  35  36  37  38  39  40  41  43  46  47  48  50  52  53  55  56  57
 58
 [39]  61  64  65  66  67  68  69  71  74  75  76  77  79  82  83  84  85  86
 88
 [58]  90  91  92  94  96  98  99 102 103 104 106 108 109 111 112 113 114 115
 116
 [77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146
 147
 [96] 150

 [[2]]
  [1]   1   3   6   7   8  10  12  13  14  16  18  20  21  22  23  24  26  27
 28
 [20]  29  30  32  34  35  36  38  42  44  46  47  48  50  51  53  54  55  58
 60
 [39]  61  62  67  68  69  70  72  73  74  76  77  79  81  82  83  85  86  88
 89
 [58]  90  92  93  95  97  99 100 103 104 105 107 108 109 111 112 113 114 117
 119
 [77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145
 147
 [96] 149

 [[3]]
  [1]   1   5   7   9  10  11  12  14  18  20  21  22  23  24  26  29  30  31
 33
 [20]  34  35  36  37  38  39  40  44  45  46  47  48  49  51  52  53  54  56
 58
 [39]  61  63  65  66  69  70  72  74  75  76  77  78  79  80  82  83  85  86
 87
 [58]  90  91  92  93  94  98 100 102 103 105 106 107 109 110 113 114 115 116
 117
 [77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142
 146
 [96] 150

 irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
 rf.iris.obj <- train(Species~., data= iris, method = "rf", ntree = 10,
 keep.inbag = TRUE, trControl = irisTrControl)
 Fitting: mtry=2
 Fitting: mtry=3
 Fitting: mtry=4
 head(rf.iris.obj$finalModel$inbag,20)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  [1,]    1    0    1    0    0    0    1    0    1     1
  [2,]    1    1    1    1    1    0    1    0    1     0
  [3,]    1    1    1    0    0    1    1    0    0     0
  [4,]    1    0    1    0    1    1    0    1    0     1
  [5,]    0    1    1    1    1    1    0    1    0     1
  [6,]    1    1    0    1    0    0    1    1    1     0
  [7,]    1    1    0    0    1    1    0    0    0     0
  [8,]    1    1    1    1    1    0    1    1    1     1
  [9,]    1    1    0    1    0    1    0    1    1     0
 [10,]    1    1    1    0    1    1    0    0    0     1
 [11,]    1    1    1    1    1    1    1    0    1     0
 [12,]    1    1    1    1    1    0    1    0    1     1
 [13,]    1    0    1    1    1    1    1    1    0     1
 [14,]    0    1    1    1    0    1    0    0    0     0
 [15,]    1    1    1    1    1    1    1    1    1     0
 [16,]    1    1    0    0    0    0    1    0    1     1
 [17,]    1    0    1    0    0    0    1    1    0     1
 [18,]    1    0    1    1    1    1    1    1    1     1
 [19,]    1    0    1    0    1    1    1    0    1     1
 [20,]    1    0    1    0    1    1    1    0    1     0

 My understanding is the 1st tree in the RF should be built with
 tmpIrisIndex[1] i.e. 1   2   3   7  10  11  12  13  ... ?
 But the Inbag in the resulting forest is showing it is using 1 2 3 4 6 7 8
 9... for inbag in 1st tree?

 Why the index passed to train does not match what got from inbag in the rf
 object? Or I had looked to the wrong place to check this?

 Any help / comments would be appreciated. Thanks a lot.

 Regards,
 Coll



 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Random-Forest-Strata-tp2295731p2303958.html
 Sent from the R help mailing list archive at Nabble.com.
