Re: [R] Proper / Improper scoring Rules

Donald Catanzaro, PhD Wed, 12 Aug 2009 23:06:21 -0700

Hi All,

I have done more background research (including Frank's book) so I feel that my second question is answered. However, as a novice R user I still have the following problem: accessing the output of predict. So, simplifying my question, using the example provided in the Design package (http://lib.stat.cmu.edu/S/Harrell/help/Design/html/predict.lrm.html) I might do something like:

# See help for predict.Design for several binary logistic
# regression examples

# Examples of predictions from ordinal models
library(Design)   # provides lrm() and rcs()
set.seed(1)
y <- factor(sample(1:3, 400, TRUE), 1:3, c('good','better','best'))
x1 <- runif(400)
x2 <- runif(400)
f <- lrm(y ~ rcs(x1,4)*x2)
predict(f, type="fitted.ind")[1:10,]   #gets Prob(better) and all others

     y=good  y=better    y=best
1  0.3124704 0.3631544 0.3243752
2  0.3676075 0.3594685 0.2729240
3  0.2198274 0.3437416 0.4364309
4  0.3063463 0.3629658 0.3306879
5  0.5171323 0.3136088 0.1692590
6  0.3050115 0.3629071 0.3320813
7  0.3532452 0.3612928 0.2854620
8  0.2933928 0.3621220 0.3444852
9  0.3068595 0.3629867 0.3301538
10 0.6214710 0.2612164 0.1173126

d <- data.frame(x1=.5,x2=.5)
predict(f, d, type="fitted")        # Prob(Y>=j) for new observation

y>=better   y>=best
0.6906593 0.3275849

predict(f, d, type="fitted.ind")    # Prob(Y=j)

   y=good  y=better    y=best
0.3093407 0.3630744 0.3275849


So now if I wanted to do

out <- predict(f, d, type="fitted.ind")

out

   y=good  y=better    y=best
0.3093407 0.3630744 0.3275849

out$"y=better"


Error in out$"y=better" : $ operator is invalid for atomic vectors

y=better is the max, so how do I create something that says that? (This is not exactly what I want to do, but close enough to help me figure out what R code I need to accomplish the task.)
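One reading of the error above: predict() returns a named numeric vector here, not a list or data frame, so $ does not apply, but [ ] with the name does. A minimal base-R sketch, with the values from the output above typed in by hand so that it runs without the Design package:

```r
# Values copied from the printed output; in practice `out` would come from predict()
out <- c("y=good" = 0.3093407, "y=better" = 0.3630744, "y=best" = 0.3275849)

out["y=better"]      # index a named vector with [ ] rather than $
out[["y=better"]]    # same value, with the name dropped

best <- which.max(out)         # position of the largest probability; keeps its name
names(best)                    # label of the most probable category
sub("^y=", "", names(best))    # the same label with the "y=" prefix stripped
```

which.max() returns the position and carries the element's name along, so both the column index and its label are available from one call.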


I can push the predictions out to a vector:

out.vector <- as.vector(predict(f, d, type="fitted.ind"))

out.vector


[1] 0.3093407 0.3630744 0.3275849

which gets me part of the way because I can find out max(out.vector), but I still need to know what column the max is in. I think the problem is that I don't know how to manipulate data frames and vectors in R and need some guidance.
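Note that as.vector() drops the names, which is why the column information disappears. When there are many rows of predictions, one possible approach is to keep the matrix and ask for the column of each row's maximum. A sketch in base R, using the first three rows of the fitted.ind output shown earlier, typed in by hand:

```r
# First three rows of the printed fitted.ind matrix (copied by hand)
p <- rbind(c(0.3124704, 0.3631544, 0.3243752),
           c(0.3676075, 0.3594685, 0.2729240),
           c(0.2198274, 0.3437416, 0.4364309))
colnames(p) <- c("y=good", "y=better", "y=best")

idx  <- max.col(p, ties.method = "first")   # column index of each row's maximum
pred <- colnames(p)[idx]                    # corresponding category labels

data.frame(p, predicted = pred, check.names = FALSE)
```

max.col() breaks ties at random by default; ties.method = "first" is used here only to make the result reproducible.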

-Don

Don Catanzaro, PhD
Landscape Ecologist

dgcatanz...@gmail.com
16144 Sigmond Lane
Lowell, AR 72745
479-751-3616



Frank E Harrell Jr wrote:

Donald Catanzaro, PhD wrote:
Hi All,
I am working on some ordinal logistic regressions using LRM in the Design package. My response variable has three categories (1,2,3), and after creating my model and calling predict to get some values, I wanted to use a simple .5 cut-off to classify my probabilities into the categories.
I had two questions:
a) first, I am having trouble directly accessing the probabilities, which may have more to do with my lack of experience with R
For instance, my calls
> ologit.three.NoPerFor <- lrm(Threshold.Three ~ TECI , data=CLD, na.action=na.pass)
> CLD$Threshold.Predict.Three.NoPerFor <- predict(ologit.three.NoPerFor, newdata=CLD, type="fitted.ind")
> CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three=1> .5] <- 1
Error: unexpected '=' in "CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three="
 >
 >
produce an error message and it seems as if R does not like the equal sign at all. So how does one access the probabilities so I can classify them into the categories of 1,2,3 so I can look at performance of my model ?
use == to check equality
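A rough sketch of how the cut-off classification could be written once the probabilities are held as a matrix and its columns are selected by name. The matrix probs and its column names are hypothetical stand-ins, not the actual output of the model above:

```r
# Hypothetical stand-in for the matrix returned by predict(..., type = "fitted.ind")
probs <- cbind("y=1" = c(0.70, 0.20, 0.30),
               "y=2" = c(0.20, 0.55, 0.35),
               "y=3" = c(0.10, 0.25, 0.35))

cats <- rep(NA_integer_, nrow(probs))
cats[probs[, "y=1"] > 0.5] <- 1L    # select a column by name, then compare
cats[probs[, "y=2"] > 0.5] <- 2L
cats[probs[, "y=3"] > 0.5] <- 3L
cats    # the third row stays NA: none of its probabilities exceeds 0.5

x <- c(1, 2, 3, 1)
x == 1  # == compares element by element; a single = is not a comparison
```

With three categories a .5 cut-off can leave observations unclassified, as the third row shows.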
b) which leads me to my next question. I thought that simply calculating the percent correct off of my predictions would be sufficient to look at performance, but since my question is very much in line with this thread http://tolstoy.newcastle.edu.au/R/e4/help/08/04/8987.html I am not so sure anymore. I am afraid I did not understand Frank Harrell's last suggestion regarding improper scoring rule - can someone point me to some internet resources that I might be able to review to see why my approach would not be valid ?
Percent correct will give you misleading answers and is game-able. It is also ultra-high-variance. Though not a truly proper scoring rule, Somers' Dxy rank correlation (generalization of ROC area) is helpful. Better still: use the log-likelihood and related quantities (deviance, adequacy index as described in my book).
Frank
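To illustrate the contrast between percent correct and the log-likelihood, here is a small base-R sketch. The observed categories and both sets of predicted probabilities are made up for illustration and do not come from the thread:

```r
# Made-up observed categories and two hypothetical sets of predicted probabilities
y  <- c(1, 2, 3, 1)
pA <- rbind(c(0.90, 0.05, 0.05), c(0.05, 0.90, 0.05),
            c(0.05, 0.05, 0.90), c(0.90, 0.05, 0.05))   # confident predictions
pB <- rbind(c(0.40, 0.30, 0.30), c(0.30, 0.40, 0.30),
            c(0.30, 0.30, 0.40), c(0.40, 0.30, 0.30))   # barely better than guessing

# Share of observations whose most probable category equals the observed one
pct_correct <- function(p, y) mean(max.col(p, ties.method = "first") == y)

# Sum of the log predicted probabilities of the observed categories
log_lik <- function(p, y) sum(log(p[cbind(seq_along(y), y)]))

pct_correct(pA, y); pct_correct(pB, y)   # the same for both sets
log_lik(pA, y); log_lik(pB, y)           # higher (closer to 0) for pA
```

Both sets of predictions classify every observation correctly, so percent correct cannot tell them apart, while the log-likelihood rewards the better-separated probabilities in pA.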


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
