Ramón Casero Cañas wrote:
> Michael Dewey wrote:
>
>>At 17:12 09/04/06, Ramón Casero Cañas wrote:
>>
>>I am not sure what the problem you really want to solve is but it seems
>>that
>>a) abnormality is rare
>>b) the logistic regression predicts it to be rare.
>>If you want a prediction system why not try different cut-offs (other
>>than 0.5 on the probability scale) and perhaps plot sensitivity and
>>specificity to help to choose a cut-off?
>
>
> Thanks for your suggestions, Michael. It took me some time to figure out
> how to do this in R (as trivial as it may be for others). Some comments
> about what I've done follow, in case anyone is interested.
>
> The problem is a) abnormality is rare (Prevalence=14%) and b) there is
> not much difference in the independent variable between abnormal and
> normal. So the logistic regression model predicts that P(abnormal) <=
> 0.4. I got confused by this, as I expected a cut-off point of P=0.5 to
> decide between normal/abnormal. But you are right that another
> cut-off point can be chosen.
>
> For a cut-off of e.g. P(abnormal)=0.15, Sensitivity=65% and
> Specificity=52%. They are pretty bad, although for clinical purposes I
> would say that Positive/Negative Predictive Values are more interesting.
> But then PPV=19% and NPV=90%, which isn't great. As an overall test of
> how good the model is for classification I have computed the area under
> the ROC curve, following your suggestion of using Sensitivity and
> Specificity.
>
> I couldn't find how to do this directly with R, so I implemented it
> myself (it's not difficult but I'm new here). I tried with package ROCR,
> but apparently it doesn't cover binary outcomes.
>
> The area under the ROC curve is 0.64, so I would say that even though
> the model seems to fit the data, it just doesn't allow acceptable
> discrimination, no matter what the cut-off point.
>
>
> I have also studied the effect of low prevalence. For this, I used
> option ran.gen in the boot function (package boot) to define a function
> that resamples the data so that it balances abnormal and normal cases.
>
> A logistic regression model is fitted to each replicate, as in a
> parametric bootstrap, and the fits are used to compute the bias of the
> estimates of the model coefficients, beta0 and beta1. This shows very
> small bias for beta1, but a rather large bias for beta0.
>
> So I would say that prevalence has an effect on beta0, but not beta1.
> This is good, because a common measure like the odds ratio depends only
> on beta1.
>
> Cheers,
>
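For anyone following along, the cut-off sweep and the area under the ROC curve described in the quoted message can be sketched roughly as below. This is only an illustration in plain Python, not the poster's R code: the fitted probabilities and labels are made up, and the function names (confusion_counts, diagnostics, auc) are hypothetical. The AUC here uses the pairwise (Mann-Whitney) form, i.e. the fraction of abnormal/normal pairs in which the abnormal case gets the higher predicted probability.

```python
# Illustrative sketch: made-up fitted probabilities, not the poster's data.

def confusion_counts(probs, labels, cutoff):
    """Return (TP, FP, TN, FN), predicting 'abnormal' when prob >= cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        if p >= cutoff:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def diagnostics(probs, labels, cutoff):
    """Sensitivity, specificity, PPV and NPV at one cut-off."""
    tp, fp, tn, fn = confusion_counts(probs, labels, cutoff)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(probs, labels):
    """Area under the ROC curve as the share of (abnormal, normal) pairs
    ranked correctly; ties count as half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical fitted P(abnormal) and true status (1 = abnormal).
probs = [0.05, 0.10, 0.12, 0.18, 0.20, 0.22, 0.30, 0.35]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

for cutoff in (0.10, 0.15, 0.25):
    print(cutoff, diagnostics(probs, labels, cutoff))
print("AUC:", auc(probs, labels))
```

Note that the AUC does not depend on any cut-off, which is why it stays low however the threshold is moved.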
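The observation that balancing the classes moves beta0 but not beta1 can be checked by hand in a simplified setting. The sketch below assumes a single binary predictor x (the original analysis presumably used a continuous one), because then the maximum likelihood estimates have a closed form from the 2x2 table. It also replaces random resampling by its expected effect, multiplying the abnormal counts by a factor w. The counts are hypothetical, chosen only to give 14% prevalence.

```python
import math


def logistic_coefficients(abn_x0, nor_x0, abn_x1, nor_x1):
    """Closed-form ML estimates for logit P(abnormal) = beta0 + beta1 * x
    with binary x, computed from the 2x2 table of counts."""
    beta0 = math.log(abn_x0 / nor_x0)          # log odds when x = 0
    beta1 = math.log(abn_x1 / nor_x1) - beta0  # log odds ratio
    return beta0, beta1


# Hypothetical counts: 140 abnormal out of 1000 (prevalence 14%).
abn_x0, nor_x0 = 60, 540
abn_x1, nor_x1 = 80, 320

b0, b1 = logistic_coefficients(abn_x0, nor_x0, abn_x1, nor_x1)

# Balancing abnormal and normal cases: on average each abnormal case is
# drawn w times as often, so the abnormal counts are scaled by w.
w = (nor_x0 + nor_x1) / (abn_x0 + abn_x1)
b0_bal, b1_bal = logistic_coefficients(abn_x0 * w, nor_x0, abn_x1 * w, nor_x1)

print("original: beta0 = %.3f, beta1 = %.3f" % (b0, b1))
print("balanced: beta0 = %.3f, beta1 = %.3f" % (b0_bal, b1_bal))
print("shift in beta0 = %.3f, log(w) = %.3f" % (b0_bal - b0, math.log(w)))
```

In this simplified case the slope is untouched and the intercept moves by exactly log(w). The apparent "bias" in beta0 is therefore the known offset from oversampling one outcome class, and it is the same kind of shift that appears under case-control sampling.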
This makes me think you are trying to go against maximum likelihood to
optimize an improper criterion. Forcing a single cutpoint to be chosen
seems to be at the heart of your problem. There's nothing wrong with
using probabilities and letting the utility possessor make the final
decision.

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html