Dylan Beaudette wrote:
> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>> Note that even though the ROC curve as a whole is an interesting
>> 'statistic' (its area is a linear translation of the
>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>> statistics), each individual point on it is an improper scoring rule,
>> i.e., a rule that is optimized by fitting an inappropriate model.
>> Using curves to select cutoffs is a low-precision and arbitrary
>> operation, and the cutoffs do not replicate from study to study.
>> Probably the worst problem with drawing an ROC curve is that it tempts
>> analysts to try to find cutoffs where none really exist, and it makes
>> analysts ignore the whole field of decision theory.
>>
>> Frank Harrell
>
> Frank,
>
> This thread has caught my attention for a couple of reasons, possibly
> related to my novice-level experience.
>
> 1. In a logistic regression study, where I am predicting the
> probability of the response being 1 (for example), there exists a
> continuum of probability values, and a finite number of {1,0}
> realities, whether I look within the original data set or at a new
> 'verification' data set. I understand that drawing a line through the
> probabilities returned from the logistic regression is a loss of
> information, but there are times when a 'hard' decision requiring a
> prediction of {1,0} is required. I have found that the ROCR package
> (not necessarily the ROC curve) can be useful in identifying the
> probability cutoff where accuracy is maximized. Is this an unreasonable
> way of using logistic regression as a predictor?
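A minimal sketch of the cutoff search described above, using the ROCR
package. The simulated data and the names d, fit, p, and pred are
hypothetical illustrations, not from this thread:

  library(ROCR)

  ## hypothetical data and logistic model, for illustration only
  set.seed(1)
  d <- data.frame(x = rnorm(200))
  d$y <- rbinom(200, 1, plogis(-1 + 2 * d$x))
  fit <- glm(y ~ x, data = d, family = binomial)

  ## predicted P[Y=1]
  p <- predict(fit, type = "response")

  ## accuracy as a function of the cutoff on P[Y=1]
  pred <- prediction(p, d$y)
  acc  <- performance(pred, measure = "acc")

  ## cutoff at which overall accuracy is maximized
  acc@x.values[[1]][which.max(acc@y.values[[1]])]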
Logistic regression (with suitable attention to not assuming linearity
and to avoiding overfitting) is a great way to estimate P[Y=1]. Given
good predicted P[Y=1] and utilities (losses, costs) for incorrect
positive and negative decisions, an optimal decision is one that
maximizes expected utility. The ROC curve does not play a direct role in
this regard. If per-subject utilities are not available, the analyst may
make various assumptions about utilities (including the unreasonable but
often used assumption that utilities do not vary over subjects) to find
a cutoff on P[Y=1]. A very nice feature of P[Y=1] is that error
probabilities are self-contained. For example, if P[Y=1] = 0.02 for a
single subject and you predict Y=0, the probability of an error is 0.02
by definition. One doesn't need to compute an overall error probability
over the whole distribution of subjects' risks. If the cost of a false
negative is C, the expected cost is 0.02*C in this example.

> 2. The ROC curve can be a helpful way of communicating false positives
> / false negatives to other users who are less familiar with the output
> and interpretation of logistic regression.

More useful than that is a rigorous calibration curve estimate, to
demonstrate the faithfulness of predicted P[Y=1], together with a
histogram showing the distribution of predicted P[Y=1]. Models that put
a lot of predictions near 0 or 1 are the most discriminating.
Calibration curves and risk distributions are easier to explain than ROC
curves. Too often a statistician will solve for a cutoff on P[Y=1],
imposing her own utility function without querying any subjects.

> 3. I have been using the area under the ROC curve, Kendall's tau, and
> Cohen's kappa to evaluate the accuracy of a logistic regression based
> prediction, the last two statistics based on some probability cutoff
> identified beforehand.

ROC area (equivalent to the Wilcoxon-Mann-Whitney statistic and to
Somers' Dxy rank correlation between predicted P[Y=1] and Y) is a
measure of pure discrimination, not a measure of accuracy per se. Rank
correlation (concordance) measures do not require the use of cutoffs.

> How does the topic of decision theory relate to some of the
> circumstances described above? Is there a better way to do some of
> these things?

See above regarding expected losses/utilities.

Good questions.

Frank
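A minimal sketch of the expected-utility decision rule and of the
calibration/discrimination summaries mentioned above, continuing the
hypothetical objects p and d from the first sketch. The costs Cfp and
Cfn are assumed, subject-independent values, and val.prob is from the
Design package (later renamed rms):

  ## With cost Cfp for a false positive, Cfn for a false negative, and
  ## zero cost for correct decisions, expected cost is minimized by
  ## predicting Y=1 exactly when P[Y=1] > Cfp / (Cfp + Cfn).
  Cfp <- 1
  Cfn <- 4
  cutoff   <- Cfp / (Cfp + Cfn)   # 0.2 with these assumed costs
  decision <- as.numeric(p > cutoff)

  ## calibration curve plus summary indexes (Dxy, C = ROC area, Brier)
  library(Design)                 # later renamed 'rms'
  val.prob(p, d$y)

  ## distribution of predicted risks; mass near 0 and 1 indicates a
  ## discriminating model
  hist(p, breaks = 20, xlab = "Predicted P[Y=1]", main = "")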
>> [EMAIL PROTECTED] wrote:
>>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
>>>
>>> There is a lot of help. Trying help.search("ROC curve") gave
>>> Help files with alias or concept or title matching 'ROC curve' using
>>> fuzzy matching:
>>>
>>> granulo(ade4)            Granulometric Curves
>>> plot.roc(analogue)       Plot ROC curves and associated diagnostics
>>> roc(analogue)            ROC curve analysis
>>> colAUC(caTools)          Column-wise Area Under ROC Curve (AUC)
>>> DProc(DPpackage)         Semiparametric Bayesian ROC curve analysis
>>> cv.enet(elasticnet)      Computes K-fold cross-validated error curve
>>>                          for elastic net
>>> ROC(Epi)                 Function to compute and draw ROC-curves
>>> lroc(epicalc)            ROC curve
>>> cv.lars(lars)            Computes K-fold cross-validated error curve
>>>                          for lars
>>> roc.demo(TeachingDemos)  Demonstrate ROC curves by interactively
>>>                          building one
>>>
>>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
>>>
>>> HTH; see the help and examples, those will suffice.
>>>
>>> Regards,
>>> Gaurav Yadav
>>> Assistant Manager, CCIL, Mumbai (India)
>>> Mob: +919821286118  Email: [EMAIL PROTECTED]
>>> Bhagavad Gita: Man is made by his Belief, as He believes, so He is
>>>
>>> "Rithesh M. Mohan" <[EMAIL PROTECTED]>
>>> Sent by: [EMAIL PROTECTED]
>>> 07/26/2007 11:26 AM
>>> To: <R-help@stat.math.ethz.ch>
>>> Subject: [R] ROC curve in R
>>>
>>> Hi,
>>>
>>> I need to build an ROC curve in R. Can you please provide data steps
>>> / code or guide me through it?
>>>
>>> Thanks and Regards
>>>
>>> Rithesh M Mohan
>>
>> --
>> Frank E Harrell Jr
>> Professor and Chair, Department of Biostatistics
>> School of Medicine, Vanderbilt University
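Finally, a minimal sketch addressing the question that opened the
thread, how to build an ROC curve in R, reusing the hypothetical p,
pred, and d objects from the first sketch (somers2 is in the Hmisc
package):

  library(ROCR)

  ## ROC curve: true positive rate against false positive rate, traced
  ## over all cutoffs on the predicted probabilities
  roc <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(roc)
  abline(0, 1, lty = 2)           # chance line

  ## area under the curve, and its identity with Somers' Dxy
  performance(pred, measure = "auc")@y.values[[1]]
  library(Hmisc)
  somers2(p, d$y)                 # C (= ROC area) and Dxy = 2*(C - 0.5)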