On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote:
> Dylan Beaudette wrote:
> > On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
> >> Note that even though the ROC curve as a whole is an interesting
> >> 'statistic' (its area is a linear translation of the
> >> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
> >> statistics), each individual point on it is an improper scoring rule,
> >> i.e., a rule that is optimized by fitting an inappropriate model.
> >> Using curves to select cutoffs is a low-precision and arbitrary
> >> operation, and the cutoffs do not replicate from study to study.
> >> Probably the worst problem with drawing an ROC curve is that it
> >> tempts analysts to try to find cutoffs where none really exist, and
> >> it makes analysts ignore the whole field of decision theory.
> >>
> >> Frank Harrell
> >
> > Frank,
> >
> > This thread has caught my attention for a couple of reasons, possibly
> > related to my novice-level experience.
> >
> > 1. In a logistic regression study, where I am predicting the
> > probability of the response being 1 (for example), there exists a
> > continuum of probability values, and a finite number of {1,0}
> > realities when I look either within the original data set or within a
> > new 'verification' data set. I understand that drawing a line through
> > the probabilities returned from the logistic regression is a loss of
> > information, but there are times when a 'hard' decision predicting
> > {1,0} is needed. I have found that the ROCR package (not necessarily
> > the ROC curve) can be useful in identifying the probability cutoff
> > where accuracy is maximized, as in the sketch below. Is this an
> > unreasonable way of using logistic regression as a predictor?
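> > (Concretely, the ROCR idiom I have been using looks something like
> > the following, where 'p' and 'y' are stand-ins for my vector of
> > predicted probabilities and the observed {0,1} outcomes:)
> >
> >   library(ROCR)
> >   pred <- prediction(p, y)          # p: predicted P[Y=1]; y: observed {0,1}
> >   acc  <- performance(pred, "acc")  # classification accuracy at each cutoff
> >   i    <- which.max(acc@y.values[[1]])
> >   acc@x.values[[1]][i]              # the cutoff that maximizes accuracy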
Thanks for the detailed response Frank. My follow-up questions are below:

> Logistic regression (with suitable attention to not assuming linearity
> and to avoiding overfitting) is a great way to estimate P[Y=1]. Given
> good predicted P[Y=1] and utilities (losses, costs) for incorrect
> positive and negative decisions, an optimal decision is one that
> optimizes expected utility. The ROC curve does not play a direct role
> in this regard.

Ok.

> If per-subject utilities are not available, the analyst may make
> various assumptions about utilities (including the unreasonable but
> often used assumption that utilities do not vary over subjects) to
> find a cutoff on P[Y=1].

Can you elaborate on what exactly a "per-subject utility" is? In my
case, I am trying to predict the occurrence of specific soil features
based on two predictor variables: one continuous, the other categorical.
Thus far my evaluation of how well this method works has been based on
how often I can correctly predict a (categorical) quality.

> A very nice feature of P[Y=1] is that error probabilities are
> self-contained. For example, if P[Y=1] = .02 for a single subject and
> you predict Y=0, the probability of an error is .02 by definition. One
> doesn't need to compute an overall error probability over the whole
> distribution of subjects' risks. If the cost of a false negative is C,
> the expected cost is .02*C in this example.

Interesting. The hang-up that I am having is that I need to predict from
{0,1}, as the direct users of this information are not currently
interested in raw probabilities. As far as I know, in order to predict a
class from a probability I need to use a cutoff... How else can I
accomplish this without imposing a cutoff on the entire dataset? One
thought: identify a cutoff for each level of the categorical predictor
term in the model... (?)

> > 2. The ROC curve can be a helpful way of communicating false
> > positives / false negatives to other users who are less familiar
> > with the output and interpretation of logistic regression.
>
> What is more useful than that is a rigorous calibration curve estimate
> to demonstrate the faithfulness of predicted P[Y=1] and a histogram
> showing the distribution of predicted P[Y=1].

Ok. I can make that histogram -- how would one go about making the
'rigorous calibration curve'? Note that I have a training set, from
which the model is built, and a smaller testing set for evaluation.

> Models that put a lot of predictions near 0 or 1 are the most
> discriminating. Calibration curves and risk distributions are easier
> to explain than ROC curves.

By 'risk distributions' do you mean said histogram?

> Too often a statistician will solve for a cutoff on P[Y=1], imposing
> her own utility function without querying any subjects.

In this case I have picked the cutoff that resulted in the smallest
number of incorrectly classified observations, or the highest kappa /
tau statistics -- the results were very close.

> > 3. I have been using the area under the ROC curve, Kendall's tau,
> > and Cohen's kappa to evaluate the accuracy of a logistic regression
> > based prediction, the last two statistics based on some probability
> > cutoff identified beforehand.
>
> ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank
> correlation between pred. P[Y=1] and Y) is a measure of pure
> discrimination, not a measure of accuracy per se. Rank correlation
> (concordance) measures do not require the use of cutoffs.

Ok.
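To make sure I understand the mechanics: would something along these
lines give the calibration curve and histogram you describe? (A minimal
sketch, assuming a fitted glm 'fit' and a held-out data frame 'test'
with binary outcome 'test$y' -- those names are placeholders for my
actual objects; val.prob() is from your Design package.)

  library(Design)   # provides val.prob()

  p.test <- predict(fit, newdata = test, type = "response")

  ## calibration curve plus summary indexes (Dxy, C, Brier score, ...)
  val.prob(p.test, test$y)

  ## distribution of predicted risks
  hist(p.test, breaks = 20, xlab = "Predicted P[Y=1]")

  ## and, if one does assume constant costs for false positives (C.fp)
  ## and false negatives (C.fn), expected loss is minimized by predicting
  ## Y=1 whenever P[Y=1] > C.fp / (C.fp + C.fn); e.g. with made-up costs:
  C.fp <- 1; C.fn <- 4
  yhat <- as.numeric(p.test > C.fp / (C.fp + C.fn))   # cutoff = 0.2 here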
Hopefully I am not abusing the kappa and tau statistics too badly by
using them to evaluate a probability cutoff... (?)

> > How does the topic of decision theory relate to some of the
> > circumstances described above? Is there a better way to do some of
> > these things?
>
> See above re: expected losses/utilities.
>
> Good questions.
>
> Frank

Thanks for the feedback.

Cheers,
Dylan

> > Cheers,
> >
> > Dylan
> >
> >> [EMAIL PROTECTED] wrote:
> >>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
> >>>
> >>> There is a lot of help. Trying help.search("ROC curve") gave:
> >>>
> >>> Help files with alias or concept or title matching 'ROC curve'
> >>> using fuzzy matching:
> >>>
> >>> granulo(ade4)            Granulometric Curves
> >>> plot.roc(analogue)       Plot ROC curves and associated diagnostics
> >>> roc(analogue)            ROC curve analysis
> >>> colAUC(caTools)          Column-wise Area Under ROC Curve (AUC)
> >>> DProc(DPpackage)         Semiparametric Bayesian ROC curve analysis
> >>> cv.enet(elasticnet)      Computes K-fold cross-validated error curve
> >>>                          for elastic net
> >>> ROC(Epi)                 Function to compute and draw ROC-curves
> >>> lroc(epicalc)            ROC curve
> >>> cv.lars(lars)            Computes K-fold cross-validated error curve
> >>>                          for lars
> >>> roc.demo(TeachingDemos)  Demonstrate ROC curves by interactively
> >>>                          building one
> >>>
> >>> HTH. See the help and examples; those will suffice.
> >>>
> >>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
> >>>
> >>> Regards,
> >>>
> >>> Gaurav Yadav
> >>> +++++++++++
> >>> Assistant Manager, CCIL, Mumbai (India)
> >>> Mob: +919821286118  Email: [EMAIL PROTECTED]
> >>> Bhagavad Gita: Man is made by his Belief, as He believes, so He is
> >>>
> >>> "Rithesh M. Mohan" <[EMAIL PROTECTED]>
> >>> Sent by: [EMAIL PROTECTED]
> >>> 07/26/2007 11:26 AM
> >>>
> >>> To: <R-help@stat.math.ethz.ch>
> >>> cc:
> >>> Subject: [R] ROC curve in R
> >>>
> >>> Hi,
> >>>
> >>> I need to build a ROC curve in R. Can you please provide data
> >>> steps / code, or guide me through it?
> >>>
> >>> Thanks and Regards
> >>>
> >>> Rithesh M Mohan
> >>>
> >>> [[alternative HTML version deleted]]
> >>
> >> --
> >> Frank E Harrell Jr   Professor and Chair         School of Medicine
> >>                      Department of Biostatistics Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.