Tim Howard wrote:
> Dr. Harrell,
> Thank you for your response. I had noted, and appreciate, your perspective on
> ROC in past listserv entries and am glad to have an opportunity to delve a
> little deeper.
>
> I (and, I think, Jose Daniel Anadon, the original poster of this question)
> have a predictive model for the presence of, say, animal_X. This is a spatial
> model that can be represented on maps and is based on known locations where
> animal_X is present and (usually) known locations where animal_X is absent.
> Output of the analysis (using any number of analytic routines, including
> logit, randomForest, maximum entropy, Mahalanobis distance...) is a full map
> where every spot on the map has a probability that that particular location
> has the appropriate habitat for animal_X.
>
> This output can be visualized by just using a color scale (perhaps blue for
> low probability to red for high probability), BUT, there are times when we
> want to apply a cutoff to this probability output and create a product where
> we can say either "yes, animal_X habitat is predicted here" or "no, animal_X
> habitat is not predicted here."
>
> Note this is the final analytic step. There are no later analysis steps and
> so (possibly) adjustments for multiple comparisons do not come into play.
>
> Indeed, it seems that using a standard process to find a threshold reduces
> the arbitrariness of the probability color scale (at what probability do we
> set 'red'? at what probability do we set 'blue'?).
>
> Are there alternative approaches that reduce the drawbacks you allude to?
>
> How would you turn a surface of probabilities into a binary surface of yes-no?
>
> Thank you for your time.
> Sincerely,
> Tim Howard
>
> Ecologist
> New York Natural Heritage Program

Tim,

I think that 'animal_X habitat is predicted here' would hide a lot of useful
information, especially "gray zones" or uncertain areas. I think that a
continuous mapping of probabilities to a gray scale or to the heat spectrum
would work best. Bill Cleveland also has another idea of using 5 saturation
levels on each of 2 hues to get 10 levels with easier human discrimination.
You might also consider thermometer plots, which give some of the most
accurate human perception of a continuous variable. For the first 2 ideas you
may have to round probabilities to give just 10 intervals (or use deciles).

If you choose cutpoints from the data, the cutpoint itself is uncertain, and
that uncertainty may have to be taken into account. See for example
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/fehbib.html#roy06dic

Frank

>
> >>>>Frank E Harrell Jr <[EMAIL PROTECTED]> 03/31/06 11:20 AM >>>
>
> Choosing cutoffs is fraught with difficulties, arbitrariness,
> inefficiency, and the necessity to use a complex adjustment for multiple
> comparisons in later analysis steps unless the dataset used to generate
> the cutoff was so large as could be considered infinite.
>

-- 
Frank E Harrell Jr
Professor and Chair, School of Medicine
Department of Biostatistics, Vanderbilt University

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
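
The suggestion in the reply above, rounding probabilities to 10 intervals (or using deciles) and then drawing them with 5 saturation levels on each of 2 hues, can be sketched roughly as follows. This is only an illustration in Python using the standard library, not code from the thread. The function names, the blue/red hue choice (borrowed from Tim's description of the color scale), and the rule that saturation is strongest at the two extremes are all assumptions made for the example.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_width_level(p, n_levels=10):
    """Map a probability in [0, 1] to an integer level 0..n_levels-1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # p == 1.0 would fall just past the last bin, so clamp it into the top bin.
    return min(int(p * n_levels), n_levels - 1)


def decile_levels(probs, n_levels=10):
    """Map each probability to the index of its empirical quantile bin."""
    # quantiles() returns the n_levels - 1 interior cut points.
    cuts = quantiles(probs, n=n_levels, method="inclusive")
    return [bisect_right(cuts, p) for p in probs]


def hue_and_saturation(level, per_hue=5):
    """Split levels 0..9 into two hues with five saturation steps each.

    Assumption for this sketch: saturation step 5 is the most saturated
    and is used at the extremes, fading to step 1 near the middle.
    """
    if not 0 <= level < 2 * per_hue:
        raise ValueError("level out of range")
    if level < per_hue:
        return ("blue", per_hue - level)   # low probabilities
    return ("red", level - per_hue + 1)    # high probabilities
```

For a single map cell, `hue_and_saturation(equal_width_level(p))` gives the hue and saturation step to draw. Equal-width bins keep the legend readable as probabilities, while `decile_levels` puts roughly equal numbers of cells in each level at the cost of bin edges that depend on the data.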
