Re: [R] Standard error for the area under a smoothed ROC curve?

Frank E Harrell Jr Wed, 12 Jan 2005 11:48:22 -0800

Dan Bolser wrote:

On Wed, 12 Jan 2005, Frank E Harrell Jr wrote:

Dan Bolser wrote:

Hello,

I am making some use of ROC curve analysis.

I find much help on the mailing list, and I have used the Area Under the
Curve (AUC) functions from the ROC function in the bioconductor project...

http://www.bioconductor.org/repository/release1.5/package/Source/ ROC_1.0.13.tar.gz

However, I read here...

http://www.medcalc.be/manual/mpage06-13b.php

"The 95% confidence interval for the area can be used to test the
hypothesis that the theoretical area is 0.5. If the confidence interval
does not include the 0.5 value, then there is evidence that the laboratory
test does have an ability to distinguish between the two groups (Hanley &
McNeil, 1982; Zweig & Campbell, 1993)."

But aside from early on the above article is short on details. Can anyone
tell me how to calculate the CI of the AUC calculation?


I read this...

http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf

Which talks about resampling (by showing R code), but I can't understand
what is going on, or what is calculated (the example given is specific to
microarray analysis I think).

I think a general AUC CI function would be a good addition to the ROC
package.


One more thing, in calculating the AUC I see the splines function is
recomended over the approx function. Here...

http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html

How would I rewrite the following AUC functions (adapted from bioconductor
source) to use splines (or approxfun or splinefun) ...

spe # Specificity


[1] 0.02173913 0.13043478 0.21739130 0.32608696 0.43478261 0.54347826
[7] 0.65217391 0.76086957 0.89130435 1.00000000 1.00000000 1.00000000
[13] 1.00000000

sen # Sensitivity


[1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9302326 0.8139535
[8] 0.6976744 0.5581395 0.4418605 0.3488372 0.2325581 0.1162791

trapezint(1-spe,sen)
my.integrate(1-spe,sen)

## Functions
## Nicked (and modified) from the ROC function in bioconductor.
"trapezint" <-
function (x, y, a = 0, b = 1)
{
   if (x[1] > x[length(x)]) {
     x <- rev(x)
     y <- rev(y)
   }
   y <- y[x >= a & x <= b]
   x <- x[x >= a & x <= b]
   if (length(unique(x)) < 2)
       return(NA)
   ya <- approx(x, y, a, ties = max, rule = 2)$y
   yb <- approx(x, y, b, ties = max, rule = 2)$y
   x <- c(a, x, b)
   y <- c(ya, y, yb)
   h <- diff(x)
   lx <- length(x)
   0.5 * sum(h * (y[-1] + y[-lx]))
}

"my.integrate" <-
function (x, y, t0 = 1)
{
   f <- function(j) approx(x,y,j,rule=2,ties=max)$y
   integrate(f, 0, t0)$value
}

Thanks for any pointers,
Dan.

I don't see why the above formulas are being used. The Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works great. Just get the U statistic (concordance probability) used in Wilcoxon. As Somers' Dxy rank correlation coefficient is 2*(1-C) where C is the concordance or ROC area, the Hmisc package function rcorr.cens uses U statistic methods to get the standard error of Dxy. You can easily translate this to a standard error of C.

I am sure I could do this easily, except I can't.

The good thing about ROC is that I understand it (I can see it). I know why the area means what it means, and I could even imagine how sampling the data could give a CI on the area.

However, I don't know why "the area under the ROC curve is well known to
be equivalent to the numerator of the Mann-Whitney U statistic" - from

http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf


Nor do I know how to calculate "the numerator of the Mann-Whitney U
statistic".

This is clear in the original Bamber or Hanley-McNeil articles. The ROC area is a linear translation of the mean rank of predicted values in one of the two outcome groups. The little somers2 function in Hmisc shows this:

##S function somers2
##
##    Calculates concordance probability and Somers'  Dxy  rank  correlation
##    between  a  variable  X  (for  which  ties are counted) and a binary
##    variable Y (having values 0 and 1, for which ties are not  counted).
##    Uses short cut method based on average ranks in two groups.
##
##    Usage:
##
##         somers2(X,Y)
##
##    Returns vector whose elements are C Index, Dxy, n and missing, where
##    C Index is the concordance probability and Dxy=2(C Index-.5).
##
##    F. Harrell 28 Nov 90     6 Apr 98: added weights

somers2 <- function(x, y, weights=NULL, normwt=FALSE, na.rm=TRUE) {
  if(length(y)!=length(x))stop("y must have same length as x")
  y <- as.integer(y)
  wtpres <- length(weights)
  if(wtpres && (wtpres != length(x)))
        stop('weights must have same length as x')
  if(na.rm) {
        miss <- if(wtpres) is.na(x + y + weights) else is.na(x + y)
        nmiss <- sum(miss)
        if(nmiss>0)     {
          miss <- !miss
          x <- x[miss]
          y <- y[miss]
          if(wtpres) weights <- weights[miss]
        }
  } else nmiss <- 0
   u <- sort(unique(y))
  if(any(y %nin% 0:1)) stop('y must be binary')  ## 7dec02
  if(wtpres) {
        if(normwt) weights <- length(x)*weights/sum(weights)
        n <- sum(weights)
  } else n <- length(x)

  if(n<2)stop("must have >=2 non-missing observations")

  n1 <- if(wtpres)sum(weights[y==1]) else sum(y==1)
  if(n1==0 || n1==n) return(c(C=NA,Dxy=NA,n=n,Missing=nmiss))  ## 7dec02
  ## added weights > 0 30Mar00
  mean.rank <- if(wtpres) mean(wtd.rank(x, weights, na.rm=FALSE)[weights >
                                                      0 & y==1]) else
                 mean(rank(x)[y==1])
  c.index <- (mean.rank - (n1+1)/2)/(n-n1)
  dxy <- 2*(c.index-.5)
  r <- c(c.index, dxy, n, nmiss)
  names(r) <- c("C","Dxy","n","Missing")
  r
}

Just ignore all the stuff with weights. The ROC area is given by c.index in the 5th line from the end. But somers2 does not compute its standard error.

A standard text on nonparametric tests shows how to compute the Wilcoxon-M-W test stat based on mean(rank(x in y==1 group)).

Frank


Can you point me at some ? pages or tutorials or even give an example of
what you suggested so I can try to follow it through?

I tried the following...

x <- rnorm(100,5,1)    # REAL NEGATIVE
#
y <- rnorm(100,8,1)    # REAL POSITIVE

t <- wilcox.test(x,y,paired=FALSE,conf.int=0.95)

t

        Wilcoxon rank sum test with continuity correction

data: x and y W = 132, p-value < 2.2e-16 alternative hypothesis: true mu is not equal to 0 95 percent confidence interval: -3.232207 -2.664620 sample estimates: difference in location -2.957496


And from ?wilcox.test ...

"if both x and y are given and paired is FALSE, a Wilcoxon rank sum test
(equivalent to the Mann-Whitney test) is carried out."


But I don't know what to do next. Sorry for all the questions, but I am a
dumb biologist.

Thanks for the help, Dan.

Frank

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Standard error for the area under a smoothed ROC curve?

Reply via email to