On 30-Aug-09 16:24:08, Emmanuel Charpentier wrote: > Le dimanche 30 août 2009 Ã_ 18:43 +0530, Ajay Shah a écrit : >> Folks, >> I have this code fragment: >> set.seed(1001) >> x <- c(0.79211363702017, 0.940536712079832, 0.859757602692931, >> 0.82529998629531, 0.973451006822, 0.92378802164835, >> 0.996679563355802,0.943347739494445, 0.992873542980045, >> 0.870624707845108,0.935917364493788) >> range(x) >> # from 0.79 to 0.996 >> >> e <- function(x,d) { >> median(x[d]) >> } >> >> b <- boot(x, e, R=1000) >> boot.ci(b) >> >> The 95% confidence interval, as seen with `Normal' and `Basic' >> calculations, has an upper bound of 1.0028 and 1.0121. >> >> How is this possible? If I sample with replacement from values which >> are all lower than 1, then any sample median of these bootstrap >> samples should be lower than 1. The upper cutoff of the 95% confidence >> interval should also be below 1. > > Nope. "Normal" and "Basic" try to adjust (some form of) normal > distribution to the sample of your statistic. But the median of such a > small sample is quite far from a normal (hint : it is a discrete > distribution with only very few possible values, at most as many value > as the sample. Exercise : derive the distribution of median(x)...). > > To convince yourself, look at the histogram of the bootstrap > distribution of median(x). Contrast with the bootstrap distribution of > mean(x). Meditate. Conclude... > HTH, > Emmanuel Charpentier
Ajay, You have not said why you are adopting the bootstrap approach. Maybe you specifically want to study the behaviour of bootstrap methods, in which case my response below is irrelevant. But if, on the other hand, you are simply looking for a confidence interval for the median, you might consider a non-paramatric approach. This -- which does not depend on distributional assumptions about the data -- is based on the fact that if "med" denotes the median of the distribution of a continuous variables X, Prob(X < med) = P(X > med) = 1/2 Hence, if X[1], X[2], ... , X[n] denote the values of a random sample of X-values in increasing order, then Prob(X[1] > med) = 1/2^n, and for r > 1: Prob( (X[r]>med) ) = Prob( (X[1]>med) ) + Prob( (X[1]<med)&(X[2]>med) ) + ... + Prob( (X[1]<med)&(X[2]<med)&...&(X[r-1]<med)&(X[r]>med) ) i.e. terms summed over all the r disjoint ways in which "X[r]>med" can occur. These terms are all straightforward binomial probabilities, with p = 1/2, so Prob( (X[r]>med) ) = (1/2^n) + n*(1/2^n) + ... + choose(n,(r-1))*(1/2^n) = pbinom((r-1),n,1/2) Hence if, for instance, you find the maximum value of r such that for given n: Prob( (X[r]>med) < = 0.025 then the probability is at least 0.975 that "X[r] < med" whatever the value of "med" may be. Hence the element of the sorted sample in the r-th position is a lower 97.5% one-sided confidence limit for "med". Because you are looking at the median, the situation is symmetrical, i.e. Prob(X > med) = Prob(X < med) = 1/2, so a corresponding 97.5% upper one-sided confidence limit for "med" is the element of the sorted sample in the (n+1-r)-th position. Hence the pair of these two limits is a 95% confidence interval for the median. Now, since your (admittedly small) sample size is n=11, you can get at it as follows: Ps <- pbinom(q=(0:11), n=11, p=1/2) round(rbind((0:11),Ps,rev(Ps)), 3) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] 0 0.006 0.033 0.113 0.274 0.500 0.726 0.887 0.967 0.994 1.000 1 1 1.000 0.994 0.967 0.887 0.726 0.500 0.274 0.113 0.033 0.006 0 Hence the maximum value of r from the first line is r=2, so the second of 11 sorted observations is the lower limit, and the 10th is the upper limit, for a 95% confidence interval. In the case of your data, x <- c(0.79211363702017, 0.940536712079832, 0.859757602692931, 0.82529998629531, 0.973451006822, 0.92378802164835, 0.996679563355802,0.943347739494445, 0.992873542980045, 0.870624707845108,0.935917364493788) print(sort(x)[c(2,10)],15) # [1] 0.825299986295310 0.992873542980045 Note that this is a conservative confidence interval, in that it is guaranteed that the confidence level Prob(LowerLimit < median < UpperLimit ) > 0.95 and it is the shortest such interval with symmetry. As it happens, with n=11, this probability is 1 - 2*pbinom(1,11,1/2) = 0.9882812 so it is very conservative (a consequence of small n). You could get a less conservative interval by using an asymmetrical interval, e.g. the 2nd and 9th, or the 3rd and 10th, when the probability would be 1 - pbinom(1,11,1/2) - pbinom(2,11,1/2) = 0.9614258 which is pretty close to the "target confidence level" while still not being less than the target, but then you have to have a reason for preferring one ([2,9]) to the other [(3,10)]! Or you could choose one of them at random ... Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 31-Aug-09 Time: 12:05:40 ------------------------------ XFMail ------------------------------ ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.