Robert J. MacG. Dawson wrote:

> Mike wrote:
>>
>> Greetings all,
>>
>> I'd like to estimate the 95th percentile of a distribution p(X) by
>> making N independent measurements of X. I'm assuming that the 95th
>> percentile of measurements is the best estimate of the 95th
>> percentile for the distribution.
>>
>> Is this halfway reasonable?
>> Do I need to make strong assumptions about the shape of p(X)?
>> How would I arrive at the s.d. of the estimate, or some other
>> indicator of quality?
>
> (1) In the absence of distributional assumptions, I don't think you
> have any alternative but to use the 95th percentile of the data. You
> will have no way to decide if the resulting estimator is unbiased or
> to determine its standard deviation. Consider, for instance, the
> prizes in a fair lottery with 1000 one-dollar tickets and one $1000
> prize; and suppose your sample consists of 10 tickets.
>
> The true 95th percentile prize is $0. However, one time in 100 your
> sample will contain the winning ticket and you will estimate the 95th
> percentile at $1000; thus your mean estimate will be $10.
>
> Consideration of such extreme distributions also shows that you have
> no useful way of estimating the SD from your sample.
>
> (2) If you have a single-parameter model, estimating the 95th
> percentile is normally equivalent to estimating the parameter.
> (Pathological counterexamples, such as ones in which the 95th
> percentile is the same for every distribution in the family, exist.
> [E.g. Unif[-19A, A] if you want a simple one!]) That is: estimating
> the 95th percentile is essentially the same task as estimating the
> mean, or the median, or...
>
> In this case it is highly unlikely that the 95th percentile of the
> sample will be an optimal estimator for the 95th percentile of the
> distribution.
>
> (3) Somewhere between these extremes there are presumably
> semiparametric families of distributions (perhaps symmetric
> distributions, or distributions obeying entropy constraints, or
> distributions within a certain distance in probability of normal
> distributions, or...) for which other answers to your question are
> appropriate. Just as a guess, I'd say that this looks like a serious
> research problem (or cottage industry) if it hasn't already been done.
>
> -Robert Dawson
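Robert's lottery arithmetic is easy to check with a tiny simulation
(a sketch only, in Python/NumPy; I take the largest of the 10 sampled
prizes as the empirical 95th percentile, which is what the usual
inverse-CDF definition gives for n = 10):

    import numpy as np

    rng = np.random.default_rng(0)

    # Robert's lottery: 1000 one-dollar tickets, one of which pays $1000.
    # The true 95th percentile of the prize distribution is $0.
    prizes = np.zeros(1000)
    prizes[0] = 1000.0

    n_reps = 50_000
    sample_size = 10
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        sample = rng.choice(prizes, size=sample_size, replace=False)
        # With n = 10, the empirical 95th percentile (inverse-CDF
        # definition) is simply the largest observation.
        estimates[i] = sample.max()

    print("mean of the sample 95th percentile:", estimates.mean())
    # About $10, versus the true value of $0: roughly 1 sample in 100
    # contains the winning ticket, which pulls its estimate up to $1000.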
Robert's reply overlooks a number of possibilities:

(i) Use a kernel estimate of the distribution function, as a
non-parametric alternative to just using the sample quantile. (A rough
sketch follows below.)

(ii) An alternative indication of the "quality" of the procedure is to
replace the question of the s.d. of the quantile with the question of
how well the probability (that a new random value will fall above a
given fixed threshold) is estimated by the sample proportion. You can
then find a confidence interval for the true percentage point of a
value selected to be approximately at the 95% point; this is readily
worked out from the Binomial distribution. (Sketch below.)

(iii) Fit a distribution to the data and get an idea of the sampling
variation of the estimated quantile by simulating data sets of the same
sample size from the fitted distribution. Possibly over-fit the
distribution, but pay attention to getting the shape of the density
"right" in the region near the 95th percentile. (Sketch below.)

David Jones
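To make (i) concrete, here is a rough sketch of a Gaussian-kernel
estimate of the distribution function, inverted numerically to give a
smoothed 95th-percentile estimate. The lognormal example data and the
Silverman-style bandwidth are only placeholders, not recommendations:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    def kernel_cdf(x, data, h):
        """Gaussian-kernel estimate of the distribution function at x."""
        return norm.cdf((x - data[:, None]) / h).mean(axis=0)

    def kernel_quantile(data, p=0.95, h=None):
        """Invert the smoothed CDF numerically to estimate the p-quantile."""
        data = np.asarray(data, dtype=float)
        if h is None:
            # Silverman-style rule of thumb; purely a placeholder choice.
            h = 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)
        lo, hi = data.min() - 5 * h, data.max() + 5 * h
        return brentq(lambda t: kernel_cdf(np.array([t]), data, h)[0] - p,
                      lo, hi)

    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # stand-in data
    print("sample 95th percentile:  ", np.percentile(x, 95))
    print("kernel-smoothed estimate:", kernel_quantile(x, 0.95))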
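For (ii), a minimal sketch of the Binomial calculation: treat the number
of observations at or below a chosen threshold as Binomial(N,
F(threshold)) and form an exact (Clopper-Pearson) interval for
F(threshold). Since the threshold here is picked from the same data,
the interval is only indicative of a value "approximately at the 95%
point":

    import numpy as np
    from scipy.stats import beta

    def clopper_pearson(k, n, conf=0.95):
        """Exact (Clopper-Pearson) interval for a binomial proportion,
        having observed k 'successes' in n trials."""
        alpha = 1.0 - conf
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    rng = np.random.default_rng(2)
    x = rng.lognormal(size=200)              # stand-in for the measurements

    threshold = np.percentile(x, 95)         # a value near the 95% point
    k = int(np.sum(x <= threshold))          # observations at or below it
    print("interval for F(threshold):", clopper_pearson(k, len(x)))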
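And for (iii), a parametric-bootstrap sketch: fit a candidate model (a
lognormal, purely for illustration), simulate data sets of the same
size from the fit, and look at the spread of the re-estimated 95th
percentile:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # stand-in data

    # Fit a candidate model (lognormal here, only for illustration).
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    fitted = stats.lognorm(shape, loc=loc, scale=scale)

    # Simulate samples of the same size from the fitted distribution and
    # see how the estimated 95th percentile varies from sample to sample.
    n_boot = 2000
    q_hat = np.array([
        np.percentile(fitted.rvs(size=len(x), random_state=rng), 95)
        for _ in range(n_boot)
    ])
    print("point estimate:      ", np.percentile(x, 95))
    print("bootstrap s.d.:      ", q_hat.std(ddof=1))
    print("central 95% of reps: ", np.percentile(q_hat, [2.5, 97.5]))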
