Raul Miller wrote: > That said... I want to be able to talk about populations, distributions, > samples, etc. using a consistent set of terminology regardless of > whether or not the distribution of the population is known or unknown, > or partially understood.
The standard terminology covers all these cases, although there are some ambiguities. First note that a sample space is just a set of outcomes: it is different from a sample, which is a list of random variables. A population is just something with a distribution, a probability function f, which simply means f(x)>=0 for all x and \sum f(x)=1.

The population variance exists whether we know the population distribution or not, whether we have an estimator or not, and it is a number. If the distribution has probability function f, and X is a random variable having distribution f, then if E(X)=\mu, the population variance is defined as

\sigma^2 = E((X-\mu)^2) = E(X^2)-E(X)^2,

where \mu=E(X)=\sum x f(x) and E(X^2)=\sum x^2 f(x). Each of these is a number. To calculate \sigma^2 you have to know f, but it is defined in any case. The purpose of estimators is to get estimates for population parameters when you do not know f.

A sample of size n is (as I have previously described) a list of n independent random variables, each of whose distribution is f, and a statistic is a function of these variables. To get around the terminology difficulties, suppose we have a statistic A(X1,...,Xn). This is a random variable, and so has an expected value. We say A is an unbiased estimator of \sigma^2 if E(A)=\sigma^2. We then conduct a statistical experiment and evaluate X1,...,Xn on the outcome to give numbers x1,...,xn, the value of the random sample. Then A(x1,...,xn) gives a number which is an estimate for \sigma^2. If we know the distribution of A, we can also get a confidence interval.

All that I am asserting is that if

A(X1,...,Xn) = (1/(n-1)) \sum (Xi-\bar X)^2

then E(A)=\sigma^2, and the right-hand side of the expression for A is called the sample variance. There is some ambiguity here: the sample variance is used to refer to either S^2=(1/(n-1)) \sum (Xi-\bar X)^2 (a random variable) or s^2=(1/(n-1)) \sum (xi-\bar x)^2 (the value of this on a particular sample). There is a typographical convention that distinguishes these cases: population parameters are lower-case Greek letters, random variables are upper-case Roman letters, and the values of random variables are lower-case Roman letters.

However, I believe you are using the words sample and sampling in a nonstandard sense. Here's what I think you are doing. You have a known population distribution f. You now take a vector v of length n in which x appears c(x) times, with the property that c(x)/n is approximately f(x). You regard this as a set of equiprobable outcomes, and so it determines a population with distribution g satisfying g(x)=c(x)/n. This population has

\mu = \sum x g(x) = (1/n) \sum x c(x)
\sigma^2 = \sum (x-\mu)^2 g(x) = (1/n) \sum (x-\mu)^2 c(x).

So in this case using a denominator of n makes sense. However, this is not the sample variance in any normally accepted sense: there is no sample in sight. There are n equiprobable outcomes defining a random variable X with P(X=x)=c(x)/n.

Let me know if this is what you are getting at.

Best wishes,

John
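
For concreteness, here is a rough J sketch of the two divisors discussed above (the verb names mean, svar, and pvar are my own, not anything standard): svar divides the sum of squared deviations by n-1 and is the unbiased estimator of \sigma^2, while pvar divides by n and is the variance of a population of n equiprobable outcomes, as in the c(x)/n construction.

   mean =: +/ % #                     NB. arithmetic mean
   svar =: +/@(*:@(- mean)) % <:@#    NB. divisor n-1: sample variance S^2
   pvar =: +/@(*:@(- mean)) % #       NB. divisor n: variance of n equiprobable outcomes
   svar 2 3 5 7 11
12.8
   pvar 2 3 5 7 11
10.24
   NB. Rough Monte Carlo check of unbiasedness: for samples of size 5 drawn
   NB. from a fair die (population variance 35/12, about 2.917), averaging
   NB. svar over many samples should come out near 2.917, while pvar should
   NB. come out near (4/5)*2.917, about 2.333.
   samples =: 1 + ? 10000 5 $ 6
   mean svar"1 samples
   mean pvar"1 samples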
