On 6/29/07, John Randall <[EMAIL PROTECTED]> wrote:
> I understand what you are trying to do, but I believe it has the same
> problem as we have had before:
> - A statistic (e.g. the sample mean) is a random variable: it is
>   different from its value on an actual sample.
I believe I've treated them that way, but perhaps I should have
used a different naming convention ('possibilities' instead of
'samples' when enumerating possible cases).
> - The expectation of a statistic is different from its value on an
>   actual sample.
Absolutely.
More specifically, I use "mean outcome when considering
all possibilities" for the expectation. And this has very
different characteristics from straightforward arithmetic
on samples.
> The distribution of a statistic depends on the population
> distribution. Its expected value depends on the population
> distribution. Since we are trying to use a statistic to tell us
> something about the population, we seem to be in a chicken and egg
> situation: the population has an unknown distribution, and the
> statistic depends on this.
Well... my approach requires that I work with a specific population and
a specific distribution. Also, my implementation only supports certain
kinds of populations (small, finite ones) and distributions (where each
member's probability is a low-order rational number), but I have tested
different distributions for the same population.
I should probably also note that I get an assertion failure for
0 1 2 assertions 1 0 1 0 0 0
But this assertion failure is on one of your "givens", and the
underlying problem can be avoided using
0 1 2 assertions&x: 1 0 1 0 0 0
I suppose I could also have accepted an argument where each
member of the population had an explicit probability assigned
to it. But as far as I can see, this would just have made my
implementation more complicated (when determining possibilities,
I would need to use the product of the population-member
probabilities to determine the probability of each possibility)
without changing the fundamental nature of the problem.
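That product rule can be sketched as follows (in Python rather than J,
purely as an illustration; the distribution shown is the hypothetical
unfair coin I use as an example below):

```python
from fractions import Fraction

# Hypothetical unfair-coin distribution: P(0) = 1/4, P(1) = 3/4.
dist = {0: Fraction(1, 4), 1: Fraction(3, 4)}

def possibility_prob(sample, dist):
    """Probability of one possibility (an ordered sample): the product
    of the population-member probabilities, assuming independent draws."""
    p = Fraction(1)
    for x in sample:
        p *= dist[x]
    return p

print(possibility_prob((0, 1), dist))  # 3/16
print(possibility_prob((1, 1), dist))  # 9/16
```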
> The idea of the case we are discussing is to estimate the population
> variance without knowing anything about the population distribution.
> This is point estimation, and the proof I gave shows that this can be
> done, namely that E(S^2)=\sigma^2.
Sure -- that's why I called that numerical model a way of "checking
my work" rather than a "proof". But if the math is valid, then the
math should remain valid when I plug in the numbers.
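For completeness, the standard algebra behind E(S^2)=\sigma^2 (assuming
S^2 is defined with the n-1 divisor, and the X_i are independent draws
with mean \mu and variance \sigma^2) can be sketched as:

```latex
\begin{aligned}
E\Bigl(\sum_{i=1}^{n}(X_i-\bar X)^2\Bigr)
  &= E\Bigl(\sum_{i=1}^{n} X_i^2 - n\,\bar X^2\Bigr)
   = n\,E(X^2) - n\,E(\bar X^2) \\
  &= n\,(\sigma^2+\mu^2) - n\Bigl(\frac{\sigma^2}{n}+\mu^2\Bigr)
   = (n-1)\,\sigma^2 ,
\end{aligned}
```

so E(S^2) = E(\sum (X_i-\bar X)^2)/(n-1) = \sigma^2.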
> To see how accurate our estimate is, we would need to know something
> about the population distribution and how this affects the
> distribution of the statistic, and we cannot do it in general. If the
> population is normal, then some multiple of our estimate has the
> chi-squared distribution, so we can do things like construct
> confidence intervals for the estimate.
Yes. And I do not claim that my model is general; I've only
tested a few cases.
That said, my experience while building the model was that errors
in my understanding caused problems with just about any example
population and distribution. For example, if I had not been
trying to build this model, I would not have known to ask about
the lines where it turned out X_i had the population distribution
rather than the sample distribution.
> However, it is not the case that
> E(\sum (X_i-\bar X)^2)=\sum (x_i-\bar x)^2
> any more than it is the case that the expected value of the sample
> mean is the same as its value on an actual sample.
I don't think I said that it did. I would say that
E(\sum (X_i-\bar X)^2)
is the mean of
\sum (X_i-\bar X)^2
taken over all possible samples (and hence all possible
values of X_i and \bar X). And, of course, this is doable
when you are working with a specific population and
sample size.
> If you are trying to calculate an expected value by averaging it over
> samples taken from the population, you will get an estimate, but what
> does it mean? This is precisely what estimation is about.
In this case, my "samples" enumerate the entire space of possible
samples, rather than being draws from it.
For example, let's consider your hypothetical case of the number of
heads from a coin toss.
With a fair coin, the population, with its distribution, is:
0: 50%
1: 50%
The possible samples for a sample size of 2 are then
0 0: 25%
0 1: 25%
1 0: 25%
1 1: 25%
I don't actually need to enumerate probabilities for this
case, since it's evenly distributed. However, if I had an
unfair coin I could deal with that as well, using basically
the same approach:
0: 25%
1: 75%
with possibilities:
0 0: 6.25%
0 1: 18.75%
1 0: 18.75%
1 1: 56.25%
Thus, to evaluate E(\sum (X_i-\bar X)^2), I can determine
\sum (X_i-\bar X)^2 for each of those potential sample
cases (0, 0.5, 0.5, 0) and then average them, weighting
each case by its probability. For the fair coin, this
average is 0.25. For the unfair coin, I get 0.1875.
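As a sanity check, that whole enumeration can be reproduced
mechanically (again in Python rather than J, purely as an
illustration), using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def expected_ssd(dist, n):
    """Expected value of sum((X_i - Xbar)^2) over all possible samples
    of size n, where dist maps population member -> probability."""
    total = Fraction(0)
    for sample in product(dist, repeat=n):  # every possible ordered sample
        p = Fraction(1)
        for x in sample:                    # sample probability is the product
            p *= dist[x]                    # of the member probabilities
        mean = Fraction(sum(sample), n)
        ssd = sum((x - mean) ** 2 for x in sample)
        total += p * ssd
    return total

fair   = {0: Fraction(1, 2), 1: Fraction(1, 2)}
unfair = {0: Fraction(1, 4), 1: Fraction(3, 4)}
print(expected_ssd(fair, 2))    # 1/4  (i.e. 0.25)
print(expected_ssd(unfair, 2))  # 3/16 (i.e. 0.1875)
```

Both results equal (n-1)\sigma^2 for the respective coins, as the
identity E(S^2)=\sigma^2 predicts.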
Anyway, my numerical techniques do not show that your original
proof was valid. Nor do they establish that my modified
version of your proof is valid. All these techniques show is
that, for the example cases, the assertions outlined in your
proof (and in mine) hold.
That said, I think this is a useful approach, since if the proof
is valid those assertions must be valid. If nothing else, this
lets me rather directly zero in on issues I do not understand
in the notation.
Besides, Oleg asked to see what I was doing...
--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm