On 6/29/07, John Randall wrote:
Raul Miller wrote:
> Sure -- that's why I called that numerical model a way of "checking
> my work" instead of a "proof". But, if the math is valid, then the math
> should remain valid when I plug in the numbers.
This is where the problem lies.
Let S^2=(1/n)\sum (X_i-\mu)^2. Then E(S^2)=\sigma^2, the population
variance.
However, if you replace \mu by \bar X, it is not true that
E((1/n)\sum (X_i-\bar X)^2)=\sigma^2.
You are assuming that E((X-Y)^2)=E((X-E(Y))^2), which is false. Assuming
this is equivalent to saying E(Y^2)=E(Y)^2.
Ok, it looks like my implementation was incorrect.
I had:
NB. $=(1/(n-1))( \sum E((X_i-\mu)^2)-nE((\bar X-\mu)^2) )$
t5=:(%n1)*+/"1 mean ([:*:(-mean))"1 samples
But that code clearly does not match the underlying concept. As you
point out, I do not use mu at all here.
However, if I replace that t5=: line with
t5=:(%n1)* ( (+/n#mean *:mu-~y)) - n*([:mean ([:*:mu-~mean)"1) samples
the assertions are still valid.
Moreover, they are valid both in the version for your proof and in my
variant where n1=:n-1 and s0=:s1.
At least, this is the case for every example I have tried.
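
For reference, the identity in the NB. comment above can be derived in a few lines. This is a sketch that assumes the X_i are independent with common mean \mu and variance \sigma^2, so that E((\bar X-\mu)^2)=\sigma^2/n:

```latex
\begin{align*}
\sum_{i=1}^n (X_i-\bar X)^2
  &= \sum_{i=1}^n \bigl((X_i-\mu)-(\bar X-\mu)\bigr)^2 \\
  &= \sum_{i=1}^n (X_i-\mu)^2
     - 2(\bar X-\mu)\sum_{i=1}^n (X_i-\mu)
     + n(\bar X-\mu)^2 \\
  &= \sum_{i=1}^n (X_i-\mu)^2 - n(\bar X-\mu)^2,
     \qquad\text{since } \sum_{i=1}^n (X_i-\mu) = n(\bar X-\mu). \\
E\Bigl(\sum_{i=1}^n (X_i-\bar X)^2\Bigr)
  &= n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2 .
\end{align*}
```

Dividing by n-1 therefore gives expectation \sigma^2, while dividing by n gives ((n-1)/n)\sigma^2.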
> In this case, my "samples" precisely represent the entire population.
>
> For example, let's consider your hypothetical case of number of
> heads from a coin toss.
>
> With a fair coin, the population, with its distribution, is:
> 0: 50%
> 1: 50%
>
> The possible samples for a sample size of 2 are then
> 0 0: 25%
> 0 1: 25%
> 1 0: 25%
> 1 1: 25%
...
> Thus, for E(\sum (X_i-\bar X)^2)=\sum (x_i-\bar x)^2
I still don't get this. I am using the notation x_i for an actual sample
value. The left hand side is a number. The right hand side will vary
based on the actual sample chosen.
Ok... I don't remember why I wrote that final quoted line the
way I did. I thought I had copied and pasted in a line from
your proof that had the term E(\sum (X_i-\bar X)^2), but it's
clear that I did not.
Anyways, I should not have included =\sum (x_i-\bar x)^2
as that is irrelevant to my point, and does not accurately
reflect the concepts I am working with.
> I can determine \sum(X_i-\bar X)^2 for each of those
> potential sample cases (0, 0.5, 0.5, 0) and then
> average them. For the fair coin, this average is
> 0.25. For that unfair coin, I get 0.1875 for this
> average.
This precisely illustrates the point. The population variance is 3%8,
not 3%16. You can see in this case that using denominator n=2
gives the wrong answer, using denominator n-1 gives the correct
answer.
I am not sure this is a valid point.
See
http://en.wikipedia.org/wiki/Standard_deviation#Estimating_population_standard_deviation_from_sample_standard_deviation
as an example treatment which seems to conflict with the assertions
you have advanced in this last paragraph.
As I understand the wiki write up, even if you consider denominator
n-1 as valid for samples which do not represent the population,
you must still use denominator n=2 when you are dealing with
the population distribution.
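
The coin-toss enumeration quoted above can be checked with a short J sketch. The names vals, ssq, expss and popvar are hypothetical helpers, not taken from the code earlier in the thread, and the heads probability 0.25 for the unfair coin is an assumption chosen only because it reproduces the 0.1875 figure.

```j
vals =: 0 1                      NB. population values (tails, heads)
n    =: 2                        NB. sample size
mean =: +/ % #
ssq  =: +/@:*:@(- mean)"1        NB. \sum (x_i-\bar x)^2 for each row

NB. x expss y : expected sum of squares over all ordered samples
NB. of size x, where y lists the probabilities of vals
expss =: 4 : 0
idx =. (x # #y) #: i. */ x # #y  NB. every index tuple of length x
+/ (*/"1 idx { y) * ssq idx { vals
)

NB. variance computed directly from the population distribution y
popvar =: 3 : '+/ y * *: vals - +/ y * vals'

fair   =: n expss 0.5 0.5        NB. 0.25
unfair =: n expss 0.75 0.25      NB. 0.1875 (assumed P(heads)=0.25)
```

For the fair coin, fair % n-1 is 0.25, which equals popvar 0.5 0.5, while fair % n is 0.125. The direct population calculation in popvar weights by probability and involves no n-1 correction; the correction only enters when averaging the per-sample sums of squares.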
--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm