On 6/29/07, John Randall <[EMAIL PROTECTED]> wrote:
> Raul Miller wrote:
> > On 6/29/07, John Randall <[EMAIL PROTECTED]> wrote:
> > But I believe n(\sigma^2/n) is \sigma^2 and I cannot find any
> > interpretation of $E(S^2)=\frac{1}{n-1}(n\sigma^2 - \sigma^2)$ that
> > seems numerically valid.
> > ...
> This is a statement about the expected value of a random variable. You
> cannot numerically validate it except in cases where the sample space is
> finite and you know the distribution. However, the above statement is
> asserting that S^2 is an unbiased estimator of \sigma^2 for any
> distribution.
I am only trying to validate it for specific cases where the sample space
is small and finite, and where I know the distribution.
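(Concretely, with $S^2=\frac{1}{n-1}\sum (X_i-\bar X)^2$, here is a minimal
brute-force check for the smallest interesting case -- population 0 1,
samples of size 2. The names pop, samples and ssq are just for this
sketch, not part of the model at the bottom of this message:)
mean=: +/ % #
pop=: 0 1
samples=: > , { 2 # <pop      NB. all 4 equally likely samples of size 2
ssq=: +/@:*:@:(- mean)"1      NB. sum of squared deviations, per sample
(mean ssq samples) % 2-1      NB. E(S^2), using the n-1 divisor
0.25
mean *: pop - mean pop        NB. the population variance \sigma^2
0.25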
> > must be false.
> Your calculations seem entirely consistent. What is the problem?
Never mind, for some reason I was confused about the distinction
between \sigma(X_i) and \sigma(\bar X).
I'll paste an updated copy of my numerical model (with my earlier
errors fixed) at the bottom of this message.
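(For the record, the identity I was confusing myself over -- standard,
assuming the n draws $X_1,\dots,X_n$ are independent with common variance
$\sigma^2$, which is the case when sampling with replacement:
$\sigma^2(X_i)=\sigma^2$ for each single draw, while
$\sigma^2(\bar X)=Var(\frac{1}{n}\sum X_i)=\frac{1}{n^2}\sum Var(X_i)=\frac{\sigma^2}{n}$.)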
> > The distribution of the population is:
> > 0: 50%
> > 1: 50%
> >
> > However, any single random sample containing three values
> > cannot represent this distribution. The potential samples are:
> > 0 0 0 (0: 100%, 1: 0%)
> > 0 0 1 (0: 66.67%, 1: 33.33%)
> > etc.
> A single random variable containing one value can "represent" this
> population: counting the number of heads when we flip a coin is precisely
> this. A particular random sample just has some distribution: in this way
> it is representative of the population.
Maybe I'm stumbling over a technical meaning for the term "sample".
I'm intending to use the word "sample" to mean the result of some
sequence of trials. Thus, a sample of three values has the results
from three trials. Although the underlying random variables completely
represent the underlying population, this single sample (the result of these
three trials), in the general case, would not represent complete information
about the underlying population.
That said, there's a problem with the English word "representative",
in that something can represent something else inaccurately and
still be said to represent it. In other words, not all representations
are equally valid.
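(To make that concrete with the same enumeration the model below uses: for
the 0 1 population and samples of size 3, none of the 8 equally likely
samples reproduces the population's 50/50 distribution. The name samples3
is just for this sketch:)
samples3=: > , { 3 # < 0 1     NB. all 8 equally likely samples of size 3
(+/"1 samples3) % 3            NB. fraction of 1s in each sample
0 0.333333 0.333333 0.666667 0.333333 0.666667 0.666667 1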
P.S. here's my current numerical model:
assertions=:1 :0
:
NB. u adjusts the divisor (the count of deviations) used when computing the variance
NB. y is the population
NB. x is the sample size
mean=: +/ % #
s0=: %:@((%u)@# * mean)@:*:@:- mean NB. "sample standard deviation"
s1=: %:@mean@:*:@:- mean NB. "population standard deviation"
n=: x                NB. sample size
n1=: u n             NB. adjusted divisor (n-1 when u is <:)
mu=: mean y=. x: y   NB. population mean, with y converted to extended precision
NB. all possible samples (each has equal probability)
possibilities=:y {~"1 (n##y)#:i.n^~#y
NB. $\sum (X_i-\bar X)^2$
t0=: ([:+/2^~]-mean)"1 possibilities
NB. $=\sum ((X_i-\mu)-(\bar X -\mu))^2$
t1=: ([:+/2^~(mu-~])-mu-~mean)"1 possibilities
assert.t0-:t1
NB. $=\sum ((X_i-\mu)^2) -n(\bar X-\mu)^2$,
t2=: (([:+/2^~mu-~])-n*2^~mu-~mean)"1 possibilities
assert.t0-:t2
NB. since $\sum (X_i-\mu)=(\sum X_i) - n\mu = n(\bar X-\mu)$.
assert. 1=#~.(([:+/mu-~])"1;((n*mu)-~+/)"1;(n*mu-~mean)"1) possibilities
NB. Then $E(S^2)=E(\frac{1}{n-1}\sum (X_i-\bar X)^2)$
t3=: mean *:@s0"1 possibilities
t4=: mean ((%n1)*[:+/2^~]-mean)"1 possibilities
assert. t3-:t4
NB. $=\frac{1}{n-1}(\sum E((X_i-\mu)^2) - nE((\bar X-\mu)^2))$
t5=:(%n1)* ( (+/n#mean *:mu-~y)) - n*([:mean ([:*:mu-~mean)"1) possibilities
assert. t3-:t5
NB. $=\frac{1}{n-1}(\sum \sigma^2(X_i) - n\sigma^2(\bar X))$
t6=:(%n1)* (+/*:n#s1 y) - n* *:@s1@:(mean"1) possibilities
assert. t3-:t6
NB. But $\sigma^2(X_i)=\sigma^2$, $\sigma^2(\bar X)=\sigma^2/n$, so
assert. (*:s1 mean"1 possibilities) -: (%n) * *:s1 y
NB. $E(S^2)=\frac{1}{n-1}(n\sigma^2 - n(\sigma^2/n))=\sigma^2$.
t7=: (%n1)* (n**:s1 y) - *:s1 y
assert. t3-:t7
)
This will compute the variance (the value of E(S^2)) for a given small
finite sample size (which must be the left argument) and a given evenly
distributed population (which must be the right argument), if the verb
argument is <:
2 <: assertions 0 1
0.25
2 <: assertions 1 2 3 4 5 6
2.91667
That said, the interesting thing about <: is that changing the sample
size does not appear to change the computed variance (when computed
against the full population with its distribution). At least
superficially, it seems that if the verb argument is some other linear
operation, the limit of the computed variance (as the sample size
increases) would be the same as the variance computed with <:
[Unless, of course, the sample size is 1 -- then this mechanism
breaks down, since there can be no deviation in that case.]
In other words
test=:1 :0
t=. 2 u assertions y
assert. t -: 3 u assertions y
)
seems to be valid only when the verb is <:
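(Spelling out the algebra behind that -- writing $S_u^2$ for the statistic
the model computes with divisor $u(n)$, and using
$E(\sum (X_i-\bar X)^2)=(n-1)\sigma^2$:
$E(S_u^2)=\frac{(n-1)\sigma^2}{u(n)}$
so the result is independent of the sample size exactly when $u(n)$ is a
constant multiple of $n-1$, and equals $\sigma^2$ when $u(n)=n-1$, i.e.
when u is <: . For a general linear $u(n)=an+b$ the value tends to
$\sigma^2/a$ as $n$ grows, so the limit matches the <: value exactly when
$a=1$.)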
Put differently, the thing I wasn't getting about your proof was that
the assertions must hold for arbitrary n where n (and its dependents)
must be able to vary from one side of an equation to the next.
(As this seems an incredibly important issue, it's probably worth
stating explicitly.)
(And, with a little work, I could probably incorporate a mechanism that
arbitrarily changes n (or uses several different values for n) at
each stage of the calculations.)
Thanks,
--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm