On 15 Nov 2011, at 23:59, Derek Lamb wrote:

> I would like to change some of the definitions of the quantities returned by 
> statsover.  I find that either their names or their calculations are not 
> consistent with normal statistical practices.  However I also know that the 
> statistical terminology used by different communities can be different, so I 
> wanted to make sure I wasn't stepping on too many toes first.  In particular:
> 
> 1) the absolute deviation is given in the docs as:
>       ADEV = sqrt(sum( abs(x-mean(x)) )/N)
> with a note that "This is also called the standard deviation"

You are totally right about this one: a) this has never been called the 
standard deviation, and b) the absolute deviation has never been defined in 
this way. Even the units would be wrong with this usage, since the square root 
gives sqrt(units of x) rather than units of x. There is some variation in the 
definition of the absolute deviation and in the terminology, although it is 
never what you show there. The most common in my experience is:

   ADEV = Sum( |x-<x>|)/N,

which is what you are suggesting, where <x> is the mean. Sometimes it is the 
median instead (my personal preference). In this case it is known as the 
average absolute deviation or the mean absolute deviation - in the latter case 
you often find it with the acronym MAD.  There is also an even more robust 
estimator called the median absolute deviation which is:

   MedAD = median ( |x-<x>|)

but I see this much less often. It could be good to have in PDL perhaps, but as 
the name normally would be MAD it could be confusing.

I'd suggest leaving ADEV as the average absolute deviation above, with <x> 
being mean(x), which I think is exactly what you suggest. I do think this has 
to be changed, as the current implementation is plain wrong.
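
To make the difference concrete, here is a minimal sketch in plain Python (not 
PDL, and not the statsover code itself). The function names are hypothetical. 
The median absolute deviation here is taken about the median, which is an 
assumption on my part; the formula above just writes <x>. Scaling the data by 
100 should scale any sensible deviation by 100, but the documented formula 
only scales by 10, which is the units problem.

```python
from math import sqrt
from statistics import mean, median

def adev_documented(xs):
    """Formula as quoted from the docs: sqrt(sum(|x - mean|) / N)."""
    m = mean(xs)
    return sqrt(sum(abs(x - m) for x in xs) / len(xs))

def mean_abs_dev(xs):
    """Average absolute deviation about the mean: sum(|x - mean|) / N."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def median_abs_dev(xs):
    """Median absolute deviation, centred on the median (assumption)."""
    med = median(xs)
    return median(abs(x - med) for x in xs)

data = [1.0, 2.0, 3.0, 4.0, 10.0]      # illustrative values only
scaled = [100.0 * x for x in data]     # same data in units 100x smaller

# Ratio of the statistic after and before rescaling the data by 100.
ratio_mean_abs = mean_abs_dev(scaled) / mean_abs_dev(data)
ratio_median_abs = median_abs_dev(scaled) / median_abs_dev(data)
ratio_documented = adev_documented(scaled) / adev_documented(data)

print("mean abs dev scales by  ", ratio_mean_abs)
print("median abs dev scales by", ratio_median_abs)
print("documented ADEV scales by", ratio_documented)
```

The single outlier (10.0) also shows why the median-based version is called 
more robust: it moves the mean absolute deviation much more than the median 
absolute deviation.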

> 3) We have two root-mean-square calculations, a regular parent distribution 
> divide-by-N, and a sample population divide-by-(N-1).  I'm not sure why we 
> have both of these--will a piddle ever be able to contain a parent 
> distribution?  Probably not--my definition has it taking the average as the 
> number of points goes to infinity.  If it were up to me I would remove the 
> RMS calculation so that statsover would only return 6 quantities (including 
> the PRMS) instead of 7--the difference in the two calculations is negligible 
> for large datasets, and for small datasets one should not be using the RMS 
> calculation anyway, correct?  But I worry about backwards compatibility, 
> particularly with these sorts of constructs:
> 
> $rms = @{statsover($pdl)}[-1]  (that doesn't work, I can never remember that 
> syntax, but you probably get the point--the poor user is going to get the 
> ADEV instead)

Bah, I didn't realise we had two. The sample variance is probably the most 
sensible to keep - but note that if you know (somehow) the mean, then even the 
sample variance is divided by N. Anyway, I think it is dodgy to make 
significant changes here in stats - changing the docs would be my preferred 
solution here.
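
For reference, a small sketch of the two calculations in plain Python (again 
not PDL; rms_dev and its ddof and center parameters are hypothetical names, 
not statsover's). It divides by N when ddof=0 and by N-1 when ddof=1, and 
accepts an externally known mean, in which case dividing by N is the 
appropriate choice.

```python
from math import sqrt
from statistics import mean

def rms_dev(xs, ddof=0, center=None):
    """RMS deviation about `center` (default: the mean of xs).

    ddof=0 divides by N; ddof=1 divides by N-1 (sample estimate).
    Both parameter names are illustrative, not part of any PDL API.
    """
    m = mean(xs) if center is None else center
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - ddof))

small = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # N = 8
large = [float(i) for i in range(10000)]           # N = 10000

# The two estimates differ by the factor sqrt(N / (N - 1)).
ratio_small = rms_dev(small, ddof=1) / rms_dev(small, ddof=0)
ratio_large = rms_dev(large, ddof=1) / rms_dev(large, ddof=0)

print("N=8:     divide-by-(N-1) / divide-by-N =", ratio_small)
print("N=10000: divide-by-(N-1) / divide-by-N =", ratio_large)

# If the true mean is known independently, use it and divide by N.
known_mean_rms = rms_dev(small, ddof=0, center=5.0)
```

The ratio is about 7% above 1 for eight points and well under 0.01% above 1 
for ten thousand, which is the "negligible for large datasets" point.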

        Cheers,
                Jarle.


_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
