On Mon, Mar 9, 2009 at 8:29 AM, Alois Schlögl <alois.schlo...@tugraz.at> wrote:
>
> Søren Hauberg wrote:
>> Sat, 07 03 2009 at 09:42 -0500, James K. Lowden wrote:
>>> Alois Schlögl wrote:
>>>> Skipping NA/NaN is valid for the mean as well as for any other
>>>> statistical estimate.
>>> That is not always so.  Suppose you intend to compute the mean of N values
>>> but due to an error in your database query, 90% of those values are
>>> missing.  Are you prepared to say that the mean of the other 10% is
>>> representative?
>>
>> I would say that was the best estimate you could possibly get.
>
>
> Exactly for this reason, the nan-skipping is the right thing to do.
> Actually, NaN/sem.m gives the confidence interval on the mean, if you
> really need it.
>
> I do not understand what advantage there is in distinguishing between NaN
> and NA. In the database, there might be not-a-number values due to missing
> data and some due to a division 0/0. In order to get the best estimate (or
> an estimate at all), you need to ignore both NA's and NaN's.
>

If I have both NAs (missing values) and NaNs (invalid values) in my data,
then the best estimate ignores the NAs but not the NaNs.
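
As a rough illustration of that rule, here is a minimal sketch in Python
rather than Octave. The name `mean_skip_na` is hypothetical, `None` stands
in for NA (Python floats have no separate NA value), and returning `None`
for all-missing input is just one possible convention:

```python
def mean_skip_na(values):
    """Mean that drops missing entries (None, standing in for NA)
    but lets NaN (invalid) entries propagate into the result."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # assumption: all-missing input gives a missing result
    return sum(present) / len(present)

# The missing value is skipped; the mean is taken over the other two.
print(mean_skip_na([1.0, None, 3.0]))

# The invalid value is not skipped, so the result is NaN.
print(mean_skip_na([1.0, float("nan"), 3.0]))
```

A function that skipped both kinds of value could not report the second
case at all.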


> Moreover, the distinction between NaN and NA's complicates things again.
> Should one set a sample to NA or to NaN if there is an overflow in my data
> acquisition? Just having to think about it is pointless.
>

Obviously, it depends on how you want to treat that overflow. If you
want it to be ignored, set it to NA. If you want it to yield an
invalid result, use NaN.
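
To make the choice concrete, a small Python sketch (not Octave) of tagging
overflowed samples either way. `mark_overflow`, the `limit` threshold and
the use of `None` for NA are all assumptions made for illustration:

```python
def mark_overflow(samples, limit, as_missing):
    """Replace samples whose magnitude exceeds `limit`.

    as_missing=True  -> None (stand-in for NA): later statistics may skip it
    as_missing=False -> NaN: later statistics come out invalid
    """
    marker = None if as_missing else float("nan")
    return [marker if abs(s) > limit else s for s in samples]

raw = [0.5, 12.0, -0.25]  # hypothetical readings; 12.0 exceeds the limit
ignored = mark_overflow(raw, limit=10.0, as_missing=True)
flagged = mark_overflow(raw, limit=10.0, as_missing=False)
```

The decision is made once, where the data is acquired, and the statistical
functions then just honour it.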

>
>>
>>> NaNs convey meaning, as Søren said.
>>
>> Actually, what I said was that there was a difference between something
>> being not-a-number, and something being missing. It makes perfect sense
>> to skip missing values when computing the mean value (in the statistical
>> sense). However, it does not make sense to ignore NaN's when they convey
>> the meaning that something went wrong somewhere else in your program.
>> Jaroslav explained this well.
>
>
> In such a case, I strongly recommend an explicit handling of NaN's. The
> code would emphasize that the NaN's "convey meaning".
>

It's not really relevant what anyone recommends. A user of the
statistical functions may or may not choose to follow your
recommendation for "explicit handling", but you just can't tell that
from inside the function. By distinguishing NaNs and NAs you will
support both invalid and missing values. Recommendations may do some
good for novice users, but advanced users prefer functions giving them
choices instead.
Unlike Matlab, Octave has NA, so it is capable of dealing with both
missing and invalid values at the same time, which is clearly superior
to missing-only or invalid-only.
Also, this is analogous to R's capabilities, and R is used *very*
widely amongst statisticians.

regards

-- 
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev
