Re: [Numpy-discussion] What should be the result in some statistics corner cases?

Benjamin Root Mon, 15 Jul 2013 08:56:00 -0700

On Jul 15, 2013 11:47 AM, "Charles R Harris" <charlesr.har...@gmail.com>
wrote:


>
>
> On Mon, Jul 15, 2013 at 8:58 AM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
>>
>>
>> On Mon, Jul 15, 2013 at 8:34 AM, Sebastian Berg <
>> sebast...@sipsolutions.net> wrote:
>>
>>> On Mon, 2013-07-15 at 07:52 -0600, Charles R Harris wrote:
>>> >
>>> >
>>> > On Sun, Jul 14, 2013 at 3:35 PM, Charles R Harris
>>> > <charlesr.har...@gmail.com> wrote:
>>> >
>>>
>>> <snip>
>>>
>>> >
>>> >                 For nansum, I would expect 0 even in the case of all
>>> >                 nans.  The point
>>> >                 of these functions is to simply ignore nans, correct?
>>> >                  So I would aim
>>> >                 for this behaviour:  nanfunc(x) behaves the same as
>>> >                 func(x[~isnan(x)])
>>> >
>>> >
>>> >         Agreed, although that changes current behavior. What about the
>>> >         other cases?
>>> >
>>> >
>>> >
>>> > Looks like there isn't much interest in the topic, so I'll just go
>>> > ahead with the following choices:
>>> >
>>> > Non-NaN case
>>> >
>>> > 1) Empty array -> ValueError
>>> >
>>> > The current behavior with stats is an accident, i.e., the nan arises
>>> > from 0/0. I like to think that in this case the result is any number,
>>> > rather than not a number, so *the* value is simply not defined. So in
>>> > this case raise a ValueError for empty array.
>>> >
>>> To be honest, I don't mind the current behaviour much: sum([]) = 0 and
>>> len([]) = 0, so it is in a way well defined. At least I am not sure I
>>> would always prefer an error. I am a bit worried that just changing it
>>> might break code out there, such as plotting code where it makes
>>> perfect sense to plot a NaN (i.e. nothing), but if that is the case it
>>> would probably become visible quickly.
>>>
>>> > 2) ddof >= n -> ValueError
>>> >
>>> > If the number of elements, n, is not zero and ddof >= n, raise a
>>> > ValueError for the ddof value.
>>> >
>>> Makes sense to me, especially for ddof > n. Just returning nan in all
>>> cases for backward compatibility would be fine with me too.
>>>
>>
>> Currently, if ddof > n it returns a negative number for the variance; the
>> NaN only comes when ddof == 0 and n == 0, leading to 0/0 (NaN for floats,
>> a zero-division error for integers).
>>
>>
>>>
>>> > NaN case
>>> >
>>> > 1) Empty array -> ValueError
>>> > 2) Empty slice -> NaN
>>> > 3) For slice ddof >= n -> NaN
>>> >
>>> Personally I would somewhat prefer if 1) and 2) at least defaulted
>>> to the same thing. But I don't use the nanfuncs anyway. I was wondering
>>> about adding an option for the user to pick what the fill value is (e.g.
>>> if it is None, maybe the default, -> ValueError). We could also allow this
>>> for normal reductions without an identity, but I am not sure if it is
>>> useful there.
>>>
>>
>> In the NaN case some slices may be empty, others not. My reasoning is
>> that that is going to be data dependent, not operator error, but if the
>> array is empty the writer of the code should deal with that.
>>
>>
> In the case of nanvar and nanstd, it might make more sense to handle ddof
> as
>
> 1) if ddof is >= axis size, raise ValueError
> 2) if ddof is >= number of values after removing NaNs, return NaN
>
> The first would be consistent with the non-nan case, the second accounts
> for the variable nature of data containing NaNs.
>
> Chuck
>
>
>
I think this is a good idea in that it follows naturally from the
conventions for empty arrays / empty slices with nanmean, etc. Note,
however, that I am not a big fan of having two different behaviors for
what I see as semantically the same thing.
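
For concreteness, here is a minimal 1-D sketch of the two ddof rules quoted
above, as I read them. The name nanvar_1d_sketch is made up for illustration;
this is not NumPy's implementation, only one way the proposal could behave.

```python
import numpy as np


def nanvar_1d_sketch(x, ddof=0):
    """Hypothetical 1-D illustration of the two proposed ddof rules."""
    x = np.asarray(x, dtype=float)
    n_axis = x.size

    # Rule 1: ddof >= axis size is treated as operator error.
    # Note that an empty array (n_axis == 0) also lands here.
    if ddof >= n_axis:
        raise ValueError("ddof must be smaller than the axis size")

    # Drop the NaNs and count what is left.
    valid = x[~np.isnan(x)]
    n_valid = valid.size

    # Rule 2: too few values after removing NaNs is data dependent,
    # so return NaN instead of raising.
    if ddof >= n_valid:
        return np.nan

    # Ordinary variance of the remaining values.
    return np.sum((valid - valid.mean()) ** 2) / (n_valid - ddof)
```

Under this reading an empty array raises, while an all-NaN slice returns NaN,
which is the same empty array / empty slice split discussed for nanmean.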

But, my objections are not strong enough to veto it, and I do think this
proposal is well thought-out.
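
The rule quoted earlier in the thread, nanfunc(x) behaving like
func(x[~isnan(x)]), can be sketched in the same spirit. nan_apply is a
hypothetical helper, not a NumPy function; it only shows why an all-NaN input
gives a sum of 0 while the mean ends up as 0/0.

```python
import numpy as np


def nan_apply(func, x):
    """Hypothetical helper: apply func to the non-NaN elements of x."""
    x = np.asarray(x, dtype=float)
    return func(x[~np.isnan(x)])


all_nan = np.array([np.nan, np.nan])

total = nan_apply(np.sum, all_nan)   # sum over an empty selection
count = nan_apply(np.size, all_nan)  # number of non-NaN elements

# The mean is total / count, i.e. 0.0 / 0, which is NaN for floats.
with np.errstate(invalid="ignore"):
    mean = np.float64(total) / count
```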

Ben Root

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
