On Mon, Jul 15, 2013 at 2:44 PM, <[email protected]> wrote:

> On Mon, Jul 15, 2013 at 4:24 PM, <[email protected]> wrote:
> > On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <[email protected]> wrote:
> >> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
> >> <[email protected]> wrote:
> >>> Let me try to summarize. To begin with, the environment of the nan
> >>> functions is rather special.
> >>>
> >>> 1) if the array is not of inexact type, they punt to the non-nan
> >>> versions.
> >>> 2) if the array is of inexact type, then out and dtype must be inexact
> >>> if specified
> >>>
> >>> The second assumption guarantees that NaN can be used in the return
> >>> values.
> >>
> >> The requirement on the 'out' dtype only exists because currently the
> >> nan functions like to return nan for things like empty arrays, right?
> >> If not for that, it could be relaxed? (it's a rather weird
> >> requirement, since the whole point of these functions is that they
> >> ignore nans, yet they don't always...)
> >>
> >>> sum and nansum
> >>>
> >>> These should be consistent so that empty sums are 0. This should cover
> >>> the empty array case, but will change the behaviour of nansum which
> >>> currently returns NaN if the array isn't empty but the slice is after
> >>> NaN removal.
> >>
> >> I agree that returning 0 is the right behaviour, but we might need a
> >> FutureWarning period.
> >>
> >>> mean and nanmean
> >>>
> >>> In the case of empty arrays, an empty slice, this leads to 0/0. For
> >>> Python this is always a zero division error, for Numpy this raises a
> >>> warning and returns NaN for floats, 0 for integers.
> >>>
> >>> Currently mean returns NaN and raises a RuntimeWarning when 0/0
> >>> occurs. In the special case where dtype=int, the NaN is cast to
> >>> integer.
> >>>
> >>> Option1
> >>> 1) mean raise error on 0/0
> >>> 2) nanmean no warning, return NaN
> >>>
> >>> Option2
> >>> 1) mean raise warning, return NaN (current behavior)
> >>> 2) nanmean no warning, return NaN
> >>>
> >>> Option3
> >>> 1) mean raise warning, return NaN (current behavior)
> >>> 2) nanmean raise warning, return NaN
> >>
> >> I have mixed feelings about the whole np.seterr apparatus, but since
> >> it exists, shouldn't we use it for consistency? I.e., just do whatever
> >> numpy is set up to do with 0/0? (Which I think means, warn and return
> >> NaN by default, but this can be changed.)
> >>
> >>> var, std, nanvar, nanstd
> >>>
> >>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
> >>> 2) If ddof=0, then whatever is the case for mean, nanmean
> >>>
> >>> For nanvar, nanstd it is possible that some slices are good, some bad,
> >>> so
> >>>
> >>> option1
> >>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
> >>>
> >>> option2
> >>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
> >>
> >> I don't really have any intuition for these ddof cases. Just raising
> >> an error on negative effective dof is pretty defensible and might be
> >> the safest -- it's easy to turn an error into something sensible
> >> later if people come up with use cases...
> >
> > related why does reduceat not have empty slices?
> >
> > >>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
> > array([ 6,  4, 11,  7,  7])
> >
> >
> > I'm in favor of returning nans instead of raising exceptions, except
> > if the return type is int and we cannot cast nan to int.
> >
> > If we get functions into numpy that know how to handle nans, then it
> > would be useful to get the nans, so we can work with them
> >
> > Some cases where this might come in handy are when we iterate over
> > slices of an array that define groups or category levels with possible
> > empty groups *)
> >
> > >>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
> > >>> x = np.arange(9)
> > >>> [x[idx==ii].mean() for ii in range(4)]
> > [1.5, 5.0, nan, 7.5]
> >
> > instead of
> > >>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
> > [1.5, 5.0, 7.5]
> >
> > same for var, I wouldn't have to check that the size is larger than
> > the ddof (whatever that is in the specific case)
> >
> > *) groups could be empty because they were defined for a larger
> > dataset or as a union of different datasets
>
> background:
>
> I wrote several robust anova versions a few weeks ago, that were
> essentially list comprehension as above. However, I didn't allow nans
> and didn't check for minimum size.
> Allowing for empty groups to return nan would mainly be a convenience,
> since I need to check the group size only once.
>
> ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
> for tests of correlation ddof=2 IIRC
> so we would need to check for the corresponding minimum size that n-ddof>0
>
> "negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)
> which is always non-negative but might result in a zero-division
> error. :)
>
> I don't think making anything conditional on ddof>0 is useful.
>

So how would you want it?

To summarize the problem areas:

1) What is the sum of an empty slice? NaN or 0?
2) What is the mean of an empty slice? NaN, NaN and warn, or error?
3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?

I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the
warning can be turned into an error by the user. The errstate context
manager would be good for that.

Chuck
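
For illustration, here is a minimal sketch of how the errstate context
manager lets the caller choose between "NaN" and "error" for the 0/0 case.
The helper slice_mean is hypothetical and is not how numpy implements mean;
it only assumes that the division is done with numpy floats, so that the
floating-point error state applies to it.

```python
import numpy as np

def slice_mean(values):
    """Hypothetical helper: mean as sum/count, so 0/0 follows the error state."""
    values = np.asarray(values, dtype=float)
    total = values.sum()               # the sum of an empty slice is 0.0
    count = np.float64(values.size)    # 0.0 for an empty slice
    return total / count               # 0.0/0.0 is an 'invalid' operation

empty = np.array([], dtype=float)

# Caller wants NaN without any warning.
with np.errstate(invalid="ignore"):
    quiet = slice_mean(empty)

# Caller wants a hard error instead.
try:
    with np.errstate(invalid="raise"):
        slice_mean(empty)
    raised = False
except FloatingPointError:
    raised = True

print(quiet, raised)
```

With the default error state the same call would warn and return NaN, which
is the "NaN and warn" behaviour discussed above.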
