Would these policies be named as strings or with an enum? Following Pandas, we'd probably support both. I won't bikeshed the names, but they seem to cover the desired behaviors.
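To make the "support both" idea concrete, here is a rough sketch of how a function could accept either spelling. The names NanPolicy and coerce_policy are made up for illustration and are not part of the proposal or of the statistics module; the lowercase string values are likewise my assumption.

```python
from enum import Enum


class NanPolicy(Enum):
    """Hypothetical enum; member names mirror the policies in the proposal."""
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"


def coerce_policy(policy):
    """Accept a NanPolicy member or its name as a string (case-insensitive)."""
    if isinstance(policy, NanPolicy):
        return policy
    if isinstance(policy, str):
        try:
            # Look the member up by name, so "ignore" and "IGNORE" both work.
            return NanPolicy[policy.upper()]
        except KeyError:
            raise ValueError(f"unknown nan_policy: {policy!r}") from None
    raise TypeError("nan_policy must be a str or a NanPolicy member")
```

With something like this, callers could write nan_policy="ignore" or nan_policy=NanPolicy.IGNORE interchangeably, and a typo in the string would fail loudly rather than silently fall back to the default.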
On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano <st...@pearwood.info wrote:

> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs. Worse, the result depends
> on the initial placement of the NAN:
>
>     py> from statistics import median
>     py> NAN = float('nan')
>     py> median([NAN, 1, 2, 3, 4])
>     2
>     py> median([1, 2, 3, 4, NAN])
>     3
>
> See the bug report for more detail:
>
> https://bugs.python.org/issue33084
>
>
> The caller can always filter NANs out of their own data, but following
> the lead of some other stats packages, I propose a standard way for the
> statistics module to do so. I hope this will be uncontroversial (he
> says, optimistically...) but just in case, here is some prior art:
>
> (1) Nearly all R stats functions take a "na.rm" argument which defaults
> to False; if True, NA and NAN values will be stripped.
>
> (2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
> which specifies what to do if a NAN is seen in the data.
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
>
> (3) At least some Matlab functions, such as mean(), take an optional
> flag that determines whether to ignore NANs or include them.
>
> https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
>
>
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
>     IGNORE:  quietly ignore all NANs
>     FAIL:    raise an exception if any NAN is seen in the data
>     PASS:    pass NANs through unchanged (the default)
>     RETURN:  return a NAN if any NAN is seen in the data
>     WARN:    ignore all NANs but raise a warning if one is seen
>
> PASS is equivalent to saying that you, the caller, have taken full
> responsibility for filtering out NANs and there's no need for the
> function to slow down processing by doing so again. Either that, or you
> want the current implementation-dependent behaviour.
>
> FAIL is equivalent to treating all NANs as "signalling NANs". The
> presence of a NAN is an error.
>
> RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
> calculation causes it to return a NAN, allowing NANs to propagate
> through multiple calculations.
>
> IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
> warning.
>
> Questions:
>
> - does anyone have any serious objections to this?
>
> - what do you think of the names for the policies?
>
> - are there any additional policies that you would like to see?
>   (if so, please give use-cases)
>
> - are you happy with the default?
>
>
> Bike-shed away!
>
>
> --
> Steve
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
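For anyone who wants to play with the semantics, the five policies quoted above could be sketched as a thin wrapper around statistics.median. This is only my reading of the proposal, not an implementation of it: the name median_with_policy, the lowercase string spellings, ValueError for FAIL and RuntimeWarning for WARN are all assumptions on my part.

```python
import math
import statistics
import warnings


def median_with_policy(data, *, nan_policy="pass"):
    """Hypothetical wrapper illustrating the proposed policies."""
    data = list(data)
    if nan_policy == "pass":
        # Caller takes responsibility; no scan for NANs at all.
        return statistics.median(data)
    if nan_policy not in {"ignore", "fail", "return", "warn"}:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")

    if not any(math.isnan(x) for x in data):
        return statistics.median(data)

    if nan_policy == "fail":
        # Exception type is an assumption; the proposal only says "an exception".
        raise ValueError("NAN found in data")
    if nan_policy == "return":
        # NAN poisoning: any NAN in the input makes the result a NAN.
        return math.nan
    if nan_policy == "warn":
        # Warning category is an assumption.
        warnings.warn("NANs in data were ignored", RuntimeWarning, stacklevel=2)
    # Both "ignore" and "warn" drop the NANs before computing the median.
    return statistics.median([x for x in data if not math.isnan(x)])
```

Note that with "ignore" or "warn", data consisting only of NANs ends up as an empty list, so statistics.median raises StatisticsError. Whether that is the right outcome is one more thing the proposal would need to decide.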