On Sat, Dec 28, 2019 at 10:16:28PM -0800, Christopher Barker wrote:

> Richard: I am honestly confused about what you think we should do. Sure,
> you can justify why the statistics module doesn’t currently handle NaN’s
> well, but that doesn’t address the question of what it should do.
>
> As far as I can tell, the only reasons for the current approach is ease of
> implementation and performance. Which are fine reasons, and why it was
> done that way in the first place.
Actually, the reason I didn't specify the behaviour with NANs, or make any
guarantees one way or another, was that I wasn't sure what behaviour, or
behaviours, would be desirable. I didn't want to lock in one behaviour and
get it wrong, or impose my own preference without some real-world usage.

(In the case of mode, I did get it wrong: raising an exception in the case
of multi-modal data turned out to be annoying and less useful than I hoped.
Raymond Hettinger convinced me to change the behaviour, based on real-world
feedback and use-cases.)

> But there seems to be (mostly) a consensus that it would be good to better
> handle NaNs in the statistics module.
>
> I think the thing to do is decide what we want NaNs to mean: should they be
> interpreting as missing values or, essentially, errors.

Missing values aren't errors :-)

I haven't finished reading the entire thread yet, but I don't think we're
going to reach a consensus as to what the One Correct Thing to do with
NANs.

(1) Some people like the fact that NANs propagate through their
calculations without halting computation; after all, that's why they were
invented in the first place.

(2) Some people prefer an immediate failure. (That's why signalling NANs
were invented, but I think it was William Kahan who described signalling
NANs as a "compromise" that nobody uses in practice.) Exceptions in Python
are easier to handle than signals in a low-level language like Fortran or
C, which makes this option more practical.

(3) Some people are dealing with datasets that use NANs as missing values.
This is not "pure" (missing values weren't a motivating use-case for NANs),
and arguably it's not "best practice" (a NAN could conceivably creep into
your data as a calculation artifact, in which case you might not want to
ignore that specific NAN), but it seems to work well enough in practice
that this is very common.
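To make options (1) and (2) concrete, a small demonstration: quiet float
NANs propagate silently, while Decimal's signalling NANs (the closest thing
to a reliable sNAN in Python today) raise immediately under the default
context:

```python
import math
from decimal import Decimal, InvalidOperation

# (1) Quiet NANs propagate: the NAN "poisons" every result it touches.
nan = float("nan")
result = (nan + 1) * 2
print(math.isnan(result))  # True

# (2) Signalling NANs fail fast: any arithmetic on a Decimal sNAN raises
# InvalidOperation under the default decimal context.
try:
    Decimal("sNaN") + 1
except InvalidOperation:
    print("signalling NAN raised immediately")
```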
(4) Some people may not want to pay the cost of handling NANs inside the
statistics module, since they've already ensured there are no NANs in their
data.

So I think we want to give the caller the option to choose the behaviour
that works for them and their data set.

Proposal: add a keyword-only parameter to the statistics functions to
choose a policy. The policy could be:

    RETURN
    RAISE
    IGNORE
    NONE (or just None?)

I hope the meanings are obvious, but in case they aren't:

(1) RETURN means to return a NAN if the data contains any NANs; this is
"NAN poisoning".

(2) RAISE means to raise an exception if the data contains any NANs.

(3) IGNORE means to ignore any NANs and skip over them, as if they didn't
exist in the data set.

(4) NONE means no policy is in place, and the behaviour with NANs is
implementation dependent. In practice, you should choose NONE only if you
are sure you don't have any NANs in your data and want the extra speed of
skipping any NAN checks.

I've been playing with some implementations, and option (3) could
conceivably be relegated to a mere recipe:

    statistics.mean(x for x in data if not isnan(x))

but options (1) and (2) are not so easy for the caller. In practice, the
statistics functions have to handle those cases themselves. At which point,
once they are handling cases (1) and (2), it is no extra work to also
handle (3) and take the burden off the caller.

The fine print:

- In the above, whenever I used the term "NAN", I mean a quiet NAN.

- Signalling NANs should always raise immediately regardless of the policy.
That's the intended semantics of sNANs.

- In practice, there are no platform-independent or reliable ways to get
float sNANs in Python, and they never come up except in contrived examples.
I don't propose to officially support float sNANs until Python makes some
reliable guarantees for them. If you manage to somehow put a float sNAN in
your data, you get whatever happens to happen.
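For concreteness, here is a rough sketch of how the four policies might
look, wrapping the existing statistics.mean. All of the names here
(NanPolicy, nan_policy, the mean wrapper) are hypothetical, made up for
illustration; this is not a proposed implementation:

```python
import math
import statistics
from enum import Enum

class NanPolicy(Enum):
    RETURN = "return"   # NAN poisoning: return nan if any NAN is present
    RAISE = "raise"     # raise an exception on the first NAN
    IGNORE = "ignore"   # skip NANs as if they weren't in the data
    NONE = "none"       # no checks: fastest, behaviour unspecified

def _is_nan(x):
    return isinstance(x, float) and math.isnan(x)

def mean(data, *, nan_policy=NanPolicy.NONE):
    data = list(data)
    if nan_policy is not NanPolicy.NONE and any(_is_nan(x) for x in data):
        if nan_policy is NanPolicy.RETURN:
            return float("nan")
        if nan_policy is NanPolicy.RAISE:
            raise ValueError("NAN in data")
        # NanPolicy.IGNORE: drop the NANs and carry on.
        data = [x for x in data if not _is_nan(x)]
    return statistics.mean(data)

print(mean([1.0, float("nan"), 3.0], nan_policy=NanPolicy.IGNORE))  # 2.0
```

With NanPolicy.NONE there is no scan over the data at all, which is the
whole point of policy (4).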
One more bit of fine print: Decimal sNANs are another story, and should do
the right thing.

Question: what should this parameter be called? What should its default
behaviour be?

    nan_policy=RAISE

> You’ve made a good case that None is the “right” thing to use for missing
> values — and could be used with int and other types. So yes, if the
> statistics module were to grow support for missing values, that could be
> the way to do it.

Regardless of whether None is a better "missing value" than NAN, some data
sets will have already used NANs as missing values. E.g. it is quite common
in both R and pandas. pandas also uses None as a missing value, and R has a
special NA constant.

The problem with None is that, compared to NANs, it is too easy for None to
accidentally creep into your data. (Python functions return None by
default, so it is easy to accidentally return None without intending to. It
is hard to accidentally return NAN without intending to, since so few
things in Python return a NAN.)

With "None is a missing value" semantics, there are only two options:

- ignore None (it is a missing value, so it should be skipped);

- don't ignore None, and raise TypeError if you get one.

The first option is easy for the caller to do using a simple recipe:

    statistics.mean(x for x in data if x is not None)

which is explicit, easy to understand and easy to remember. We get the
second option for free, by the nature of None: you can't sort mixed
numeric+None data, and you can't do arithmetic on None. So I don't think
that the statistics functions need an additional parameter to ignore None.
(Unlike NAN policies.)

> Frankly, I’m also confused as to why folks seem to think this is an issue
> to be addressed in the sort() functions — those are way too general and
> low level to be expected to solve this. And it would be a much heavier
> lift to make a change that central to Python anyway.

Indeed! Solving this in sort is the wrong solution.
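To make the sort problem concrete: every comparison with a NAN returns
False, so sorting simply leaves NANs wherever the comparisons happen to put
them. The positions shown in the comments are what I observe on current
CPython; in principle they are implementation-dependent, which is exactly
the problem:

```python
import math
import statistics

nan = float("nan")

# NANs compare False against everything, so sort cannot move them
# to any meaningful position.
print(sorted([nan, 1, 3]))  # [nan, 1, 3] -- NAN left at the front
print(sorted([1, nan, 3]))  # [1, nan, 3] -- NAN left in the middle

# Consequence: median() of the same three values picks a different
# "middle" element depending on the input order.
print(statistics.median([nan, 1, 3]))              # 1
print(math.isnan(statistics.median([1, nan, 3])))  # True
```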
Only median() cares about sort, and even that is an implementation detail.

-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/J7ED4564W3PJAXVX5B24UEGNQGH6HQ64/
Code of Conduct: http://python.org/psf/codeofconduct/