On Sat, Dec 28, 2019 at 10:16:28PM -0800, Christopher Barker wrote:

> Richard: I am honestly confused about what you think we should do. Sure,
> you can justify why the statistics module doesn’t currently handle NaNs
> well, but that doesn’t address the question of what it should do.
> 
> As far as I can tell, the only reasons for the current approach are ease of
> implementation and performance. Which are fine reasons, and why it was done
> that way in the first place.

Actually, the reason I didn't specify the behaviour with NANs or make 
any guarantees one way or another was that I wasn't sure what behaviour, 
or behaviours, would be desirable. I didn't want to lock in one 
behaviour and get it wrong, or impose my own preference without some 
real-world usage.

(In the case of mode, I did get it wrong: raising an exception in the 
case of multi-modal data turned out to be annoying and less useful than 
I hoped. Raymond Hettinger convinced me to change the behaviour, based 
on real-world feedback and use-cases.)


> But there seems to be (mostly) a consensus that it would be good to better
> handle NaNs in the statistics module.
> 
> I think the thing to do is decide what we want NaNs to mean: should they be
> interpreted as missing values or, essentially, errors.

Missing values aren't errors :-)

I haven't finished reading the entire thread yet, but I don't think 
we're going to reach a consensus on the One Correct Thing to do with 
NANs.

(1) Some people like the fact that NANs propagate through their 
calculations without halting computation; after all, that's why they 
were invented in the first place.

(2) Some people prefer an immediate failure (that's why signalling NANs 
were invented, but I think it was William Kahan who described signalling 
NANs as a "compromise" that nobody uses in practice). Exceptions in 
Python are easier to handle than signals in a low-level language like 
Fortran or C, which makes this option more practical.

(3) Some people are dealing with datasets that use NANs as missing 
values. This is not "pure" (missing values weren't a motivating use-case 
for NANs), and arguably it's not "best practice" (a NAN could 
conceivably creep into your data as a calculation artifact, in which 
case you might not want to ignore that specific NAN), but it seems to 
work well enough in practice that this is very common.

(4) Some people may not want to pay the cost of handling NANs inside the 
statistics module, since they've already ensured there are no NANs in 
their data.

So I think we want to give the caller the option to choose the behaviour 
that works for them and their data set.

Proposal: 

Add a keyword-only parameter to the statistics functions to choose a 
policy. The policy could be:

    RETURN
    RAISE
    IGNORE
    NONE (or just None?)

I hope the meanings are obvious, but in case they aren't:

(1) RETURN means to return a NAN if the data contains any NANs; this is 
"NAN poisoning".

(2) RAISE means to raise an exception if the data contains any NANs.

(3) IGNORE means to ignore any NANs and skip over them, as if they 
didn't exist in the data set.

(4) NONE means no policy is in place, and the behaviour with NANs is 
implementation dependent.

In practice, you would choose NONE only if you were sure there were no 
NANs in your data and you wanted the extra speed of skipping the NAN 
checks.

I've been playing with some implementations, and option (3) could 
conceivably be relegated to a mere recipe:

    from math import isnan
    statistics.mean(x for x in data if not isnan(x))

but options (1) and (2) are not so easy for the caller. In practice, the 
statistics functions have to handle those cases themselves. At which 
point, once they are handling cases (1) and (2), it is no extra work to 
also handle (3) and take the burden off the caller.
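
To make the options concrete, here is a minimal sketch of one possible 
implementation. NanPolicy and the wrapper around statistics.mean() are 
hypothetical, not a settled API, and a real implementation would also 
need to recognise Decimal NANs:

    import enum
    import math
    import statistics

    class NanPolicy(enum.Enum):
        RETURN = 1   # NAN poisoning: the result is a NAN
        RAISE = 2    # fail fast with an exception
        IGNORE = 3   # skip NANs as if they weren't in the data
        NONE = 4     # no checking at all

    def mean(data, *, nan_policy=NanPolicy.RAISE):
        if nan_policy is NanPolicy.NONE:
            # Caller promises there are no NANs; skip all checks.
            return statistics.mean(data)
        cleaned = []
        for x in data:
            # Note: this only catches float NANs, not Decimal NANs.
            if isinstance(x, float) and math.isnan(x):
                if nan_policy is NanPolicy.RAISE:
                    raise ValueError("NAN in data")
                if nan_policy is NanPolicy.RETURN:
                    return math.nan  # poison the result
                continue  # IGNORE: skip this value
            cleaned.append(x)
        return statistics.mean(cleaned)

With this sketch, mean([1.0, float("nan"), 3.0], 
nan_policy=NanPolicy.IGNORE) returns 2.0, while the default RAISE 
policy fails on the same data.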


The fine print:

- In the above, whenever I use the term "NAN", I mean a quiet NAN.

- Signalling NANs should always raise immediately regardless of the 
policy. That's the intended semantics of sNANs.

- In practice, there are no platform-independent or reliable ways to get 
float sNANs in Python, and they never come up except in contrived 
examples. I don't propose to officially support float sNANs until Python 
makes some reliable guarantees for them. If you manage to somehow put a 
float sNAN in your data, you get whatever happens to happen.

- Decimal sNANs are another story, and should do the right thing.
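
For context, the decimal module's default context already traps 
InvalidOperation, so plain arithmetic on a Decimal sNAN fails 
immediately:

    from decimal import Decimal, InvalidOperation

    x = Decimal("sNaN")  # a signalling NAN
    try:
        x + 1  # any arithmetic on an sNAN signals InvalidOperation
    except InvalidOperation:
        print("signalling NAN raised, as intended")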


Question: what should this parameter be called? What should its default 
behaviour be?

    nan_policy=RAISE


> You’ve made a good case that None is the “right” thing to use for missing
> values — and could be used with int and other types. So yes, if the
> statistics module were to grow support for missing values, that could be
> the way to do it.

Regardless of whether None is a better "missing value" than NAN, some 
data sets will have already used NANs as missing values. For example, 
this is quite common in both R and pandas.

pandas also uses None as a missing value, and R has a special NA 
constant. The problem with None is that, compared to NANs, it is too 
easy for None to accidentally creep into your data. (Python functions 
return None by default, so it is easy to return None by accident. It 
is hard to return a NAN by accident, since so few things in Python 
return a NAN.)
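
For instance, here is a hypothetical helper where one branch falls off 
the end of the function and silently returns None:

    def scaled(value, factor):
        # If factor is falsey, we fall off the end of the function
        # and implicitly return None, an accidental missing value.
        if factor:
            return value * factor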

With "None is a missing value" semantics, there are only two options:

- ignore None (it is a missing value, so it should be skipped)
- don't ignore None, and raise TypeError if you get one

The first option is easy for the caller to do using a simple recipe:

    statistics.mean(x for x in data if x is not None)

which is explicit, easy to understand and easy to remember. We get the 
second option for free, by the nature of None: you can't sort mixed 
numeric+None data, and you can't do arithmetic on None.
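
For example, with the statistics module as it stands today (the exact 
messages vary across versions, but both calls raise TypeError):

    import statistics

    data = [1, None, 3]

    try:
        statistics.mean(data)      # arithmetic on None fails
    except TypeError as e:
        print("mean:", e)

    try:
        statistics.median(data)    # sorting numbers mixed with None fails
    except TypeError as e:
        print("median:", e)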

So I don't think that the statistics functions need an additional 
parameter to ignore None. (Unlike NAN policies.)


> Frankly, I’m also confused as to why folks seem to think this is an issue
> to be addressed in the sort() functions — those are way too general and low
> level to be expected to solve this. And it would be a much heavier lift to
> make a change that central to Python anyway.

Indeed! Sorting is the wrong place to solve this. Only median() cares 
about sorting, and even that is an implementation detail.


-- 
Steven