As I was saying, the issue is that statistics.median can handle many types, and special-casing it for NaN would be awkward. The user could equally have passed in something like None values (although that does raise an error).
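To make the contrast concrete, here is a minimal sketch of how None and NaN behave differently, assuming CPython's statistics.median (which sorts its input). The variable names are only illustrative.

```python
import math
import statistics

nan = float("nan")

# None cannot be ordered against a float, so the sort inside median raises.
try:
    statistics.median([1.0, None, 3.0])
    none_result = "no error"
except TypeError:
    none_result = "TypeError"

# NaN compares False against everything, so nothing is raised.
# No element here is "less than" its predecessor, so the sort leaves the
# list as it is and the NaN in the middle is returned as the median.
nan_result = statistics.median([1.0, nan, 3.0])

print(none_result)             # TypeError
print(math.isnan(nan_result))  # True
```

So the None case fails loudly, while the NaN case quietly returns a result that depends on where the NaN happened to be.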

Perhaps the place to do the test would be the built-in function sorted, which median uses, since all the arguments about confusing casual users with median also apply to sorted and to most other operations that use sorted internally. sorted also naturally has to examine all the elements, while median doesn't; it just needs to get the sorted list and look at the middle value(s).
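As a rough illustration of that division of labor, here is a simplified median. The name simple_median is hypothetical and this is a sketch of the idea, not the actual statistics source. All the comparisons happen inside sorted; the median step only indexes into the result.

```python
import math


def simple_median(data):
    """Simplified median: sort, then pick the middle value(s)."""
    ordered = sorted(data)  # every comparison happens in here
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


nan = float("nan")

# The same three values in three orders. In each list no element is
# "less than" the one before it, so sorted() returns it unchanged.
print(simple_median([nan, 1.0, 2.0]))  # 1.0
print(simple_median([1.0, nan, 2.0]))  # nan
print(simple_median([1.0, 2.0, nan]))  # 2.0
```

The same surprise shows up for anything else built on sorted, which is why sorted seems the more natural place for a check, if one is wanted at all.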

The biggest part of the problem comes from the fact that IEEE floating point has strong requirements on how NaN values are handled, and those aren't intuitive to the casual user. The options are basically the following:
1) Not be IEEE compliant, which would bring complaints from everyone who wants to do serious work with math.
2) Not consider floats to be sortable, which seems a bit like throwing the baby out with the bath water.
3) Add tests to places like sorted (slowing things down some) to catch this case and deal with it (but you then need to decide HOW you want to deal with it).
4) Ignore the problem, and let NaNs generate some weird results.
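Option 3 could look something like the sketch below. The helper checked_sorted and its nan_policy parameter are hypothetical names, not anything in the standard library. It only detects float NaN, which already shows the "decide HOW" problem: raise, drop the NaNs, or park them at the end.

```python
import math


def _is_float_nan(value):
    # Only plain float NaN is detected; other types pass through unchecked.
    return isinstance(value, float) and math.isnan(value)


def checked_sorted(values, nan_policy="raise"):
    """Sort values, handling float NaN according to nan_policy (hypothetical)."""
    items = list(values)
    nans = [v for v in items if _is_float_nan(v)]
    if not nans:
        return sorted(items)
    if nan_policy == "raise":
        raise ValueError("cannot sort: input contains NaN")
    clean = sorted(v for v in items if not _is_float_nan(v))
    if nan_policy == "omit":
        return clean
    if nan_policy == "last":
        return clean + nans
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")


nan = float("nan")
print(checked_sorted([3.0, nan, 1.0], nan_policy="omit"))  # [1.0, 3.0]
print(checked_sorted([3.0, nan, 1.0], nan_policy="last"))  # [1.0, 3.0, nan]
```

Each policy is defensible for some users and wrong for others, which is part of why there is no obvious default.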

Python seems to have generally chosen 4, and in some more advanced packages 3. The question is whether making sorted test every value for not being properly comparable, a cost that would affect every sort done, provides enough benefit to be worth it.
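One way to get a feel for that cost is to time a plain sort against a sort preceded by a NaN scan. This is only a rough sketch: the function names are made up, the scan is a pure-Python loop (a check built into sorted itself would presumably be cheaper), and the numbers will vary by machine, so none are quoted here.

```python
import math
import random
import timeit

random.seed(0)
data = [random.random() for _ in range(10_000)]


def plain_sort():
    return sorted(data)


def sort_with_nan_scan():
    # Extra linear pass over the data before sorting; floats only.
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        raise ValueError("input contains NaN")
    return sorted(data)


plain_time = timeit.timeit(plain_sort, number=20)
checked_time = timeit.timeit(sort_with_nan_scan, number=20)
print(f"plain: {plain_time:.4f}s  with scan: {checked_time:.4f}s")
```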

Note that NaN values are somewhat rare in most programs. I think they mostly come about by explicitly requesting them (like float("nan")), through arithmetic on infinities (like inf - inf), or perhaps through some of the more advanced math packages, and users of those should probably understand NaN values. (Python raises errors for most simple math that might otherwise generate a NaN, like 0/0 or math.sqrt(-1).) That means that people who don't know what a NaN is probably won't be dealing with them.
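A quick check of those cases, as a small sketch (the outcome helper is just for illustration). The simple operations raise, the explicit request gives a NaN, and arithmetic on infinities is one case where a NaN appears without any error.

```python
import math


def outcome(func):
    """Return the result of func(), or the name of the exception it raised."""
    try:
        return func()
    except (ZeroDivisionError, ValueError) as exc:
        return type(exc).__name__


zero_div = outcome(lambda: 0 / 0)          # raises instead of giving NaN
neg_sqrt = outcome(lambda: math.sqrt(-1))  # raises instead of giving NaN
explicit = outcome(lambda: float("nan"))   # explicitly requested NaN

inf = float("inf")
inf_minus_inf = outcome(lambda: inf - inf)  # NaN with no error raised

print(zero_div)       # ZeroDivisionError
print(neg_sqrt)       # ValueError
print(explicit)       # nan
print(inf_minus_inf)  # nan
```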

On 12/26/19 12:42 PM, David Mertz wrote:
Well, *I* know the implementation. And I know about NaN being neither less than nor greater than anything else (even itself). And I know the basic workings of Timsort.

But a lot of other folks, especially beginners or casual users, don't know all that. They do know that fractional numbers are a thing one is likely to want a median of (whether or not they know IEEE-754 intimately). And they may or may not know that not-a-number is a float, but it's not that hard to arrive at by a computation.

Even if documentation vaguely hints at the behavior, it's a source of likely surprise. The fact that the median might be the very largest of a bunch of numbers (with at least one NaN in the collection) is surely not desired behavior, even if explainable.

Or e.g. two sets that compare as equal can have different medians according to the statistics module. I can construct that example if needed.

On Thu, Dec 26, 2019, 12:27 PM Richard Damon wrote:

    Note that the statistics module documentation implies the issue,
    as median implies that it requires the sequence to be orderable,
    and nan isn't orderable. Since the statistics module seems to be
    designed to handle types other than floats, detecting nan values
    is extra expensive, so I think it can be excused for not checking.

--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/RGRMC34CGLYVNFBQWN4BJOVI3JT4KCCA/