As I was saying, the issue is that statistics.median can deal with many
types, and having it special-case NaN would be awkward. The user could
just as well have passed something like None values (though that does at
least raise an error).
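To make the contrast concrete, here is a small sketch of the two cases.
The sample values are made up for illustration; the only library call
used is statistics.median itself.

```python
import statistics

nan = float("nan")

# NaN is accepted silently: every ordering comparison with it is False,
# so sorted() never raises and median() returns whatever value ends up
# in the middle position.
result_with_nan = statistics.median([1.0, nan, 3.0])

# None cannot be ordered against a number at all, so the sort inside
# median() raises TypeError instead of returning a questionable answer.
try:
    statistics.median([1.0, None, 3.0])
    none_error = None
except TypeError as exc:
    none_error = exc

print(result_with_nan)
print(type(none_error).__name__)
```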
Perhaps the place to do the test would be the built-in function sorted,
which median uses, since all the arguments about median confusing casual
users also apply to sorted and to most other operations that use sorted
internally. sorted also naturally has to examine all the elements, while
median doesn't; it just needs to get the sorted list and look at the
middle value(s).
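A simplified sketch of that structure follows. simple_median is a
hypothetical stand-in written for this message, not the statistics
module's actual source; it only shows that the function trusts whatever
order sorted hands back.

```python
def simple_median(data):
    """Simplified median: sort, then look only at the middle value(s)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


nan = float("nan")

# Same values, different input order. Since no ordering comparison with
# NaN is ever True, sorted() may leave the NaN wherever it started, so
# these two results need not agree.
print(simple_median([nan, 1.0, 2.0]))
print(simple_median([1.0, 2.0, nan]))
```

Nothing in simple_median ever inspects the elements themselves, which is
why a check placed there would be an extra pass rather than a free one.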
Most of the problem comes from the fact that IEEE floating point has
strict requirements on how NaN values are handled, and those aren't very
intuitive to the casual user. The possible answers are basically the
following:
1) Not be IEEE compliant, which would bring complaints from everyone who
wants to do serious work with math.
2) Not consider Floats to be sortable, which seems to be a bit of
throwing the baby out with the bath water.
3) Add tests to places like sorted (slowing things down some) to catch
this case, and deal with it (but you then need to decide HOW you want to
deal with it).
4) Ignore the problem, and let NaNs generate some weird results.
Python seems to have generally chosen 4, and some more advanced packages
have chosen 3. The question is whether the cost of having sorted test
every value for not being properly comparable, which affects every sort
done, provides enough benefit to be worth it.
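As a rough sketch of what option 3 could look like, assuming a
hypothetical wrapper named checked_sorted with an on_nan policy argument
(neither exists in Python; both are invented here to show the extra pass
and the decision it forces):

```python
def checked_sorted(values, on_nan="raise"):
    """Hypothetical wrapper around sorted() illustrating option 3.

    on_nan="raise": reject data containing values that are not orderable.
    on_nan="drop":  discard such values and sort the rest.
    """
    if on_nan not in ("raise", "drop"):
        raise ValueError("on_nan must be 'raise' or 'drop'")
    items = list(values)
    # A value that is not equal to itself cannot be ordered consistently.
    # This catches float NaN without special-casing the float type, but it
    # costs one extra pass over the data on every call.
    unordered = [x for x in items if x != x]
    if unordered and on_nan == "raise":
        raise ValueError(f"{len(unordered)} value(s) are not orderable")
    return sorted(x for x in items if x == x)


nan = float("nan")
print(checked_sorted([3.0, nan, 1.0], on_nan="drop"))
```

Whether to raise or drop is exactly the "HOW" in option 3; the sketch
just makes the caller pick.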
Note that NaN values are somewhat rare in most programs. I think they
can only come about by explicitly requesting them (like float("nan"))
or perhaps through some of the more advanced math packages, and users of
those should probably understand NaN values (Python raises errors for
most simple math that might otherwise generate a NaN, like 0/0 or
sqrt(-1)). That means that if they don't know what a NaN is, they
probably won't be dealing with them.
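The cases mentioned above can be checked directly. exception_name is a
small helper made up for this illustration; sqrt here is taken to mean
math.sqrt.

```python
import math


def exception_name(func):
    """Return the name of the exception func() raises, or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Explicitly requesting a NaN works and raises nothing.
explicit = float("nan")
print(math.isnan(explicit))

# Simple math that would produce a NaN under plain IEEE rules raises
# an exception in Python instead.
print(exception_name(lambda: 0 / 0))
print(exception_name(lambda: math.sqrt(-1)))
```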
On 12/26/19 12:42 PM, David Mertz wrote:
Well, *I* know the implementation. And I know about NaN being neither
less than nor greater than anything else (even itself). And I know the
basic working of Timsort.
But a lot of other folks, especially beginners or casual users, don't
know all that. They do know that fractional numbers are a thing one is
likely to want a median of (whether or not they know IEEE-754
intimately). And they may or may not know that not-a-number is a
float, but it's not that hard to arrive at by a computation.
Even if documentation vaguely hints at the behavior, it's a source of
likely surprise. The fact that the median might be the very largest of
a bunch of numbers (with at least one NaN in the collection) is surely
not desired behavior, even if explainable.
Or e.g. two sets that compare as equal can have different medians
according to the statistics module. I can construct that example if
needed.
On Thu, Dec 26, 2019, 12:27 PM Richard Damon
Note that the statistics module documentation implies the issue,
as median implies that it requires the sequence to be orderable,
and nan isn't orderable. Since the statistics module seems to be
designed to handle types other than floats, detecting nan values
is extra expensive, so I think it can be excused for not checking.
--
Richard Damon