[Python-ideas] Re: Fix statistics.median()?

Richard Damon Thu, 26 Dec 2019 11:01:19 -0800

As I was saying, the issue is that statistics.median can deal with many types and to have it special case for nan would be awkward. The user could also have done something like use None values (but this does give an error).
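To make the contrast concrete, here is a minimal sketch (the sample values are made up for illustration). A None in the data makes the sort inside median fail loudly, while a NaN passes through without any error:

```python
import statistics

nan = float("nan")

# None cannot be ordered against numbers, so the sort inside median() raises.
try:
    statistics.median([1, None, 3])
    none_raised = False
except TypeError:
    none_raised = True

# NaN compares False against everything, so no exception is raised here;
# the value returned depends on where the NaN happens to sit in the data.
nan_result = statistics.median([1, nan, 3])

print(none_raised)  # True
print(nan_result)
```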

Perhaps the place to do the test would be the built-in function sorted, which median uses, as all the arguments about confusing casual users with median also apply to the use of sorted, or to most other operations that use sorted internally. sorted also naturally needs to examine all the elements, while median doesn't; it just needs to get the sorted list and look at the middle value(s).

The biggest part of the error comes from the fact that IEEE floating point has a strong requirement on how to handle the NaN value, and that isn't so intuitive to the casual user. The answers basically are the following:

1) Not be IEEE compliant, which would bring complaints from everyone who wants to do serious work with math.
2) Not consider floats to be sortable, which seems to be a bit of throwing the baby out with the bath water.
3) Add tests to places like sorted (slowing things down some) to catch this case, and deal with it (but you then need to decide HOW you want to deal with it).
4) Ignore the problem, and let NaNs generate some weird results.
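As a rough sketch of what option 3 could look like in user code: the helper name checked_sorted is hypothetical (it is not part of Python), and raising ValueError is just one of several possible answers to the "HOW" question (dropping the NaNs or sorting them to one end would be others).

```python
def checked_sorted(iterable):
    """Hypothetical helper: behaves like sorted(), but rejects NaN values."""
    data = list(iterable)
    for x in data:
        # A NaN is the only ordinary numeric value that is unequal to itself,
        # so this test works without knowing the element type in advance.
        if x != x:
            raise ValueError("cannot sort data containing NaN")
    # The extra pass above is the cost paid on every sort.
    return sorted(data)


print(checked_sorted([3, 1, 2]))  # [1, 2, 3]

try:
    checked_sorted([3.0, float("nan"), 1.0])
except ValueError as exc:
    print("rejected:", exc)
```

The extra loop touches every element before the sort starts, which is the slowdown mentioned in option 3.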

Python seems to have generally chosen 4, and in some more advanced packages 3. The question is whether adding the cost of having sorted test all the values for not being properly comparable, which affects every sort done, provides enough benefit to be worth it.

Note that NaN values are somewhat rare in most programs. I think they can only come about by explicitly requesting them (like float("nan")) or perhaps with some of the more advanced math packages, and users of those should probably understand NaN values (Python throws errors for most simple math that might otherwise generate a NaN, like 0/0 or sqrt(-1)). That means that if they don't know what a NaN is, they probably won't be dealing with them.
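A short sketch of the cases mentioned here, using only the standard library. The last case is an addition for illustration: arithmetic that overflows to infinity is one route by which plain float code can reach a NaN without asking for one.

```python
import math

# Simple operations raise instead of returning NaN.
try:
    0.0 / 0.0
except ZeroDivisionError as exc:
    div_error = type(exc).__name__

try:
    math.sqrt(-1)
except ValueError as exc:
    sqrt_error = type(exc).__name__

# Explicitly requesting a NaN.
explicit = float("nan")

# Float multiplication overflows to inf without raising,
# and inf - inf is NaN, again without raising.
big = 1e308 * 10
computed = big - big

print(div_error, sqrt_error)  # ZeroDivisionError ValueError
print(math.isnan(explicit), math.isnan(computed))  # True True
```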


On 12/26/19 12:42 PM, David Mertz wrote:

Well, *I* know the implementation. And I know about NaN being neither less than nor greater than anything else (even itself). And I know the basic working of Timsort.
But a lot of other folks, especially beginners or casual users, don't know all that. They do know that fractional numbers are a thing one is likely to want a median of (whether or not they know IEEE-754 intimately). And they may or may not know that not-a-number is a float, but it's not that hard to arrive at by a computation.
Even if documentation vaguely hints at the behavior, it's a source of likely surprise. The fact that the median might be the very largest of a bunch of numbers (with at least one NaN in the collection) is surely not desired behavior, even if explainable.
Or e.g. two sets that compare as equal can have different medians according to the statistics module. I can construct that example if needed.
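A minimal sketch of the surprise described above, using lists rather than sets so that the input order is explicit (the values are made up for illustration, and the results assume CPython's list sort). Both lists hold the same values, and in neither one does any element compare less than its predecessor, so the sort treats each as already ordered:

```python
import statistics

nan = float("nan")

# NaN is neither less than, greater than, nor equal to anything.
comparisons = (nan < 1.0, nan > 1.0, nan == nan)

# Same values, different order.
tail_nans = [1, 2, 5, nan, nan]
head_nans = [nan, nan, 1, 2, 5]

median_tail = statistics.median(tail_nans)
median_head = statistics.median(head_nans)

print(comparisons)               # (False, False, False)
print(median_tail, median_head)  # 5 1
```

With the NaNs at the end, the reported median is the largest of the real numbers; with the NaNs at the front, it is the smallest.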
On Thu, Dec 26, 2019, 12:27 PM Richard Damon

    Note that the statistics module documentation implies the issue,
    as median implies that it requires the sequence to be orderable,
    and nan isn't orderable. Since the statistics module seems to be
    designed to handle types other than floats, detecting nan values
    is extra expensive, so I think it can be excused for not checking.

--
Richard Damon
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/RGRMC34CGLYVNFBQWN4BJOVI3JT4KCCA/