On Thu, Aug 26, 2021 at 09:36:27AM +0200, Marc-Andre Lemburg wrote:

> Indeed. The NAN handling in median() looks like a bug, more than
> anything else:

[slightly paraphrased]
> >>> l1 = [1,2,nan,4]
> >>> l2 = [nan,1,2,4]
>
> >>> statistics.median(l1)
> nan
> >>> statistics.median(l2)
> 1.5

Looks can be deceiving, it's actually a feature *wink*

That behaviour is actually the direct consequence of NANs being 
unordered. The IEEE-754 standard requires that comparisons with NANs all 
return False (apart from not-equal, which returns True). So NANs are 
neither less than, equal to, or greater than other values.

Which makes sense numerically, NANs do not appear on the number line and 
are not ordered with numbers.

So when you sort a list containing NANs, they end up in some arbitrary 
position that depends on the sort implementation, the other values in 
the list, and their initial position. NANs can even throw out the order 
of other values:

>>> sorted([3, nan, 4, 2, nan, 1])
[3, nan, 1, 2, 4, nan]

and *that* violates `median`'s assumption that sorting values actually 
puts them in sorted order, which is why median returns the wrong value.

I don't think that Timsort is buggy here. I expect that every sort 
algorithm on the planet will require a Total Order to get sensible 
results, and NANs violate that expectation.

https://eli.thegreenplace.net/2018/partial-and-total-orders/

If we define the less than operator `<` as "isn't greater than (or equal 
to)", then we can see that sorted is *locally* correct:

* 3 isn't greater than nan;
* nan isn't greater than 1;
* 1 isn't greater than 2;
* 2 isn't greater than 4;
* and 4 isn't greater than nan.

sorted() has correctly sorted the values in the sense that the invariant 
"a comes before b iff a isn't greater than b" is satisfied between each 
pair of consecutive values, but globally the order is violated because 
NAN's are unordered and mess up transitivity:

    3 isn't greater than NAN, and NAN isn't greater than 1, 
    but it is not true that 3 isn't greater than 1.

In the general case of sorting elements, I think that the solution is 
"don't do that". If you have objects which don't form a total order, 
then you can't expect to get sensible results from sorting them.

In the case of floats, it would be nice to have a totalOrder function as 
specified in the 2008 revision of IEEE-354:

https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf

Then we could sensibly do:

    sorted(floats_or_decimals, key=totalorder)

and at least NANs would end up in a consistent place and everything else 
sorted correctly.


-- 
Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/I637COPLQZH4A5EDZ6YZ3RJ53SEUMGSF/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to