[Python-ideas] Re: Fix statistics.median()?

Richard Damon Fri, 27 Dec 2019 07:27:11 -0800

On 12/26/19 5:23 PM, Andrew Barnert via Python-ideas wrote:

On Dec 26, 2019, at 12:36, Richard Damon <[email protected]> wrote:

On 12/26/19 2:10 PM, Andrew Barnert via Python-ideas wrote:

On Dec 26, 2019, at 10:58, Richard Damon <[email protected]> wrote:
Note that NaN values are somewhat rare in most programs; I think they can only come 
about by explicitly requesting them (like float("nan")) or perhaps with some 
of the more advanced math packages.

You can get them easily just from math itself.


Or, once you can get infinite values, you can easily get nan values with just 
basic arithmetic:

     >>> 1e1000 - 1e1000
     nan

I guess I didn't try hard enough to get a NaN. But once the Newbie has hit 
infinities, NO answer is right.

I don’t think that’s true. Surely the median of (-inf, 1, 2, 3, inf, inf, inf) 
is well defined and can only be 3?

The only case where it’s a problem is when all the values are infinite and 
exactly half of them are positive, in which case the median has to be halfway 
between -inf and inf. But even then, the only reasonable answers are nan or an 
exception.
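Both cases can be checked against statistics.median directly. The following is only a sketch using the standard library; the first list is the tuple quoted above, and the second list is a made-up example of the "all infinite, half positive" case.

```python
import math
import statistics

inf = math.inf

# Mixed finite and infinite values: the middle element after sorting is 3.
mixed = [-inf, 1, 2, 3, inf, inf, inf]
mixed_median = statistics.median(mixed)

# Even count, half -inf and half +inf: the two middle values are averaged,
# and (-inf + inf) / 2 is nan.
balanced = [-inf, -inf, inf, inf]
balanced_median = statistics.median(balanced)

print(mixed_median)      # 3
print(balanced_median)   # nan
```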

But you seem to assume that a program to compute the median is likely the 
only function of that program. But the fact that the programmer has 
overflowed values and then did some math with the overflow values starts 
to lead to all types of strangeness, (and you don't need to even get to 
infinities to get strangeness with math, for many values of x, we can 
have x being equal to x+1).
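The x == x+1 remark can be illustrated with a short sketch, assuming float is an IEEE double (53-bit significand), which is the case on essentially every platform Python runs on. The threshold 2**53 is chosen for illustration; it is not mentioned in the thread.

```python
import sys

# From 2**53 upward, adjacent IEEE doubles are 2 or more apart, so adding 1
# is lost to rounding.
x = 2.0 ** 53
absorbed = (x + 1 == x)

# Just below that threshold, adjacent doubles are 1 apart and adding 1 is exact.
y = 2.0 ** 52
distinct = (y + 1 != y)

print(absorbed, distinct)         # True True
print(sys.float_info.mant_dig)    # 53 on IEEE double platforms
```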

The number could have been 1e1000 - 1e999 (and thus should be big) or 1e999 - 
1e1000 (and thus should be very negative) or 1e1000 - 1e1000 (and thus should 
be zero), which is why we get a NaN here.

Well, here both numbers are clearly 1e1000, and the right answer is 0. The 
problem is that (in systems where float is IEEE double) that number can’t be 
represented as a float in the first place, so Python approximates it with inf, 
so you (inaccurately, but predictably and understandably) get nan instead of 0. 
It’s like a very extreme case of “float rounding error”.
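To see the "extreme rounding error" reading concretely, the same subtraction can be done once with floats and once with a type that can hold 1e1000 exactly. This is a sketch that uses decimal.Decimal purely as a stand-in for exact arithmetic; nothing in the thread proposes using it this way.

```python
import math
from decimal import Decimal

# The value is out of range for an IEEE double, so the conversion yields inf.
as_float = float("1e1000")
float_diff = as_float - as_float      # inf - inf

# Decimal can represent 1e1000 exactly, so the difference is an exact zero.
as_decimal = Decimal("1e1000")
decimal_diff = as_decimal - as_decimal

print(as_float)            # inf
print(float_diff)          # nan
print(decimal_diff == 0)   # True
```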

If you have actual infinite values instead, then nan or an exception is the 
only appropriate answer in the first place, because subtraction is undefined. 
(Assuming you’re taking the floats as an approximate model of the affinely 
extended reals. If you’re taking them as a model of themselves, then it is well 
defined, as nan.)

And that is my point, IF we have decided that we need to protect the 
newbie, then at the point we have converted 1e1000 to inf we have put 
him on the path of problems. Fixing just median is like taking a leaky 
boat and bailing ONE bucket of water out of it.

If you are really worried about a median with values like this confusing 
someone, then we should handle the issue MUCH earlier, maybe even trapping the 
overflow with an error message unless taken out of 'newbie' mode.

This amounts to an argument that in ‘newbie’ mode there should be no inf or nan 
values in float in the first place, and anything that returns one should 
instead raise an OverflowError or MathDomainError. Which is what many 
functions actually do, but I don’t think anyone has tried to divide existing 
functions into ‘newbie’ mode and ‘float programmer mode’ functions, so trying 
to do the same with new higher-level functions on top of them is probably a 
mug’s game. (You can use Decimal with an appropriate context to get that kind 
of behavior, but I don’t think any newbie would know how to even begin doing 
that…)
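For reference, the Decimal approach might look roughly like the sketch below. The context named strict and its precision and exponent limits are assumptions picked to resemble the range of a double; they are not taken from the thread. Note that the exceptions raised are decimal.Overflow and decimal.InvalidOperation.

```python
from decimal import Context, Decimal, InvalidOperation, Overflow, localcontext

# Hypothetical "strict" context: roughly float-like range, with overflow and
# invalid operations raising instead of producing Infinity or NaN.
strict = Context(prec=17, Emax=308, Emin=-324,
                 traps=[Overflow, InvalidOperation])

with localcontext(strict):
    try:
        Decimal("1e308") * 10          # result exceeds Emax
        overflow_raised = False
    except Overflow:
        overflow_raised = True

    try:
        Decimal("Infinity") - Decimal("Infinity")
        invalid_raised = False
    except InvalidOperation:
        invalid_raised = True

print(overflow_raised, invalid_raised)   # True True
```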

Plus, as mentioned at the top, taking a median with some infinite values 
usually makes perfectly good sense that a newbie can understand. It’s not the 
same as taking a median with some nan values, which behaves in a way that only 
makes sense if you think through how sorting works.
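A small experiment shows why the nan case depends on sorting. The sketch below takes the median of the same five made-up values in every input order. The exact set of results is an implementation detail of sorted(), so no specific output is claimed for the second print; on CPython it typically contains more than one value.

```python
import itertools
import math
import statistics

nan = math.nan

# Every ordering comparison against nan is False, so sorted() has no
# consistent place to put it.
comparisons = (nan < 1.0, nan > 1.0, nan == nan)

# Median of the same values in every possible input order.
values = [1.0, 2.0, 3.0, 4.0, nan]
results = [statistics.median(p) for p in itertools.permutations(values)]

# Collapse nan results to a label, since nan != nan would defeat the set.
distinct_results = {("nan" if math.isnan(r) else r) for r in results}

print(comparisons)         # (False, False, False)
print(distinct_results)
```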

As I have been saying, fixing *median* is the wrong spot to fix it, as 
there are many similar traps in the system. If we really want to protect 
the newbie from this sort of error, and not treat it as a teachable 
moment, then we need to make a more fundamental change.

One option is to make floating math by default safer, and require some 
special statement added to the program to enable the extra features. One 
problem is that you can't totally protect from these issues, as long as 
you use floats, the value of numbers will not always be precise and 
round off errors will accumulate, but perhaps making overflow 'noisy' by 
signalling would catch some of the more confusing parts (and Python 
already does some of these, like 0/0 is an error, not a NaN.). The cost 
of this would be a bit of efficiency, as there would need to be some 
test if we are in simple or advanced mode or a check for the overflow at 
each operation, and people who know what they are doing and WANT the 
support for full IEEE mode would need to add something to their program 
(or environment maybe).
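As a rough illustration of what "noisy" overflow could mean, the sketch below contrasts operations that already raise in Python with one that silently produces inf, and then wraps the silent case in a helper. The name checked is hypothetical and exists only for this example; it is not an existing or proposed API, and the per-call test it performs is the efficiency cost mentioned above.

```python
import math

def checked(value: float) -> float:
    """Hypothetical helper: raise instead of letting inf or nan propagate."""
    if math.isinf(value):
        raise OverflowError("float result overflowed to infinity")
    if math.isnan(value):
        raise ValueError("float result is not a number")
    return value

# Already noisy today: 0/0 raises rather than returning nan.
try:
    0.0 / 0.0
    zero_div_raised = False
except ZeroDivisionError:
    zero_div_raised = True

# Already noisy today: math.exp raises on overflow.
try:
    math.exp(1000)
    exp_raised = False
except OverflowError:
    exp_raised = True

# Silent today: multiplication overflows to inf without any exception.
silent = 1e308 * 10

# With the hypothetical wrapper, the same overflow becomes an error.
try:
    checked(1e308 * 10)
    checked_raised = False
except OverflowError:
    checked_raised = True

print(zero_div_raised, exp_raised, silent, checked_raised)
```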


--
Richard Damon
