[Python-ideas] Re: Fix statistics.median()?

Steven D'Aprano Thu, 26 Dec 2019 17:42:58 -0800

Thanks to everyone commenting on this thread. I haven't quite read it all
yet (I will) but I wanted to get a few comments in now.

On Thu, Dec 26, 2019 at 10:31:00AM -0500, David Mertz wrote:

> Anyway, I propose that the obviously broken version of
> `statistics.median()` be replaced with a better implementation.

To be precise, the problem is not just the implementation, but the
interface, as median is explicitly noted to require orderable data. Data
with NANs is not orderable. Richard is correct: this is a case of
garbage in, garbage out: if you ignore the documented requirements,
you'll get garbage results.
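
To make "not orderable" concrete, here is a minimal sketch of how a float
NAN behaves under comparison. The helper name `is_totally_ordered` is
hypothetical, used only for illustration, and is not part of the
statistics module.

```python
import math

nan = float("nan")

# Every ordering comparison involving a NAN is False, and a NAN is not
# even equal to itself, so sorting has no consistent place to put it.
print(nan < 1.0, nan > 1.0, nan == nan)

def is_totally_ordered(values):
    """Hypothetical check: True if no value is a float NAN.

    Assumption: only float NANs are considered; Decimal NANs and other
    unorderable objects are out of scope for this sketch.
    """
    return not any(isinstance(x, float) and math.isnan(x) for x in values)

print(is_totally_ordered([3, 4, 7, 5]))
print(is_totally_ordered([3, 4, 7, nan, 5]))
```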

However, I am happy to accept that silent failure may not be the ideal
result for everyone. Unfortunately, there is no consensus on what the
ideal result is, with at least four valid responses:

- the status quo: the caller is responsible for dealing with NANs,
just as they are responsible for dealing with unorderable values
passed to min, max, sort, etc. If you know that there are no NANs
in your data, any extra processing to check for NANs is just
wasted effort.

- NANs represent missing values, so they should be ignored;

- the presence of a NAN is an error, and should raise an exception;

- NANs should propagate through the calculation: a NAN anywhere in your
data should return NAN (this is sometimes called "nan poisoning").
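
The four responses above could be sketched as a single wrapper around
statistics.median. Everything here is an assumption made for
illustration: the function name `median_with_nan_policy`, the
`nan_policy` parameter and its four string values are hypothetical and
are not an existing or proposed API of the statistics module.

```python
import math
import statistics

def _is_nan(x):
    # Assumption: only float NANs are detected; Decimal NANs are not.
    return isinstance(x, float) and math.isnan(x)

def median_with_nan_policy(data, nan_policy="unchecked"):
    """Hypothetical wrapper illustrating four possible NAN policies."""
    values = list(data)
    if nan_policy == "unchecked":
        # Status quo: no NAN check, the caller is responsible for the data.
        return statistics.median(values)
    if nan_policy == "ignore":
        # Treat NANs as missing values and drop them.
        return statistics.median([x for x in values if not _is_nan(x)])
    if nan_policy not in ("raise", "propagate"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    if any(_is_nan(x) for x in values):
        if nan_policy == "raise":
            # Treat the presence of a NAN as an error.
            raise ValueError("data contains a NAN")
        # "propagate": a NAN anywhere in the data poisons the result.
        return math.nan
    return statistics.median(values)

data = [3, 4, 7, math.nan, 5]
print(median_with_nan_policy(data, "ignore"))
print(median_with_nan_policy(data, "propagate"))
```

Note that only the "unchecked" branch avoids the extra pass over the
data, which is the cost the status quo option is trying to spare callers
who already know their data is clean.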

Also note that NANs are not just a problem for median. They are a
problem for all order statistics, including percentiles, quartiles and
general quantiles. Python 3.8 adds a quantiles function which has the
same problem:

py> statistics.quantiles([NAN, 3, 4, 7, 5])
[nan, 4.0, 6.0]
py> statistics.quantiles([3, 4, 7, NAN, 5])
[3.5, nan, nan]

NANs aren't as big a problem for other functions like mean and stdev,
but the caller may still want to make the choice of ignore, raise or
return a NAN. So I would like to avoid an ad hoc response to NANs in
median alone, and treat them consistently across the entire module.

Marco, you don't have to use median_low and median_high if you don't
like them, but they aren't any worse than any other choice for
calculating order statistics. All order statistics (apart from min and
max) require you to sometimes make a choice between returning a data
value or interpolating between two data values, and in general there are
*lots* of choices. Here are just a few of them:

"Sample Quantiles in Statistical Packages", Hyndman & Fan, The American
Statistician 1996, Vol 50, No 4, pp. 361-365.
https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf

"Quartiles in Elementary Statistics", Langford, Journal of Statistics
Education Volume 14, Number 3 (2006).
http://www.amstat.org/publications/jse/v14n3/langford.html
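
As a small illustration that the choice of method matters, the
quantiles function added in Python 3.8 already exposes two of these
choices through its `method` argument. This is only a sketch on
made-up sample data, not a recommendation of either method.

```python
import statistics

# Made-up sample data, used only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# "exclusive" is the default; "inclusive" treats the data as including
# the population minimum and maximum. Both interpolate between points.
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [2.75, 5.5, 8.25]
print(inclusive)  # [3.25, 5.5, 7.75]
```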

For median, there are only three choices when the midpoint falls between
two values: the lower value, the higher value, and the average between
the two. All three choices have their pros and cons.
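
The three choices correspond to three functions in the statistics
module. A short sketch on made-up data with an even number of points:

```python
import statistics

# Made-up data with an even count, so the midpoint falls between 3 and 5.
data = [1, 3, 5, 7]

low = statistics.median_low(data)    # the lower of the two middle values
high = statistics.median_high(data)  # the higher of the two middle values
mid = statistics.median(data)        # the average of the two middle values

print(low, high, mid)  # 3 5 4.0
```

Only median_low and median_high are guaranteed to return a value that
actually occurs in the data.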

--
Steven
