Thanks to everyone commenting on this thread. I haven't quite read it all
yet (I will), but I wanted to get a few comments in now.


On Thu, Dec 26, 2019 at 10:31:00AM -0500, David Mertz wrote:

> Anyway, I propose that the obviously broken version of
> `statistics.median()` be replaced with a better implementation.

To be precise, the problem is not just the implementation, but the 
interface, as median is explicitly noted to require orderable data. Data 
with NANs is not orderable. Richard is correct that this is a case of
garbage in, garbage out: if you ignore the documented requirements,
you'll get garbage results.

However, I am happy to accept that silent failure may not be the ideal 
result for everyone. Unfortunately, there is no consensus on what the 
ideal result is, with at least four valid responses:

- the status quo: the caller is responsible for dealing with NANs, 
  just as they are responsible for dealing with unorderable values 
  passed to min, max, sort, etc. If you know that there are no NANs
  in your data, any extra processing to check for NANs is just 
  wasted effort.

- NANs represent missing values, so they should be ignored;

- the presence of a NAN is an error, and should raise an exception;

- NANs should propagate through the calculation: a NAN anywhere in your
  data should return NAN (this is sometimes called "nan poisoning").
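
To make the four options above concrete, here is a minimal sketch of a
wrapper around median. The function name `median_with_policy` and the
`nan_policy` argument are hypothetical, chosen only for illustration;
they are not part of the statistics module. The sketch also assumes the
data contains only ints and floats.

```python
import math
import statistics


def median_with_policy(data, nan_policy="pass"):
    """Median with an explicit NAN policy.

    Hypothetical helper for illustration, not a statistics module API.
    Assumes data holds only ints and floats.
    """
    values = list(data)
    if nan_policy == "pass":
        # Status quo: no checking; the caller guarantees orderable data.
        return statistics.median(values)
    if nan_policy not in ("omit", "raise", "propagate"):
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")

    clean = [x for x in values if not math.isnan(x)]
    if len(clean) == len(values):
        # No NANs present, so every policy agrees.
        return statistics.median(values)
    if nan_policy == "omit":
        # Treat NANs as missing values and ignore them.
        return statistics.median(clean)
    if nan_policy == "raise":
        # Treat the presence of a NAN as an error.
        raise ValueError("NAN found in data")
    # "propagate": a NAN anywhere in the data poisons the result.
    return math.nan
```

Note that only the "pass" branch avoids the extra pass over the data,
which is the cost the status quo option is trying to spare callers who
already know their data is clean.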

Also note that NANs are not just a problem for median. They are a 
problem for all order statistics, including percentiles, quartiles and 
general quantiles. Python 3.8 adds a quantiles function which has the 
same problem:

    py> statistics.quantiles([NAN, 3, 4, 7, 5])
    [nan, 4.0, 6.0]
    py> statistics.quantiles([3, 4, 7, NAN, 5])
    [3.5, nan, nan]
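
The position dependence comes from the fact that every ordering
comparison involving a NAN is false, so a sort cannot place it
consistently. A small sketch (requires Python 3.8 or later for
quantiles; `drop_nans` is a hypothetical helper, and the data is
assumed to be ints and floats only):

```python
import math
import statistics

NAN = float("nan")

# Comparisons with a NAN are always False, whichever side it is on.
print(NAN < 3, 3 < NAN, NAN == NAN)  # False False False


def drop_nans(data):
    """Return the data with NANs removed (hypothetical helper)."""
    return [x for x in data if not math.isnan(x)]


# Once the NANs are stripped, the original position no longer matters.
a = statistics.quantiles(drop_nans([NAN, 3, 4, 7, 5]))
b = statistics.quantiles(drop_nans([3, 4, 7, NAN, 5]))
print(a == b)  # True
```

Stripping is only one of the possible policies, of course; it is used
here just to show that the inconsistency disappears once the data is
orderable.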


NANs aren't as big a problem for other functions like mean and stdev, 
but the caller may still want to make the choice of ignore, raise or 
return a NAN. So I would like to avoid an ad hoc response to NANs in 
median alone, and treat them consistently across the entire module.

Marco, you don't have to use median_low and median_high if you don't 
like them, but they aren't any worse than any other choice for 
calculating order statistics. All order statistics (apart from min and 
max) require you to sometimes make a choice between returning a data 
value or interpolating between two data values, and in general there are 
*lots* of choices. Here are just a few of them:

"Sample Quantiles in Statistical Packages", Hyndman & Fan, The American
Statistician 1996, Vol 50, No 4, pp. 361-365.
https://www.amherst.edu/media/view/129116/original/Sample+Quantiles.pdf

"Quartiles in Elementary Statistics", Langford, Journal of Statistics
Education Volume 14, Number 3 (2006).
http://www.amstat.org/publications/jse/v14n3/langford.html

For median, there are only three choices when the midpoint falls between
two values: the lower value, the higher value, and the average of
the two. All three choices have their pros and cons.
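
For example, with an even number of data points the three existing
functions give the three possible answers:

```python
import statistics

data = [1, 3, 5, 7]  # the midpoint falls between 3 and 5

low = statistics.median_low(data)    # lower of the two middle values
high = statistics.median_high(data)  # higher of the two middle values
mid = statistics.median(data)        # average of the two middle values

print(low, high, mid)  # 3 5 4.0
```

Note that median_low and median_high always return an actual data
value, while median may return a value that does not appear in the data.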


-- 
Steven