On 12/26/19 10:31 AM, David Mertz wrote:
This came up in discussion here before, maybe a year ago. The decision
then was not to change the implementation, but that seemed like a
mistake (and that discussion was about broader things).
Anyway, I propose that the obviously broken version of
`statistics.median()` be replaced with a better implementation.
Python 3.8.0 (default, Nov 6 2019, 21:49:08)
>>> import numpy as np
>>> import pandas as pd
>>> import statistics
>>> nan = float('nan')
>>> items1 = [nan, 1, 2, 3, 4]
>>> items2 = [1, 2, 3, 4, nan]
>>> statistics.median(items1)
2
>>> statistics.median(items2)
3
>>> np.median(items1)
nan
>>> np.median(items2)
nan
>>> pd.Series(items1).median()
2.5
>>> pd.Series(items2).median()
2.5
The NumPy and Pandas answers are both "reasonable" under slightly
different philosophies of how to handle bad values. I think raising an
exception for NaNs would also be reasonable enough.
The one thing that is NOT reasonable is returning different answers
for median depending on the order of the elements.
Getting garbage answers for garbage input isn't THAT unreasonable.
One could argue that detecting common garbage input and rejecting it
(perhaps with an exception) would make more sense.
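To make the "reject it with an exception" option concrete, here is a
minimal sketch of what such a check might look like. The wrapper name
median_rejecting_nan is hypothetical, not part of the statistics
module, and it only detects float NaNs; other numeric types would need
their own test.

```python
import math
import statistics


def median_rejecting_nan(data):
    """Hypothetical wrapper: refuse data that contains a float NaN."""
    values = list(data)  # materialize once so iterators are also accepted
    for x in values:
        # Assumption: only float NaNs are checked here.
        if isinstance(x, float) and math.isnan(x):
            raise ValueError("median is undefined for data containing NaN")
    return statistics.median(values)
```

With this, both items1 and items2 from the example above would raise
ValueError instead of returning 2 and 3.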
Note that the statistics module documentation hints at the issue:
median() implies that the data must be orderable, and NaN isn't
orderable (every ordering comparison involving NaN is false). Since
the statistics module seems to be designed to handle types other than
floats, detecting NaN values is extra expensive, so I think it can be
excused for not checking.
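As a rough illustration of where the extra cost comes from, a check
that works across numeric types cannot just call math.isnan(); one
possible approach is a full extra pass that relies on NaN comparing
unequal to itself. The helper has_nan below is hypothetical, and the
sketch assumes that every NaN-like value satisfies x != x, which holds
for float NaN and quiet Decimal NaN but is not guaranteed for arbitrary
types.

```python
from decimal import Decimal
from fractions import Fraction


def has_nan(data):
    """Hypothetical helper: True if any element compares unequal to itself."""
    # One extra O(n) pass over the data before any sorting happens.
    return any(x != x for x in data)


mixed = [1, 2.0, Fraction(1, 2), Decimal("1.5")]
print(has_nan(mixed))                   # no NaN among ordinary numbers
print(has_nan([1.0, float("nan")]))     # float NaN is detected
print(has_nan([Decimal("NaN")]))        # quiet Decimal NaN is detected
```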
--
Richard Damon