On 12/30/19 11:54 AM, David Mertz wrote:
On Mon, Dec 30, 2019 at 3:32 AM Andrew Barnert via Python-ideas
<python-ideas@python.org> wrote:
On Dec 29, 2019, at 23:50, Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Sun, Dec 29, 2019 at 06:23:03PM -0800, Andrew Barnert via Python-ideas wrote:
>
>> Likewise, it’s even easier to write ignore-nan yourself than to write the DSU yourself:
>>
>> median = statistics.median(x for x in xs if not x.isnan())
>
> Try that with xs = [1, 10**400, 2] and come back to me.
Presumably the end user (unlike the statistics module) knows what
data they have.
No, Steven is right here. In Python we might very sensibly mix
numeric datatypes. But this means we need an `is_nan()` function like
some discussed in these threads, not rely on a method (and not the
same behavior as math.isnan()).
E.g.:
import math, statistics
my_data = {'observation1': 10**400,   # really big amount
           'observation2': 1,         # ordinary size
           'observation3': 2.0,       # ordinary size
           'observation4': math.nan}  # missing data
median = statistics.median_high(x for x in my_data.values() if not is_nan(x))
The answer '2.0' is plainly right here, and there's no reason we
shouldn't provide it.
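For concreteness, here is one way such an `is_nan()` function might look. This is only a sketch: the name and the behavior are assumptions, not anything in the stdlib. It leans on the fact that math.isnan() raises OverflowError for an int too large to convert to float, and on Decimal.is_nan() covering both quiet and signaling NaNs.

```python
import math
import statistics
from decimal import Decimal


def is_nan(x):
    """Hypothetical helper: True if x is a NaN of any supported numeric type."""
    if isinstance(x, Decimal):
        # Decimal.is_nan() is true for quiet and signaling NaNs alike.
        return x.is_nan()
    try:
        return math.isnan(x)
    except OverflowError:
        # An int such as 10**400 cannot be converted to float,
        # so it certainly is not a NaN.
        return False


my_data = {
    "observation1": 10**400,   # really big amount
    "observation2": 1,         # ordinary size
    "observation3": 2.0,       # ordinary size
    "observation4": math.nan,  # missing data
}
median = statistics.median_high(x for x in my_data.values() if not is_nan(x))
```

With the NaN filtered out, three values remain, and median_high picks the middle one.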
My view is that interpreting NaN as Missing Data isn't appropriate for
the statistics module. In your code, because you put the filter in, YOU
added that meaning, which is OK, but I see no grounds to say that
statistics.median(my_data) MUST be 2.0, and several other logical
results have been presented.
For instance, if your last point were defined as 1e400-1e399, which
evaluates to nan, then 2.0 is NOT the reasonable answer. Judging from
the numbers before we lost precision to the subtraction of infinities,
the answer should be inf, or maybe something close to 4.5e399 had the
e-notation numbers not overflowed to infinity but stayed bignums or
decimals.
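To make the arithmetic concrete, here is a rough sketch of both readings. It assumes the default decimal context (28 digits of precision) and, to keep the averaging in a single type, uses Decimal for every observation instead of the mixed int/float values in the original example.

```python
import math
import statistics
from decimal import Decimal

# As floats, both literals overflow to inf, and inf - inf is nan.
as_float = 1e400 - 1e399
float_is_nan = math.isnan(as_float)

# As Decimals, nothing overflows and the subtraction is exact.
big = Decimal("1e400")
last = big - Decimal("1e399")  # Decimal('9E+399')

# Four observations, so median() averages the two middle values.
data = [big, Decimal(1), Decimal(2), last]
exact_median = statistics.median(data)  # (2 + 9E+399) / 2, about 4.5E+399
```

So the same fourth observation is either a nan or roughly 9e399, depending on the type the arithmetic was done in.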
Since Python DOES support mixed-type arrays, I see no reason that
Python needs to adopt the ancient domain-specific (and not universal in
the domain) usage of nan as missing data. Instead, the Python idiom
should more likely be something like None (which gets around the
difficulty of detecting the multiple forms of nan).
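A minimal sketch of that idiom, with a hypothetical helper name: the caller marks missing observations with None and filters them out before calling the statistics function, so no NaN detection is needed at all.

```python
import statistics


def median_ignoring_none(values):
    """Hypothetical helper: median of the observations that are not None."""
    present = [v for v in values if v is not None]
    return statistics.median(present)


readings = [10**400, 1, 2.0, None]  # None marks the missing observation
result = median_ignoring_none(readings)
```

The `is not None` test is an identity check, so it behaves the same no matter which numeric types are in the list.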
Now one issue with your example, which may be the point, is that the
documentation of median currently says it does NOT support mixed-type
lists like the one given above, yet it does seem to handle them as long
as the comparison function gives reasonable results. I suspect that
there are some combinations of extreme values of differing types where
the comparison function fails, and I am not sure there is an easy
solution to make ALL the Number classes always comparable to each other.
One issue is that the type in which the comparison is done most
efficiently is value dependent (it depends on the magnitude of the
values and on how close together they are).
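One mixed-type case that does break today is easy to show, though it fails in the averaging step and not in the comparison, so it is a neighbor of the problem described above and not an instance of it. A small sketch: Decimal and float compare fine, so sorting works and median_low/median_high succeed, but median() on an even-length list has to add the two middle values, and Decimal + float raises TypeError.

```python
import statistics
from decimal import Decimal

mixed = [Decimal("1"), 2.0]

# Only comparisons are needed here, and Decimal vs. float compares fine.
low = statistics.median_low(mixed)
high = statistics.median_high(mixed)

# median() must average the two middle values; Decimal + float is not defined.
try:
    statistics.median(mixed)
    averaged_ok = True
except TypeError:
    averaged_ok = False
```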
--
Richard Damon
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/U7EOI5SVJWX7TP3HWNNB6YVPKC76VFWN/