On 12/29/19 7:05 PM, Christopher Barker wrote:
On Sun, Dec 29, 2019 at 3:26 PM Richard Damon
<rich...@damon-family.org> wrote:
> Frankly, I’m also confused as to why folks seem to think this is an
> issue to be addressed in the sort() functions
The way I see it, is that median doesn't handle NaNs in a reasonable
way, because sorted doesn't handle them,
I don't think so -- it doesn't handle NaNs because doing so takes a
decision about how they should be handled, and code to write; maybe more code
because you can't use the bare sort() functions, but sort will never
solve the problem both generically and properly by itself.
It doesn't handle NaNs because it decided to be a simple routine using
the basic definition, the middle value based on the basic sort. I would
expect that the basic sort routine has a possibility that it has been
optimized by dropping down parts into raw C for speed, while
because it is easy and quick to
not handle NaN, and to handle them you need to define an Official
meaning for them, and there are multiple reasonable meanings.
exactly.
The reason to push most solutions to sorted, is that except for ignore,
which can easily be implemented as a data filter to the input of the
function, the exact same problem occurs in multiple functions (in the
statistics module, that would include quantile) so by the principle of
DRY, that is the logical place to implement the solution (if we don't
implement the solution as an input filter)
well, no -- the logical place for DRY is to use the SAME sort
implementation for all functions in the statistics module that need a
sort. It only makes sense to try to push this to the standard sort if
it were to be used, in the same way, by many other uses of sort, and
it didn't break any current uses. On the other hand, saying "this is
how the statistics module interprets NaNs, and how things will be
sorted" is localized -- it does not require it be useful for
anything else, and it will, by definition, not break any code that
doesn't use the statistics module.
At its beginning, the statistics module disclaims being a complete all
encompassing statistics package,
sure -- but that doesn't mean it couldn't be more complete than it
currently is.
If being more complete is 'simple', yes, but it doesn't look to be
(unless the fix goes into sorted)
and suggests using one if you need more advanced features, which I
would consider most processing of NaN to be included in.
That's a perfectly valid opinion, but while I think that perhaps
"handling missing values" could be considered advanced, I'm not sure
"giving a correct and meaningful answer for all values of expressly
supported data types" is "advanced" -- in a way, quite the opposite --
it's less "advanced" coders, ones that are not thinking about where
NaNs might appear, and what the implication of that is, that are going
to be bitten by the current implementation.
Docs can help, but I think we can, and should, do better than that --
after all it's well known that "no one reads documentation".
Which is EXACTLY the reason I say that if this is important enough to
fix in median, it is important enough to fix in sorted. sorted gives
exactly the same nonsense result, it is only a bit more obvious because
it gives all the points. Is [3, nan, 1, 2, 4] a sorted list?
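To make that question concrete, here is a minimal sketch. The helper name is_sorted is made up for this illustration; it just checks that every adjacent pair satisfies a <= b, which is one reasonable reading of "sorted".

```python
def is_sorted(seq):
    """Return True if every adjacent pair satisfies a <= b (hypothetical helper)."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


nan = float("nan")
data = [3, nan, 1, 2, 4]
result = sorted(data)

# Every ordering comparison involving NaN is False, so no arrangement
# that places a NaN next to another element can pass this check.
print(result, is_sorted(result))
print(is_sorted(sorted([3, 1, 2, 4])))  # True
```

Whatever order sorted() happens to return here, the check fails, so by this definition no list with an interior NaN is ever "sorted".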
One big reason to NOT fix the issue with NaNs in median is that such a
fix likely has a measurable impact on the processing of the median.
You mean performance? Sure, but as I've argued before (no idea if
anyone agrees with me) the statistics package is already not a high
performance package anyway. If it turns out that it slows it down by,
say, a factor of two or more, then yes, maybe we need to forget it.
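One rough way to put a number on that would be something like the sketch below. median_checked is a hypothetical wrapper, not anything in the statistics module, and it only looks at float NaNs; the sample size and repeat count are arbitrary, and the timings will vary by machine and Python version.

```python
import math
import random
import statistics
import timeit


def median_checked(data):
    """Hypothetical wrapper: scan for float NaNs first, then take the median."""
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        raise ValueError("NaN in input")
    return statistics.median(data)


random.seed(0)
sample = [random.random() for _ in range(10_000)]

# Time the stock median against the version with the extra NaN scan.
plain = timeit.timeit(lambda: statistics.median(sample), number=20)
checked = timeit.timeit(lambda: median_checked(sample), number=20)
print(f"plain: {plain:.4f}s  checked: {checked:.4f}s  ratio: {checked / plain:.2f}")
```

This only measures one possible check on one kind of input, so it is a starting point for the discussion rather than a benchmark.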
I suspect that the simplest solution, and one that doesn't impact other
uses would be simple filter functions (and perhaps median could be
defined with an argument for what function to use, with a None option
that would be fairly quick). One filter would remove NaNs (or None), one
would throw an exception if there is a NaN, and another would just
return the sequence [nan] if there are any NaNs in the input sequence
(so the median would be nan). The same options could be added to other
operations like quantile which have a similar issue, and made available
to the program for other use.
I agree -- this could be a good way to go.
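For what it's worth, the three filters described above could look roughly like this. All of the names here (ignore_nans, raise_on_nan, propagate_nan, the nan_filter argument) are invented for the sketch and are not an existing statistics API; it also assumes only float NaNs matter and leaves Decimal and Fraction aside.

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NaNs are detected in this sketch.
    return isinstance(x, float) and math.isnan(x)


def ignore_nans(data):
    """Drop NaNs (and None) from the input."""
    return [x for x in data if x is not None and not _is_nan(x)]


def raise_on_nan(data):
    """Raise ValueError if any NaN is present, else pass the data through."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input")
    return data


def propagate_nan(data):
    """Collapse the input to [nan] if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def median(data, nan_filter=None):
    """Hypothetical median wrapper; nan_filter=None skips filtering entirely."""
    if nan_filter is not None:
        data = nan_filter(data)
    return statistics.median(data)


values = [3, math.nan, 1, 2, 4]
print(median(values, ignore_nans))    # 2.5
print(median(values, propagate_nan))  # nan
```

With nan_filter=None the call goes straight to statistics.median, so code that is known to be NaN-free pays nothing extra, and the same filters could be reused in front of quantile-style functions.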
There is one other option that might be possible to fix sorted,
<snip> see the last post if you want the details ...
This would say that sorted would work with NaNs, but for median most
NaNs are treated as more positive than infinity, so the median is
biased, but at least you don't get absurd results.
yeah, but this is what I meant above -- you'd still want to check for
NaNs in the statistics functions. Though it would be a fast check,
'cause you could check only the ones on the end after sorting.
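A sketch of that "check only the end" idea, under a big assumption: the snipped proposal changes how sorting treats floats, whereas this stand-in just uses a sort key to push NaNs past every other value. nan_last_key, sort_nans_last and has_nan_sorted are hypothetical names for this illustration.

```python
import math


def nan_last_key(x):
    # Stand-in for "NaN sorts as more positive than infinity":
    # NaNs get key (1, 0.0), everything else gets (0, x).
    return (1, 0.0) if math.isnan(x) else (0, x)


def sort_nans_last(data):
    """Sort with all NaNs grouped at the end."""
    return sorted(data, key=nan_last_key)


def has_nan_sorted(sorted_data):
    """With NaNs grouped at the end, only the last element needs checking."""
    return bool(sorted_data) and math.isnan(sorted_data[-1])


s = sort_nans_last([3, math.nan, 1, 2, 4])
print(s[:4], has_nan_sorted(s))  # [1, 2, 3, 4] True
```

The check after sorting is a single isnan call instead of a scan over the whole input, which is the cheap part; the key function itself still adds per-element cost that the float-level proposal would not.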
But the biggest barrier is that it would be a fair bit of churn on the
sort() functions (and the float class), and would only help for floats
anyway. If someone wants to propose this, please do -- but I don't
think we should wait for that to do something with the statistics
module. Also, if you want to pursue this, do go back and find the
thread about type-checked sorting -- I think this is it:
https://mail.python.org/pipermail/python-dev/2016-October/146613.html
I'm not sure if anything ever came of that.
- CHB
--
Christopher Barker, PhD
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/75BZ5UACRE6SWUJ4C4RYH2G6AQFKN7J3/
Code of Conduct: http://python.org/psf/codeofconduct/
--
Richard Damon
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/GDTC6AD6YDWVHZGXPJLDBPXYMFYDMVHL/