On 12/29/19 7:05 PM, Christopher Barker wrote:
On Sun, Dec 29, 2019 at 3:26 PM Richard Damon <rich...@damon-family.org> wrote:

    > Frankly, I’m also confused as to why folks seem to think this is an
    > issue to be addressed in the sort() functions

    The way I see it, is that median doesn't handle NaNs in a reasonable
    way, because sorted doesn't handle them,


I don't think so -- it doesn't handle NaNs because handling them takes a decision about how they should be handled, and code to write; maybe more code, because you can't use the bare sort() functions. But sort will never solve the problem both generically and properly by itself.

It doesn't handle NaNs because it was written as a simple routine using the basic definition: the middle value of the data as ordered by the basic sort. I would expect that parts of the basic sort routine may have been optimized by dropping down into raw C for speed, while

    because it is easy and quick to
    not handle NaN, and to handle them you need to define an Official
    meaning for them, and there are multiple reasonable meanings.


exactly.

    The reason
    to push most solutions to sorted, is that except for ignore, which
    can
    easily be implemented as a data filter to the input of the
    function, the
    exact same problem occurs in multiple functions (in the statistics
    module, that would include quantile) so by the principle of DRY,
    that is
    the logical place to implement the solution (if we don't implement
    the
    solution as an input filter)


well, no -- the logical place for DRY is to use the SAME sort implementation for all functions in the statistics module that need a sort. It only makes sense to try to push this to the standard sort if it were to be used, in the same way, by many other uses of sort, and it didn't break any current uses. On the other hand, saying "this is how the statistics module interprets NaNs, and how things will be sorted" is localized -- it does not require that it be useful for anything else, and it will, by definition, not break any code that doesn't use the statistics module.

    At its beginning, the statistics module disclaims being a complete
    all
    encompassing statistics package,


sure -- but that doesn't mean it couldn't be more complete than it currently is.

If being more complete is 'simple', yes, but it doesn't look to be (unless the fix goes into sorted).


    and suggests using one if you need more
    advanced features, which I would consider most processing of NaN
    to be
    included in.


That's a perfectly valid opinion, but while I think that perhaps "handling missing values" could be considered advanced, I'm not sure "giving a correct and meaningful answer for all values of expressly supported data types" is "advanced" -- in a way, quite the opposite -- it's the less "advanced" coders, the ones who are not thinking about where NaNs might appear and what the implications are, who are going to be bitten by the current implementation.

Docs can help, but I think we can, and should, do better than that -- after all it's well known that "no one reads documentation".

Which is EXACTLY the reason I say that if this is important enough to fix in median, it is important enough to fix in sorted. sorted gives exactly the same nonsense result; it is only a bit more obvious because it shows all the points. Is [3, nan, 1, 2, 4] a sorted list?
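
To make the point concrete, here is a small sketch using only the standard library. The helper `is_ascending` is a name made up for this example, not something in the statistics module. It checks whether every adjacent pair in a sequence satisfies a <= b. Since every ordering comparison against a NaN is False, no arrangement of a list containing a NaN can pass that check, whatever order sorted() happens to return.

```python
import math
import statistics


def is_ascending(seq):
    """Return True if every adjacent pair satisfies a <= b.

    Hypothetical helper for this example only.
    """
    return all(a <= b for a, b in zip(seq, seq[1:]))


nan = math.nan
data = [3, nan, 1, 2, 4]

# Same values with the NaN filtered out, for comparison.
clean = [x for x in data if not (isinstance(x, float) and math.isnan(x))]

# sorted() does not raise on NaN, but its result cannot be in ascending
# order, because comparisons such as nan <= 1 and 3 <= nan are both False.
print(sorted(data), is_ascending(sorted(data)))
print(sorted(clean), is_ascending(sorted(clean)))

# median() picks the middle of whatever order sorted() produced, so its
# answer for the NaN-containing data depends on where the NaN sits.
print(statistics.median(data), statistics.median(clean))
```

The exact order printed for the NaN-containing list depends on the sort implementation and on the input order, which is the whole problem.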

    One big reason to NOT fix the issue with NaNs in median is
    that such a fix likely has a measurable impact on the
    processing of
    the median.


You mean performance? Sure, but as I've argued before (no idea if anyone agrees with me) the statistics package is already not a high performance package anyway. If it turns out that it slows it down by, say, a factor of two or more, then yes, maybe we need to forget it.

    I suspect that the simplest solution, and one that doesn't
    impact other uses, would be simple filter functions (and perhaps
    median
    could be defined with an argument for what function to use, with a
    None
    option that would be fairly quick). One filter would remove NaNs (or
    None), one would throw an exception if there is a NaN, and another
    would
    just return the sequence [nan] if there are any NaNs in the input
    sequence (so the median would be nan). The same options could be
    added
    to other operations like quantile, which has a similar issue, and made
    available to the program for other use.


I agree -- this could be a good way to go.
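
A minimal sketch of what the filter idea might look like, assuming float NaNs only. All of the names here (`drop_nans`, `reject_nans`, `propagate_nans`, and the `nan_filter` argument) are invented for illustration; none of them exist in the statistics module, and the wrapper simply calls the existing statistics.median on the filtered data.

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NaNs are considered; Decimal and other
    # numeric types are out of scope for this sketch.
    return isinstance(x, float) and math.isnan(x)


def drop_nans(data):
    """Filter that removes NaNs from the input."""
    return [x for x in data if not _is_nan(x)]


def reject_nans(data):
    """Filter that raises ValueError if the input contains a NaN."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input data")
    return data


def propagate_nans(data):
    """Filter that collapses the input to [nan] if it contains a NaN."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def median(data, nan_filter=None):
    """Hypothetical wrapper: apply an optional filter, then take the median.

    With nan_filter=None the data goes straight to statistics.median,
    so there is no extra pass over the input.
    """
    if nan_filter is not None:
        data = nan_filter(data)
    return statistics.median(data)


values = [3, math.nan, 1, 2, 4]
print(median(values, drop_nans))       # median of 1, 2, 3, 4
print(median(values, propagate_nans))  # nan
```

Because the filters are ordinary functions taking and returning a sequence, the same three could be passed to a similar wrapper around other functions that sort their input.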

    There is one other option that might be possible to fix sorted,


<snip> see the last post if you want the details ...

    This would say that sorted would work with NaNs, but for median most
    NaNs are treated as more positive than infinity, so the median is
    biased, but at least you don't get absurd results.


yeah, but this is what I meant above -- you'd still want to check for NaNs in the statistics functions. Though it would be a fast check, 'cause you could check only the ones at the end after sorting.
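
A rough sketch of that "check only the end" idea. Since sorted() does not order NaNs today, this stands in for the proposed behavior with a key function that places NaNs after every other value; `nan_last_key` and `median_checked` are names invented for this example, and returning nan when a NaN is present is just one of the possible policies.

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NaNs are considered.
    return isinstance(x, float) and math.isnan(x)


def nan_last_key(x):
    """Sort key that orders NaNs after all other values.

    The NaN itself is replaced by 0.0 in the key so that no
    comparison ever involves a NaN.
    """
    nan = _is_nan(x)
    return (nan, 0.0 if nan else x)


def median_checked(data):
    """Median that returns nan if any NaN is present.

    After sorting with nan_last_key, any NaNs are grouped at the end,
    so looking at the last element is enough to detect them.
    """
    ordered = sorted(data, key=nan_last_key)
    if not ordered:
        raise statistics.StatisticsError("no median for empty data")
    if _is_nan(ordered[-1]):
        return math.nan
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_checked([3, math.nan, 1, 2, 4]))  # nan
print(median_checked([3, 1, 2, 4]))            # 2.5
```

The key function costs an extra call per element, so this says nothing about how fast a native change to sort() would be; it only shows that the NaN check itself becomes a single look at the last element.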

But the biggest barrier is that it would be a fair bit of churn on the sort() functions (and the float class), and would only help for floats anyway. If someone wants to propose this, please do -- but I don't think we should wait for that to do something with the statistics module. Also, if you want to pursue this, do go back and find the thread about type-checked sorting -- I think this is it:

https://mail.python.org/pipermail/python-dev/2016-October/146613.html

I'm not sure if anything ever came of that.

- CHB

--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/75BZ5UACRE6SWUJ4C4RYH2G6AQFKN7J3/
Code of Conduct: http://python.org/psf/codeofconduct/


--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GDTC6AD6YDWVHZGXPJLDBPXYMFYDMVHL/
Code of Conduct: http://python.org/psf/codeofconduct/
