[Python-ideas] Re: Fix statistics.median()?

Richard Damon Sun, 29 Dec 2019 17:02:11 -0800

On 12/29/19 7:05 PM, Christopher Barker wrote:

On Sun, Dec 29, 2019 at 3:26 PM Richard Damon <rich...@damon-family.org> wrote:
    > Frankly, I’m also confused as to why folks seem to think this is an
    > issue to be addressed in the sort() functions

    The way I see it, is that median doesn't handle NaNs in a reasonable
    way, because sorted doesn't handle them,
I don't think so -- it doesn't handle NaNs because it takes a decision about how they should be handled, and code to write; maybe more code because you can't use the bare sort() functions, but sort will never solve the problem both generically and properly by itself.

It doesn't handle NaNs because it decided to be a simple routine using the basic definition, the middle value based on the basic sort. I would expect that the basic sort routine has a possibility that it has been optimized by dropping down parts into raw C for speed, while

    because it is easy and quick to
    not handle NaN, and to handle them you need to define an Official
    meaning for them, and there are multiple reasonable meanings.


exactly.

    The reason
    to push most solutions to sorted, is that except for ignore, which
    can
    easily be implemented as a data filter to the input of the
    function, the
    exact same problem occurs in multiple functions (in the statistics
    module, that would include quantile) so by the principle of DRY,
    that is
    the logical place to implement the solution (if we don't implement
    the
    solution as an input filter)
well, no -- the logical place for DRY is to use the SAME sort implementation for all functions in the statistics module that need a sort. It only makes sense to try to push this to the standard sort if it were to be used, in the same way, by many other uses of sort, and it didn't break any current uses. On the other hand, saying "this is how the statistics module interprets NaNs, and how things will be sorted" is a localized decision -- it does not require that it be useful for anything else, and it will, by definition, not break any code that doesn't use the statistics module.
    At its beginning, the statistics module disclaims being a complete
    all
    encompassing statistics package,
sure -- but that doesn't mean it couldn't be more complete than it currently is.

If being more complete is 'simple', yes, but it doesn't look to be (unless the fix goes into sorted)

    and suggests using one if you need more
    advanced features, which I would consider most processing of NaN
    to be
    included in.
That's a perfectly valid opinion, but while I think that perhaps "handling missing values" could be considered advanced, I'm not sure "giving a correct and meaningful answer for all values of expressly supported data types" is "advanced" -- in a way, quite the opposite -- it's less "advanced" coders, ones that are not thinking about where NaNs might appear, and what the implication of that is, that are going to be bitten by the current implementation.
Docs can help, but I think we can, and should, do better than that -- after all it's well known that "no one reads documentation".

Which is EXACTLY the reason I say that if this is important enough to fix in median, it is important enough to fix in sorted. sorted gives exactly the same nonsense result, it is only a bit more obvious because it gives all the points. Is [3, nan, 1, 2, 4] a sorted list?
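To make the question concrete, here is a minimal sketch that checks whether the output of sorted() is actually ordered. The helper name is_ordered is made up for this illustration and is not part of any library.

```python
import math
import statistics


def is_ordered(seq):
    """Return True if every adjacent pair satisfies a <= b."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


data = [3, math.nan, 1, 2, 4]
result = sorted(data)

# Every ordering comparison against NaN is False, so no arrangement of
# this data can pass the adjacent-pair check.
print(result, is_ordered(result))

# statistics.median() sorts its input internally, so it picks its
# "middle" value out of that same not-really-sorted sequence.
print(statistics.median(data))

# With the NaN filtered out first, sorted() behaves as expected.
clean = [x for x in data if not math.isnan(x)]
print(sorted(clean), is_ordered(sorted(clean)))
```

The printed order of the NaN-containing list depends on the sort implementation, so the sketch deliberately makes no claim about it beyond "not ordered".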

    One big reason to NOT fix the issue with NaNs in median is
    that such a fix likely has a measurable impact on the
    processing of
    the median.
You mean performance? Sure, but as I've argued before (no idea if anyone agrees with me) the statistics package is already not a high performance package anyway. If it turns out that it slows it down by, say, a factor of two or more, then yes, maybe we need to forget it.
    I suspect that the simplest solution, and one that doesn't
    impact other uses would be simple filter functions (and perhaps
    median
    could be defined with an argument for what function to use, with a
    None
    option that would be fairly quick). One filter would remove NaNs (or
    None), one would throw an exception if there is a NaN, and another
    would
    just return the sequence [nan] if there are any NaNs in the input
    sequence (so the median would be nan). The same options could be
    added
    to other operations like quantile, which has a similar issue, and made
    available to the program for other use.


I agree -- this could be a good way to go.
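A rough sketch of what those three filters might look like, purely as an illustration. None of these names (drop_nans, raise_on_nans, propagate_nans, filtered_median, nan_filter) exist in the statistics module; they are assumptions made up here, and the wrapper simply calls the existing statistics.median().

```python
import math
import statistics


def _is_nan(x):
    # NaN is the only value that compares unequal to itself.
    return x != x


def drop_nans(data):
    """Hypothetical filter: remove NaNs (and None) from the input."""
    return [x for x in data if x is not None and not _is_nan(x)]


def raise_on_nans(data):
    """Hypothetical filter: raise if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN found in input data")
    return data


def propagate_nans(data):
    """Hypothetical filter: collapse the input to [nan] if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def filtered_median(data, nan_filter=None):
    """Median with an optional pre-filter; None skips the check entirely."""
    if nan_filter is not None:
        data = nan_filter(data)
    return statistics.median(data)


data = [3, math.nan, 1, 2, 4]
print(filtered_median(data, drop_nans))       # median of [1, 2, 3, 4]
print(filtered_median(data, propagate_nans))  # nan
```

With nan_filter=None the call goes straight to statistics.median(), so callers who know their data is clean pay nothing extra. The same filters could be passed to a similar wrapper around other functions that sort their input.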

    There is one other option that might be possible to fix sorted,


<snip> see the last post if you want the details ...

    This would say that sorted would work with NaNs, but for median most
    NaNs are treated as more positive than infinity, so the median is
    biased, but at least you don't get absurd results.
yeah, but this is what I meant above -- you'd still want to check for NaNs in the statistics functions. Though it would be a fast check, 'cause you could check only the ones on the end after sorting.
But the biggest barrier is that it would be a fair bit of churn on the sort() functions (and the float class), and would only help for floats anyway. If someone wants to propose this, please do -- but I don't think we should wait for that to do something with the statistics module. Also, if you want to pursue this, do go back and find the thread about type-checked sorting -- I think this is it:
https://mail.python.org/pipermail/python-dev/2016-October/146613.html

I'm not sure if anything ever came of that.
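For the "NaNs sort above infinity, then check only the end" idea, here is a small sketch that approximates it with a sort key instead of any change to sorted() or the float class. That is a swapped-in technique, not the proposal itself; the names nan_last_key, sort_nans_last and has_nan are invented for the example, and this version puts every NaN last regardless of sign.

```python
import math


def nan_last_key(x):
    """Sort key that orders every NaN after all other values, including +inf."""
    is_nan = x != x
    # Replace the NaN with a placeholder so the key itself never
    # contains a value that breaks comparisons.
    return (is_nan, 0.0 if is_nan else x)


def sort_nans_last(data):
    return sorted(data, key=nan_last_key)


def has_nan(sorted_data):
    """Fast check: after sort_nans_last(), any NaN must be at the end."""
    return bool(sorted_data) and sorted_data[-1] != sorted_data[-1]


data = [3.0, math.nan, 1.0, math.inf, 2.0]
ordered = sort_nans_last(data)
print(ordered)
print(has_nan(ordered))
```

A statistics function using this kind of ordering would still have to decide what to do once has_nan() reports True, which is the policy question the rest of the thread is about.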

- CHB

--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/75BZ5UACRE6SWUJ4C4RYH2G6AQFKN7J3/
Code of Conduct: http://python.org/psf/codeofconduct/



--
Richard Damon
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/GDTC6AD6YDWVHZGXPJLDBPXYMFYDMVHL/
