On 12/28/19 1:14 AM, Christopher Barker wrote:
On Fri, Dec 27, 2019 at 8:14 PM Richard Damon <rich...@damon-family.org> wrote:

    > It is a well known axiom of computing that returning an *incorrect*
    > result is a very bad thing.

    There is also an axiom that you can only expect valid results if you
    meet the operation's preconditions.


sure.

    Sometimes, being totally defensive in checking for 'bad' inputs costs
    you too much performance.


it can, yes, there are no hard rules about anything.

    The stated requirement on the statistics module is you feed it
    'numbers', and a NaN is by definition Not a Number.


Sure, but NaN IS a part of the Python float type, and it can and will show up once in a while. That is not the same as expecting median (or any other function in the module) to work with some arbitrary other type. Practicality beats purity -- floats WILL be widely used in the module, so they should be accommodated.

Here is the text in the docs:
"""
This module provides functions for calculating mathematical statistics of numeric (Real-valued) data.
<snip>
Unless explicitly noted, these functions support int, float, Decimal and Fraction. Behaviour with other types (whether in the numeric tower or not) is currently unsupported.
"""

So this is pretty clear - the module is designed to work with int, float, Decimal and Fraction -- so I think it's worth some effort to well-support those. And both float and Decimal have a NaN of some sort.

But the documentation that you reference says it works with NUMBERS, and NaNs are explicitly NOT A NUMBER, so the statistics module specifically hasn't made a claim that it will work with them.

Also, the general section says "unless explicitly noted", and median makes an explicit note that types which support ordering but not addition need to use a different function. That implies that a data type that IS ordered and supports addition is usable with median (presumably some user-defined number class that acts close enough to the other number classes that the average of a and b is a value between them).


    The Median function also implies that its inputs have the property of
    having an order, as well as being able to be added (and if you can't
    add them, then you need to use median_lower or median_upper)


sure, but those are expected of the types above anyway. It doesn't seem to me that this package is designed to work with ANY type that has order and supports the operations needed by the function. For instance, lists of numbers can be compared, so:

In [69]: stuff = [[1,2,3],
    ...:          [4,6,1],
    ...:          [8,9],
    ...:          [4,5,6,7,8,9],
    ...:          [5.0]
    ...:          ]

In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]

Actually, since lists don't support addition in the manner required, median isn't appropriate here, but median_lower() may have meaning. For example, if the lists represented hierarchical references in a document (Chapter 1, Section 2, Paragraph 3), then the median might have a meaning.

The fact that that worked is just a coincidence; it is not an important part of the design that it does.

    I will also point out that really the median function is a small part
    of the problem, all the x-tile based functions have the same issue,


absolutely, and pretty much all of them, though many will just give you NaN as a result, which is much better than an arbitrary result.

By x-tile based functions I mean things like the median (the 50th percentile) and the lower and upper quartiles (the 25th and 75th percentiles). The one provided in the module is quantiles(), which can actually compute any even grouping.
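
As a small illustration of how these functions relate, the sketch below uses made-up sample data and assumes Python 3.8 or later, where statistics.quantiles() is available. It shows that the middle quartile cut point is the same value median() returns, so both depend on the same sorted view of the data:

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quartiles.
quartiles = statistics.quantiles(data, n=4)
median = statistics.median(data)

# The middle cut point is the 50th percentile, i.e. the median.
print(quartiles, median)
```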

    and fundamentally it is a problem with sorted().


I don't think so. In fact, sorted() is explicitly designed to work with any type that is consistently ordered (I'm not sure if they HAVE to be "totally ordered"), and floats with NaNs are not that.

Most sort routines require that the data define at least a partial order, and they effectively assume a total order: if a < b is false and b < a is false, then a and b are treated as members of an equivalence class whose relative order we don't care about. Sets, for instance, with < meaning "is a proper subset of", have a consistent ordering but don't form consistent equivalence classes, and thus don't sort properly.
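
The point about sets can be seen in a few lines. This is only a sketch with illustrative values; incomparable() is a hypothetical helper, not something from the standard library:

```python
# Three sets compared with "<" (proper subset); values are illustrative only.
a, b, c = {1}, {2}, {1, 3}

def incomparable(x, y):
    """True when neither x < y nor y < x holds."""
    return not (x < y) and not (y < x)

print(incomparable(a, b))  # a and b are unordered with respect to each other
print(incomparable(b, c))  # b and c are unordered with respect to each other
print(a < c)               # yet a is strictly below c
```

If "unordered" behaved like an equivalence, a and c would have to be unordered too. Since a < c, a sort that treats unordered pairs as ties has no guarantee of placing a before c.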

As the statistics module is specifically designed to work with numbers (and essentially only with numbers) it's the appropriate place to put an accommodation for NaNs. Not to mention that while sorted() could be adapted to do something more consistent with NaNs, since it is a general purpose function, it's hard to know what the behavior should be -- raise an Exception? remove them? put them all at the front or back? Which makes sense depends entirely on the application. The statistics module, on the other hand, is for statistics, so while there are still multiple options, it's a lot easier to pick one and go for it.

    Has anyone tried to implement a version of these that checks for inputs
    that don't have a total order and seen what the performance impact is?


I think checking for total order is a red herring -- there really is a reason to specifically deal with NaNs in floats (and Decimals), not ANY type that may not be totally ordered.

Part of the problem is that fixing one issue doesn't come close to fixing the whole problem. The stated problem is that newbie/casual programmers get confused when some things don't work once NaNs get into the mix. Why is it OK to say that [3, 1, 4, nan, 2] is a 'sorted' array, but that 4 can't be the median of that array?

I am not saying that we can't fix the problem (though I think I have made a reasonable argument that it isn't a problem that MUST be fixed), but rather that a change in just median is the wrong spot to make this fix. The real problem is that naive programmers, when they do something wrong and get NaNs into their data, get confused by some of the strange answers that result. The fact that NaNs don't order is one of the points of confusion, and rather than fixing, one by one, the various operations defined on sorted data to handle the issue, why not go to the core issue and deal with the base sorting operations? Perhaps even better would be a math mode (unfortunately, it would perhaps need to be the default, since we are trying to help beginners) that isn't fully IEEE compliant but throws exceptions on the errors that get us into the territory that causes the confusion. This might not actually impact that many programs in real life: how many programs actually need to generate NaNs as a result of calculations?


    Testing for NaNs isn't trivial, as elsewhere it was pointed out that
    how you check is based on the type of the number you have (Decimals
    being different from floats).


yes, that is unfortunate.
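
For concreteness, here is a minimal sketch of what a type-aware NaN test might look like for the four types the docs list. The name is_nan() is hypothetical; it is not how the statistics module does it, just one way such a check could be written:

```python
import math
from decimal import Decimal

def is_nan(x):
    """Return True if x is a float NaN or a Decimal NaN (hypothetical helper)."""
    if isinstance(x, Decimal):
        # Decimal.is_nan() is true for both quiet and signaling NaNs.
        return x.is_nan()
    if isinstance(x, float):
        return math.isnan(x)
    # int and Fraction have no NaN value.
    return False
```

Calling math.isnan() on everything is not enough, since it converts its argument to float first, and that conversion fails for a signaling Decimal NaN and for ints too large for a float.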

    To be really complete, you want to actually
    detect that you have some elements that don't form a total order with
    the other elements.


as above, being "complete" isn't necessary here. And even if you were complete, what would you DO with a general non-totally-ordered type? Raising an Exception would be OK, but any other option (like treating it as a missing value) wouldn't make sense if you don't know more about the type.

The only answer that I see that makes sense is raising an Exception (likely in sorted()). You also probably can't be totally thorough in checking, as completely checking transitivity would be an N**3 operation, which is way too slow. Likely you would live with just testing that each pair you check has exactly one of a < b, a == b, b < a being true.
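
A rough sketch of that cheaper pairwise test follows. Both exactly_one() and checked_sorted() are hypothetical names, and the check only looks at each element against itself and its neighbour in the sorted result, so it is O(N) and does not prove a total order:

```python
def exactly_one(a, b):
    """True if exactly one of a < b, a == b, b < a holds."""
    return sum([bool(a < b), bool(a == b), bool(b < a)]) == 1

def checked_sorted(values):
    """Sort values, raising ValueError if a cheap consistency check fails."""
    result = sorted(values)
    for i, item in enumerate(result):
        # A NaN fails this: it is neither less than nor equal to itself.
        if not exactly_one(item, item):
            raise ValueError(f"{item!r} does not compare equal to itself")
        # Neighbours in the sorted result must be ordered or equal.
        if i and not exactly_one(result[i - 1], item):
            raise ValueError(f"{result[i - 1]!r} and {item!r} are unordered")
    return result
```

With this, checked_sorted([3, 1, 2]) returns the sorted list, while a list containing float("nan"), or a pair of unrelated sets such as {1} and {2}, raises ValueError.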

    In many ways, half fixing the issue makes it worse, as improving
    reliability can lead you to forget that you need to take care, so the
    remaining issues catch you harder.


I can't think of an example of this -- it's a fine principle, but if we were to make the entire statistics module do something reasonable with NaNs -- what other issues, exactly, would that hide?

In short: practicality beats purity:

- The fact that the module is designed to work with all the standard number types doesn't mean it has to work with ANY type that supports the operations needed for a given function.

- NaNs are part of the Python float and Decimal implementations -- so they WILL show up once in a while. It would be good to handle them.

- NaNs can also be very helpful to indicate missing values -- this can actually be very handy for statistics calculations. So it could be a nice feature to add -- that NaN means "missing value"

ASSUMING that NaNs represent missing data is just ONE possible interpretation for it. Making it THE interpretation in a low level package seems wrong. It also is an interpretation that is easy to create with a simple helper function that takes one sequence and returns another one where all the NaNs are removed. Other interpretations of what a NaN should reflect aren't as easy to implement outside the operation, so if you are going to pick an interpretation, it probably should be something else that is hard to handle externally.
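
Such a helper might look like the sketch below. The name drop_nans() and the sample data are hypothetical, and it only knows about float and Decimal NaNs:

```python
import math
import statistics
from decimal import Decimal

def drop_nans(values):
    """Return a new list with float and Decimal NaNs removed (hypothetical helper)."""
    def is_nan(x):
        if isinstance(x, Decimal):
            return x.is_nan()
        return isinstance(x, float) and math.isnan(x)
    return [x for x in values if not is_nan(x)]

# Treat NaN as "missing" by filtering before calling the statistics function.
data = [3, 1, 4, float("nan"), 2]
print(statistics.median(drop_nans(data)))
```

The caller opts in to the "missing value" interpretation explicitly, and the statistics function itself never sees a NaN.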

Also, in Python, because it is dynamically typed, it would be better, in my mind, to use something like None to indicate missing data in the array. NaN was chosen in some languages as a missing value because they couldn't handle mixed-type data arrays.


-CHB


--
Christopher Barker, PhD

--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/VRDAF4HV4GSTTHSK7NM5KOLRB3QPOO72/