[Python-ideas] Re: Fix statistics.median()?

David Mertz Sat, 28 Dec 2019 17:44:13 -0800

This is sophistry. NaN is an instance of the abstract type numbers.Number
and the concrete type float. IEEE-754 defines NaN as a collection of required
values in any floating point type.


I know the acronym suggests otherwise in a too-cute way, but NaN is
archetypically a number in a computer science sense (but not in a pure math
way, of course).
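
A quick check of the type claim above, as a minimal sketch (the variable
names are illustrative only):

```python
import numbers

nan = float("nan")

# NaN is a float, and float is registered under the numbers.Number ABC.
is_number = isinstance(nan, numbers.Number)
is_float = isinstance(nan, float)

# What makes it unusual is its comparison behaviour, not its type.
equals_itself = (nan == nan)
```

Here is_number and is_float are both true, while equals_itself is false.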

Likewise, YAML (YAML Ain't Markup Language) is a markup language. And GNU
(GNU's Not Unix) is a Unix system.

On Sat, Dec 28, 2019, 8:30 PM Richard Damon <rich...@damon-family.org>
wrote:

> On 12/28/19 1:14 AM, Christopher Barker wrote:
> > On Fri, Dec 27, 2019 at 8:14 PM Richard Damon
> > <rich...@damon-family.org> wrote:
> >
> >     > It is a well known axiom of computing that returning an *incorrect*
> >     > result is a very bad thing.
> >
> >     There is also an axiom that you can only expect valid results if you
> >     meet the operations pre-conditions.
> >
> >
> > sure.
> >
> >     Sometimes, being totally defensive in checking for 'bad' inputs costs
> >     you too much performance.
> >
> >
> > it can, yes, there are no hard rules about anything.
> >
> >     The stated requirement on the statistics module is you feed it
> >     'numbers', and a NaN is by definition Not a Number.
> >
> >
> > Sure, but NaN IS a part of the Python float type, and it can and will
> > show up once in a while. That is not the same as expecting median (or
> > any other function in the module) to work with some arbitrary other
> > type. Practicality beats purity -- floats WILL be widely used in the
> > module, they should be accommodated.
> >
> > Here is the text in the docs:
> > """
> > This module provides functions for calculating mathematical statistics
> > of numeric (Real-valued) data.
> > <snip>
> > Unless explicitly noted, these functions support int, float, Decimal
> > and Fraction. Behaviour with other types (whether in the numeric tower
> > or not) is currently unsupported.
> > """
> >
> > So this is pretty clear - the module is designed to work with int,
> > float, Decimal and Fraction -- so I think it's worth some effort to
> > well-support those. And both float and Decimal have a NaN of some sort.
>
> But the documentation that you reference says it works with NUMBERS, and
> NaNs are explicitly NOT A NUMBER, so the statistics module specifically
> hasn't made a claim that it will work with them.
>
> Also, the general section says "unless explicitly noted", and median
> makes an explicit reference to types that support order but not addition
> needing to use a different function, which implies that a data type that
> IS ordered and supports addition is usable with median (presumably some
> user defined number class that acts close enough to other number classes
> that the average of a and b is a value between them).
>
> >
> >     The Median function also implies that its inputs have the property of
> >     having an order, as well as being able to be added (and if you
> >     can't add
> >     them, then you need to use median_lower or median_upper)
> >
> >
> > sure, but those are expected of the types above anyway. It doesn't
> > seem to me that this package is designed to work with ANY type that
> > has order and supports the operations needed by the function. For
> > instance, lists of numbers can be compared, so:
> >
> > In [69]: stuff = [[1,2,3],
> >     ...:          [4,6,1],
> >     ...:          [8,9],
> >     ...:          [4,5,6,7,8,9],
> >     ...:          [5.0]
> >     ...:          ]
> >
> > In [70]: statistics.median(stuff)
> > Out[70]: [4, 6, 1]
> Actually, since lists don't support addition in the manner requested,
> median isn't appropriate, but perhaps it has meaning with
> median_lower(). For example, if the lists represented hierarchical
> references in a document (Chapter 1, Section 2, Paragraph 3) then the
> median might have a meaning.
> > The fact that that worked is just a coincidence, it is not an
> > important part of the design that it does.
> >
> >     I will also point out that really the median function is a small
> >     part of
> >     the problem, all the x-tile based functions have the same issue,
> >
> >
> > absolutely, and pretty much all of them, though many will just give
> > you NaN as a result, which is much better than an arbitrary result.
> By x-tile based functions I mean things like the median (the 50th
> percentile) and the lower and upper quartiles (the 25th and 75th
> percentiles); the one provided in the module is quantiles (which can
> actually compute any even grouping).
> >
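
For reference, a small sketch of the quantiles function mentioned above
(available in the statistics module since Python 3.8). The sample data is
made up for illustration:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]

# n=4 asks for quartiles: three cut points splitting the data into 4 groups.
quartiles = statistics.quantiles(data, n=4)

# Any other even grouping works the same way, e.g. deciles (9 cut points).
deciles = statistics.quantiles(data, n=10)

# The middle quartile cut point coincides with the median for this data.
middle = statistics.median(data)
```

Like median, quantiles sorts its input first, so it inherits whatever
sorting does with NaNs.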
> >     and
> >     fundamentally it is a problem with sorted().
> >
> >
> > I don't think so. In fact, sorted() is explicitly designed to work
> > with any type that is consistently ordered (I'm not sure if they HAVE
> > to be "total ordered"), and floats with NaNs are not that.
> Most sort routines require that the data at least define a partial
> order, and effectively define a total order by considering that if
> a < b is false and b < a is false then a and b form an equivalence class
> whose internal order we don't care about. Sets, for instance, with <
> meaning "subset of", have a consistent ordering but don't form a
> consistent equivalence class and thus don't sort properly.
> >
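
The point about sets can be illustrated with a minimal sketch (the three
sets are arbitrary examples). "Neither is less than the other" is not a
transitive relation for sets, so it cannot act as an equivalence class:

```python
a, b, c = {1}, {2}, {1, 3}

# a and b are incomparable: neither is a proper subset of the other.
a_b_incomparable = not (a < b) and not (b < a)

# b and c are incomparable as well.
b_c_incomparable = not (b < c) and not (c < b)

# Yet a and c ARE ordered, so "incomparable" does not chain through b.
a_below_c = a < c
```

A sort that treats incomparable items as equal can therefore place a and c
in either order, depending on where b happens to sit.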
> > As the statistics module is specifically designed to work with numbers
> > (and essentially only with numbers) it's the appropriate place to put
> > an accommodation for NaNs. Not to mention that while sorted() could be
> > adapted to do something more consistent with NaNs, since it is a
> > general purpose function, it's hard to know what the behavior should
> > be -- raise an Exception? remove them? put them all at the front or
> > back? Which makes sense depends entirely on the application. The
> > statistics module, on the other hand, is for statistics, so while
> > there are still multiple options, it's a lot easier to pick one and go
> > for it.
> >
> >     Has anyone tried to implement a version of these that checks for
> >     inputs that don't have a total order and seen what the performance
> >     impact is?
> >
> >
> > I think checking for total order is a red herring -- there really is a
> > reason to specifically deal with NaNs in floats (and decimals), not
> > ANY type that may not be total ordered.
>
> Part of the problem is that fixing one issue doesn't come close to
> fixing the whole problem. The stated problem is that newbie/casual
> programmers get confused that some things don't work when you get NaNs in
> the mix. Why is it ok to say that [3, 1, 4, nan, 2] is a 'sorted' array,
> but that 4 can't be the median of that array?
>
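
The underlying behaviour can be sketched as follows. Only the comparison
results are guaranteed; where sorted() ends up placing the NaN depends on
the input order and on the sort implementation, so no particular output is
claimed here:

```python
nan = float("nan")

# Every ordering or equality comparison involving NaN is False.
comparisons = [nan < 4, 4 < nan, nan <= 4, 4 <= nan, nan == nan]

# The same values with NaN in two different positions.
sorted_a = sorted([3, 1, 4, nan, 2])
sorted_b = sorted([nan, 1, 2, 3, 4])

# Both results are permutations of their input, but nothing guarantees
# that they agree with each other or that the non-NaN values are in order.
```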
> I am not saying that we can't fix the problem (though I think I have
> made a reasonable argument that it isn't a problem that MUST be fixed),
> but more that a change in just median is the wrong spot to make this
> fix. The real problem is that naive programmers, when they do
> something wrong and get NaNs into their data, get confused by some of
> the strange answers that they can get. The fact that NaNs don't order is
> one of the points of confusion, and rather than try to fix one by one
> the various operations that are defined for sorted data to handle the
> issue, why not go to the core issue and deal with the base sorting
> operations? Perhaps even better would be a math mode (perhaps
> unfortunately needing to be the default, since we are trying to help
> beginners) that isn't fully IEEE compliant but throws exceptions on the
> errors that get us into the territory that causes the confusion. This
> might actually not impact that many programs in real life, as how many
> programs actually need to be generating NaNs as a result of calculations?
>
> >
> >     Testing for NaNs isn't trivial, as elsewhere it was pointed out
> >     that how
> >     you check is based on the type of the number you have (Decimals being
> >     different from floats).
> >
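
A minimal sketch of a type-aware NaN test for the four documented types.
The helper name is_nan is hypothetical, not something the statistics module
provides:

```python
import math
from decimal import Decimal
from fractions import Fraction


def is_nan(x):
    """Hypothetical helper: True if x is a float or Decimal NaN."""
    if isinstance(x, Decimal):
        # Decimal.is_nan() covers both quiet and signaling NaNs;
        # math.isnan() would raise on a signaling NaN.
        return x.is_nan()
    if isinstance(x, float):
        return math.isnan(x)
    # int and Fraction have no NaN value.
    return False


checks = [is_nan(v) for v in (float("nan"), Decimal("NaN"), Decimal("sNaN"),
                              1, 2.5, Fraction(1, 2), Decimal("1.5"))]
```

The first three entries of checks are true and the rest are false.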
> >
> > yes, that is unfortunate.
> >
> >     To be really complete, you want to actually
> >     detect that you have some elements that don't form a total order with
> >     the other elements.
> >
> >
> > as above, being "complete" isn't necessary here. And even if you were
> > complete, what would you DO with a general non-total-ordered type?
> > Raising an Exception would be OK, but any other option (like treat it
> > as a missing value) wouldn't make sense if you don't know more about
> > the type.
> The only answer that I see that makes sense is Raising an Exception
> (likely in sorted). You also probably can't be totally thorough in
> checking, as completely checking transitivity would be an N**3
> operation, which is way too slow. Likely you would live with just testing
> that each pair you check has exactly one of a < b, a == b, b < a
> being true.
> >
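
The pair test described above could look roughly like this. It is a sketch
only: the function name is hypothetical, and for simplicity it checks every
pair (N**2 comparisons) rather than only the pairs a sort would compare:

```python
from itertools import combinations


def find_unordered_pair(items):
    """Return the first pair violating trichotomy, or None if all pass."""
    for a, b in combinations(items, 2):
        # Exactly one of a < b, a == b, b < a must hold.
        if (a < b) + (a == b) + (b < a) != 1:
            return (a, b)
    return None


ok = find_unordered_pair([3, 1, 4, 2])
bad_float = find_unordered_pair([3, 1, float("nan")])
bad_sets = find_unordered_pair([{1}, {2}])
```

A caller could raise an exception whenever the result is not None.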
> >     In many ways, half fixing the issue makes it worse, as improving
> >     reliability can lead you to forget that you need to take care, so the
> >     remaining issues catch you harder.
> >
> >
> > I can't think of an example of this -- it's a fine principle, but if
> > we were to make the entire statistics module do something reasonable
> > with NaNs -- what other issues exactly would that hide?
> >
> > In short: practicality beats purity:
> >
> > - The fact that the module is designed to work with all the standard
> > number types doesn't mean it has to work with ANY type that supports
> > the operations needed for a given function.
> >
> > - NaNs are part of the Python float and Decimal implementations -- so
> > they WILL show up once in a while. It would be good to handle them.
> >
> > - NaNs can also be very helpful to indicate missing values -- this
> > can actually be very handy for statistics calculations. So it could be
> > a nice feature to add -- that NaN means "missing value"
>
> ASSUMING that NaNs represent missing data is just ONE possible
> interpretation for it. Making it THE interpretation in a low level
> package seems wrong. It also is an interpretation that is easy to create
> with a simple helper function that takes one sequence and returns
> another one where all the NaNs are removed. Other interpretations of
> what a NaN should reflect aren't as easy to implement outside the
> operation, so if you are going to pick an interpretation, it probably
> should be something else that is hard to handle externally.
>
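
The "simple helper function" mentioned above might be sketched like this.
The name drop_nans is hypothetical and the data is made up:

```python
import math
import statistics
from decimal import Decimal


def drop_nans(data):
    """Hypothetical helper: return a list without float/Decimal NaNs."""
    def is_nan(x):
        if isinstance(x, Decimal):
            return x.is_nan()
        return isinstance(x, float) and math.isnan(x)

    return [x for x in data if not is_nan(x)]


# Treat NaN as "missing" outside the statistics call itself.
cleaned = drop_nans([3, 1, 4, float("nan"), 2])
result = statistics.median(cleaned)
```

With the NaN removed, the remaining values 1, 2, 3, 4 have median 2.5.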
> Also, in Python, because it is dynamically typed, it would be better
> in my mind to use something like None to indicate missing data in the
> array. NaN was chosen in some languages as a missing value because they
> couldn't handle mixed type data arrays.
>
> >
> > -CHB
> >
> >
> > --
> > Christopher Barker, PhD
>
> --
> Richard Damon

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ETJNQQMHO25SSWX4HBPKHAJAQK4CVWSV/
Code of Conduct: http://python.org/psf/codeofconduct/
