This is sophistry. NaN is an instance of the abstract type numbers.Number and the concrete type float. IEEE-754 defines NaNs as a collection of required values in any floating-point type.
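The type claim is easy to check at the prompt. A minimal sketch using only the standard library (the variable name `nan` is just for illustration):

```python
import math
import numbers
from decimal import Decimal

nan = float("nan")

# NaN is an ordinary float object, and float is part of the numeric tower.
print(type(nan) is float)                # True
print(isinstance(nan, numbers.Number))   # True
print(isinstance(nan, numbers.Real))     # True

# Decimal carries its own NaN and is registered as a Number too.
print(isinstance(Decimal("NaN"), numbers.Number))  # True
print(math.isnan(nan))                             # True
```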
I know the acronym suggests otherwise in a too-cute way, but NaN is archetypically a number in a computer science sense (but not in a pure math way, of course). Likewise, YAML (YAML Ain't Markup Language) is a markup language. And GNU (GNU's Not Unix) is a Unix system.

On Sat, Dec 28, 2019, 8:30 PM Richard Damon <rich...@damon-family.org> wrote:

> On 12/28/19 1:14 AM, Christopher Barker wrote:
> > On Fri, Dec 27, 2019 at 8:14 PM Richard Damon
> > <rich...@damon-family.org> wrote:
> >
> > > It is a well known axiom of computing that returning an *incorrect*
> > > result is a very bad thing.
> >
> > There is also an axiom that you can only expect valid results if you
> > meet the operation's preconditions.
> >
> > sure.
> >
> > Sometimes, being totally defensive in checking for 'bad' inputs costs
> > you too much performance.
> >
> > it can, yes, there are no hard rules about anything.
> >
> > The stated requirement on the statistics module is you feed it
> > 'numbers', and a NaN is by definition Not a Number.
> >
> > Sure, but NaN IS a part of the Python float type, and it can and will
> > show up once in a while. That is not the same as expecting median (or
> > any other function in the module) to work with some arbitrary other
> > type. Practicality beats purity -- floats WILL be widely used in the
> > module, they should be accommodated.
> >
> > Here is the text in the docs:
> > """
> > This module provides functions for calculating mathematical statistics
> > of numeric (Real-valued) data.
> > <snip>
> > Unless explicitly noted, these functions support int, float, Decimal
> > and Fraction. Behaviour with other types (whether in the numeric tower
> > or not) is currently unsupported.
> > """
> >
> > So this is pretty clear - the module is designed to work with int,
> > float, Decimal and Fraction -- so I think it's worth some effort to
> > well-support those. And both float and Decimal have a NaN of some sort.
>
> But the documentation that you reference says it works with NUMBERS, and
> NaNs are explicitly NOT A NUMBER, so the statistics module specifically
> hasn't made a claim that it will work with them.
>
> Also, the general section says "unless explicitly noted", and median
> makes an explicit reference to types that support order but not addition
> needing to use a different function, which implies that a data type that
> IS ordered and supports addition is usable with median (presumably some
> user-defined number class that acts close enough to other number classes
> that the average of a and b is a value between them).
>
> >
> > The median function also implies that its inputs have the property of
> > having an order, as well as being able to be added (and if you
> > can't add them, then you need to use median_low or median_high)
> >
> > sure, but those are expected of the types above anyway. It doesn't
> > seem to me that this package is designed to work with ANY type that
> > has order and supports the operations needed by the function. For
> > instance, lists of numbers can be compared, so:
> >
> > In [69]: stuff = [[1,2,3],
> >     ...:          [4,6,1],
> >     ...:          [8,9],
> >     ...:          [4,5,6,7,8,9],
> >     ...:          [5.0]
> >     ...:          ]
> >
> > In [70]: statistics.median(stuff)
> > Out[70]: [4, 6, 1]
> Actually, since lists don't support addition in the manner requested,
> median isn't appropriate, but perhaps it has meaning with
> median_low(). For example, if the lists represented hierarchical
> references in a document (Chapter 1, Section 2, Paragraph 3) then the
> median might have a meaning.
> > The fact that that worked is just a coincidence; it is not an
> > important part of the design that it does.
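To make the "coincidence" concrete, here is a minimal sketch, assuming CPython's statistics module, where median() returns the middle element for an odd count and averages the two middle elements for an even count. The names `odd_result`, `even_failed` and `low_result` are only for illustration.

```python
import statistics

stuff = [[1, 2, 3], [4, 6, 1], [8, 9], [4, 5, 6, 7, 8, 9], [5.0]]

# Odd length: median() only sorts and picks the middle element, so it "works".
odd_result = statistics.median(stuff)

# Even length: median() has to average the two middle elements,
# and lists cannot be averaged.
try:
    statistics.median(stuff[:4])
    even_failed = False
except TypeError:
    even_failed = True

# median_low() never averages, so it only needs the ordering.
low_result = statistics.median_low(stuff[:4])

print(odd_result)   # [4, 6, 1]
print(even_failed)  # True
print(low_result)   # [4, 5, 6, 7, 8, 9]
```

So the list case only works when the element count happens to be odd, which fits the reading that median() needs both order and addition while median_low() needs order alone.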
> >
> > I will also point out that really the median function is a small
> > part of the problem, all the x-tile based functions have the same
> > issue,
> >
> > absolutely, and pretty much all of them, though many will just give
> > you NaN as a result, which is much better than an arbitrary result.
> The x-tile based functions include the median (50th percentile) and the
> lower and upper quartiles (25th and 75th percentiles); the one provided
> in the module is quantiles() (which can actually compute any even
> grouping).
> >
> > and
> > fundamentally it is a problem with sorted().
> >
> > I don't think so. In fact, sorted() is explicitly designed to work
> > with any type that is consistently ordered (I'm not sure if they HAVE
> > to be "total ordered"), and floats with NaNs are not that.
> Most sort routines require that the data at least define a partial
> order, and effectively define a total order by considering that if
> a < b is false and b < a is false, then a and b form an equivalence
> class within which we don't care about order. Sets, for instance, with
> < meaning "is a proper subset of", have a consistent ordering but don't
> form a consistent equivalence class and thus don't sort properly.
> >
> > As the statistics module is specifically designed to work with numbers
> > (and essentially only with numbers) it's the appropriate place to put
> > an accommodation for NaNs. Not to mention that while sorted() could be
> > adapted to do something more consistent with NaNs, since it is a
> > general purpose function, it's hard to know what the behavior should
> > be -- raise an Exception? remove them? put them all at the front or
> > back? Which makes sense depends entirely on the application. The
> > statistics module, on the other hand, is for statistics, so while
> > there are still multiple options, it's a lot easier to pick one and go
> > for it.
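The "neither a < b nor b < a" situation described above can be seen directly. A minimal sketch; the names `nan_checks` and `set_checks` are only for illustration, and the sorted() output is deliberately left unannotated because it depends on where the NaN sits in the input.

```python
nan = float("nan")

# Every ordering comparison involving NaN is False,
# and NaN is not even equal to itself.
nan_checks = [nan < 1.0, 1.0 < nan, nan == 1.0, nan == nan]

# Disjoint sets behave similarly under "<" (proper subset):
# neither is smaller than the other, and they are not equal.
a, b = {1}, {2}
set_checks = [a < b, b < a, a == b]

print(nan_checks)  # [False, False, False, False]
print(set_checks)  # [False, False, False]

# sorted() does not raise here; it simply trusts "<",
# so the result can vary with the position of the NaN.
print(sorted([3, 1, 4, nan, 2]))
print(sorted([nan, 3, 1, 4, 2]))
```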
> >
> > Has anyone tried to implement a version of these that checks for
> > inputs that don't have a total order and seen what the performance
> > impact is?
> >
> > I think checking for total order is a red herring -- there really is a
> > reason to specifically deal with NaNs in floats (and decimals), not
> > ANY type that may not be total ordered.
>
> Part of the problem is that fixing one issue doesn't come close to
> fixing the whole problem. The stated problem is that newbie/casual
> programmers get confused that some things don't work when you get NaNs
> in the mix. Why is it ok to say that [3, 1, 4, nan, 2] is a 'sorted'
> array, but that 4 can't be the median of that array?
>
> I am not saying that we can't fix the problem (though I think I have
> made a reasonable argument that it isn't a problem that MUST be fixed),
> but more that a change in just median is the wrong spot to make this
> fix. The real problem is that naive programmers, when they do
> something wrong and get NaNs into their data, get confused by some of
> the strange answers that they can get. The fact that NaNs don't order is
> one of the points of confusion, and rather than try to fix one by one
> the various operations that are defined for sorted data to handle the
> issue, why not go to the core issue and deal with the base sorting
> operations? Perhaps even better would be a math mode (perhaps
> unfortunately needing to be the default, since we are trying to help
> beginners) that isn't fully IEEE compliant but throws exceptions on the
> errors that get us into the territory that causes the confusion. This
> might actually not impact that many programs in real life, as how many
> programs actually need to be generating NaNs as a result of
> calculations?
>
> >
> > Testing for NaNs isn't trivial, as elsewhere it was pointed out
> > that how you check is based on the type of the number you have
> > (Decimals being different from floats).
> >
> > yes, that is unfortunate.
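For the four types the docs name, a type-aware check might look like the sketch below. `is_nan` is a hypothetical helper, not something the statistics module provides; it relies only on `math.isnan` and `Decimal.is_nan`, both of which exist in the standard library.

```python
import math
from decimal import Decimal
from fractions import Fraction


def is_nan(x):
    """Return True if x is a float or Decimal NaN (hypothetical helper)."""
    if isinstance(x, Decimal):
        return x.is_nan()      # True for both quiet and signaling NaN
    if isinstance(x, float):
        return math.isnan(x)
    return False               # int and Fraction have no NaN value


values = [1, 2.5, float("nan"), Decimal("NaN"), Decimal("sNaN"), Fraction(1, 3)]
print([is_nan(v) for v in values])
# [False, False, True, True, True, False]
```

Dispatching on the type avoids converting a Decimal to float just to test it, which is one reason the check differs between the two types.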
> >
> > To be really complete, you want to actually
> > detect that you have some elements that don't form a total order with
> > the other elements.
> >
> > as above, being "complete" isn't necessary here. And even if you were
> > complete, what would you DO with a general non-total-ordered type?
> > Raising an Exception would be OK, but any other option (like treat it
> > as a missing value) wouldn't make sense if you don't know more about
> > the type.
> The only answer that I see that makes sense is raising an Exception
> (likely in sorted). You also probably can't be totally thorough in
> checking, as completely checking transitivity would be an N**3
> operation, which is way too slow. Likely you would live with just
> testing that each pair you check has exactly one of a < b, a == b,
> b < a being true.
> >
> > In many ways, half fixing the issue makes it worse, as improving
> > reliability can lead you to forget that you need to take care, so the
> > remaining issues catch you harder.
> >
> > I can't think of an example of this -- it's a fine principle, but if
> > we were to make the entire statistics module do something reasonable
> > with NaNs -- what other issues exactly would that hide?
> >
> > In short: practicality beats purity:
> >
> > - The fact that the module is designed to work with all the standard
> > number types doesn't mean it has to work with ANY type that supports
> > the operations needed for a given function.
> >
> > - NaNs are part of the Python float and Decimal implementations -- so
> > they WILL show up once in a while. It would be good to handle them.
> >
> > - NaNs can also be very helpful to indicate missing values -- this
> > can actually be very handy for statistics calculations. So it could be
> > a nice feature to add -- that NaN means "missing value"
>
> ASSUMING that NaNs represent missing data is just ONE possible
> interpretation for it. Making it THE interpretation in a low-level
> package seems wrong.
> It also is an interpretation that is easy to create
> with a simple helper function that takes one sequence and returns
> another one where all the NaNs are removed. Other interpretations of
> what a NaN should reflect aren't as easy to implement outside the
> operation, so if you are going to pick an interpretation, it probably
> should be something else that is hard to handle externally.
>
> Also, because Python is dynamically typed, it would be better, in my
> mind, to use something like None to indicate missing data in the
> array. NaN was chosen in some languages as a missing value because they
> couldn't handle mixed-type data arrays.
>
> >
> > -CHB
> >
> > --
> > Christopher Barker, PhD
>
> --
> Richard Damon
> _______________________________________________
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/VRDAF4HV4GSTTHSK7NM5KOLRB3QPOO72/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
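The "simple helper function" mentioned in the quoted message could be as small as the sketch below. `drop_nans` is a hypothetical name, and this version only recognises float NaNs; Decimal NaNs would need `Decimal.is_nan()` as well.

```python
import math
import statistics


def drop_nans(values):
    """Return a new list without float NaNs (hypothetical helper)."""
    return [v for v in values
            if not (isinstance(v, float) and math.isnan(v))]


data = [3, 1, 4, float("nan"), 2]
clean = drop_nans(data)

print(clean)                     # [3, 1, 4, 2]
print(statistics.median(clean))  # 2.5
```

Filtering before the call keeps the "NaN means missing" interpretation in the caller's hands rather than in the library.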
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/ETJNQQMHO25SSWX4HBPKHAJAQK4CVWSV/
Code of Conduct: http://python.org/psf/codeofconduct/