Re: [Python-Dev] PEP 450 adding statistics module

Steven D'Aprano Thu, 15 Aug 2013 06:12:18 -0700

On 15/08/13 21:42, Mark Dickinson wrote:

The PEP and code look generally good to me.


I think the API for median and its variants deserves some wider discussion:
the reference implementation has a callable 'median', and variant callables
'median.low', 'median.high', 'median.grouped'.  The pattern of attaching
the variant callables as attributes on the main callable is unusual, and
isn't something I've seen elsewhere in the standard library.  I'd like to
see some explanation in the PEP for why it's done this way.  (There was
already some discussion of this on the issue, but that was more centered
around the implementation than the API.)

I'd propose two alternatives for this:  either have separate functions
'median', 'median_low', 'median_high', etc., or have a single function
'median' with a "method" argument that takes a string specifying
computation using a particular method.  I don't see a really good reason to
deviate from standard patterns here, and fear that users would find the
current API surprising.


Alexander Belopolsky has convinced me (off-list) that my current implementation 
is better changed to a more conservative one of a callable singleton instance 
with methods implementing the alternative computations. I'll have something 
like:


def _singleton(cls):
    return cls()


@_singleton
class median:
    def __call__(self, data):
        ...
    def low(self, data):
        ...
    ...


In my earlier stats module, I had a single median function that took a argument to choose 
between alternatives. I called it "scheme":

median(data, scheme="low")

R uses parameter called "type" to choose between alternate calculations, not 
for median as we are discussing, but for quantiles:

quantile(x, probs ... type = 7, ...).

SAS also uses a similar system, but with different numeric codes. I rejected both "type" 
and "method" as the parameter name since it would cause confusion with the usual meanings 
of those words. I eventually decided against this system for two reasons:

- Each scheme ended up needing to be a separate function, for ease of both implementation 
and testing. So I had four private median functions, which I put inside a class to act as 
namespace and avoid polluting the main namespace. Then I needed a "master 
function" to select which of the methods should be called, with all the additional 
testing and documentation that entailed.

- The API doesn't really feel very Pythonic to me. For example, we write:

mystring.rjust(width)
dict.items()

rather than mystring.justify(width, "right") or dict.iterate("items"). So I 
think individual methods is a better API, and one which is more familiar to most Python users. The 
only innovation (if that's what it is) is to have median a callable object.


As far as having four separate functions, median, median_low, etc., it just 
doesn't feel right to me. It puts four slight variations of the same function 
into the main namespace, instead of keeping them together in a namespace. Names 
like median_low merely simulates a namespace with pseudo-methods separated with 
underscores instead of dots, only without the advantages of a real namespace.

(I treat variance and std dev differently, and make the sample and population 
forms separate top-level functions rather than methods, simply because they are 
so well-known from scientific calculators that it is unthinkable to me to do 
differently. Whenever I use numpy, I am surprised all over again that it has 
only a single variance function.)



--
Steven
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 450 adding statistics module

Reply via email to