[Python-ideas] Re: Fix statistics.median()?

David Mertz Thu, 26 Dec 2019 13:36:45 -0800

FWIW, here is a timing:

>>> many_nums = [randint(10, 100) for _ in range(1_000_000)]
>>> %timeit statistics.median_low(many_nums)
87.2 ms ± 654 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit median(many_nums)
282 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


I think almost all the slowdown is because `sorted()` is a C function.  In
big-O terms, mine should be an improvement since it does part of a
Quicksort in partitioning elements, but it doesn't actually bother sorting
the smaller partition.  It *does* make one pass through to find the min or
max though.

I'm pretty sure that even in plain-Python, someone smarter than me, like
Tim Peters (or others who watch here), could speed up my version by 2x.


On Thu, Dec 26, 2019 at 4:13 PM David Mertz <[email protected]> wrote:

> Here is an implementation that:
>
> A. Only relies on '<'
> B. Has no special knowledge of NaN
> C. Does not use sorted() or .sort() for anything
> D. Is Pandas-like in choosing among only comparable items
> E. Only picks an element from the original iterator, does not take mean of
> two candidates
>     (I.e. kinda like statistic.median_low() without the NaN bug)
> F. Is barely tested and not-at-all benchmarked.  I'm sure Timsort is way
> faster than this kinda-like-quicksort approach.
>
> >>> even_names = ['Bob', 'Alice', 'Charlie', 'Zack', 'Xavier', 'Mary']
> >>> odd_names = names + ['Francis']
> >>> nums = [2, 3, 4, nan, 1]
> >>> %paste
> def median(it, *, _offset=0):
>     smalls = []
>     bigs = []
>     go_big = False
>     it = iter(it)
>     candidate = next(it)
>     for x in it:
>         if x < candidate:
>             smalls.append(x)
>         elif candidate < x:
>             bigs.append(x)
>         elif candidate == x:
>             if go_big:
>                 bigs.append(x)
>             else:
>                 smalls.append(x)
>             go_big = not go_big
>
>     if abs(len(smalls) - len(bigs) + _offset) <= 1:
>         return candidate
>     elif len(smalls) >= len(bigs):
>         offset = -len(bigs)
>         if bigs:
>             smalls.append(min(bigs))
>         return median(smalls, _offset=offset)
>     else:
>         offset = len(smalls)
>         if smalls:
>             bigs.append(max(smalls))
>         return median(bigs, _offset=offset)
>
> ## -- End pasted text --
> >>> statistics.median_low(odd_names), median(odd_names)
> ('Francis', 'Francis')
> >>> statistics.median_low(even_names), median(even_names)
> ('Charlie', 'Charlie')
> >>> statistics.median_low(nums), median(nums)
> (4, 2)
>
>
> --
> Keeping medicines from the bloodstreams of the sick; food
> from the bellies of the hungry; books from the hands of the
> uneducated; technology from the underdeveloped; and putting
> advocates of freedom in prisons.  Intellectual property is
> to the 21st century what the slave trade was to the 16th.
>


-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/ZNM2Q3FIG23EPPKI2SLHC6LLRNS7VKFW/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Fix statistics.median()?

Reply via email to