(Summary of results: my patch at https://bugs.python.org/issue28685 makes list.sort() 30-50% faster in common cases, and at most 1.5% slower in the uncommon worst case.)
Hello all, You may remember seeing some messages on here about optimizing list.sort() by exploiting type-homogeneity: since comparing apples and oranges is uncommon (though possible, i.e. float to int), it pays off to check if the list is type-homogeneous (as well as homogeneous with respect to some other properties), and, if it is, to replace calls to PyObject_RichCompareBool with calls to ob_type->tp_richcompare (or, in common special cases, to optimized compare functions). The entire patch is less than 250 lines of code, most of which is pretty boilerplate (i.e. a lot of assertions in #ifdef Py_DEBUG blocks, etc). I originally wrote that patch back in November. I've learned a lot since then, both about CPython and about mailing list etiquette :). Previous discussion about this can be found at https://mail.python.org/pipermail/python-dev/2016-October/146648.html and https://mail.python.org/pipermail/python-ideas/2016-October/042807.html. Anyway, I recently redid the benchmarks much more rigorously (in preparation for presenting this project at my school's science fair), achieving a standard deviation of less than 0.5% of the mean for all measurements. The exact benchmark script used can be found at https://github.com/embg/python-fastsort-benchmark (it's just sorting random lists of/lists of tuples of [type]. While listsort.txt talks about benchmarking different kinds of structured lists, instead of just random lists, the results here would hold in those cases just as well, because this makes individual comparisons cheaper, instead of reducing the number of comparisons based on structure). I also made a poster describing the optimization and including a pretty graph displaying the benchmark data: https://github.com/embg/python-fastsort-benchmark/blob/master/poster.pdf. For those who would rather read the results here (though it is a *really* pretty graph): *** Percent improvement for sorting random lists of [type] (1-patched/unpatched): float: 48% bounded int (magnitude smaller than 2^32): 48.4% latin string (all characters in [0,255]): 32.7% general int (reasonably uncommon?): 17.2% general string (reasonably uncommon?): 9.2% tuples of float: 63.2% tuples of bounded int: 64.8% tuples of latin string: 55.8% tuples of general int: 50.3% tuples of general string: 44.1% tuples of heterogeneous: 41.5% heterogeneous (lots of float with an int at the end; worst-case): -1.5% *** Essentially, it's a gamble where the payoff is 20-30 times greater than the cost, and the odds of losing are very small. Sorting is perhaps not a bottleneck in most applications, but considering how much work has gone into Python's sort (Timsort, etc; half of listobject.c is sort code), I think it's interesting that list.sort() can be made essentially twice faster by a relatively simple optimization. I would also add that Python dictionaries already implement this optimization: they start out optimizing based on the assumption that they'll only be seeing string keys, checking to make sure that assumption holds as they go. If they see a non-string key, they permanently switch over to the general implementation. So it's really the same idea, except here it doesn't matter as much what type we're dealing with, which is important, because lists are commonly used with lots of different types, as opposed to dictionaries, which overwhelmingly commonly use string keys, especially internally. (Correct me if I'm wrong in any of the above). I got a lot of great feedback/input on this patch as I was writing it, but after submitting it, I didn't hear much from anybody. (The reason I took so long to post was because I wanted to wait until I had the chance to do the benchmarks *right*). What do you all think? Thanks, Elliot
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/