On Fri, 26 Mar 2010 07:31:17 -0700, Steve Howell wrote:

> From a purely academic standpoint, I'm not convinced that sum() is
> inefficient in terms of big-O complexity, though.
>
> show...@showell-laptop:~$ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on
> linux2
> >>> class StupidList:
[...]
But it's not *sum* that is inefficient, it is sum *with a particular
data structure*. Sure, if you create your own data structure, you can
make it as efficient as you like. Obviously the sum algorithm itself
has to perform one addition per item, or O(N) additions in total,
which scales tolerably well. But each addition has a cost. If that
cost is constant, then sum() as a whole remains O(N). But if the cost
of each addition grows with the size of the accumulated result, sum()
degrades badly: an O(N) cost per addition makes sum() O(N**2) overall.

We can compare the performance of sum with different data structures,
starting with plain integers versus long integers:

>>> from timeit import Timer
>>> setup = 'data = [%d]*%d'
>>> for i in range(6):
...     t1 = Timer('sum(data, 0)', setup % (1, 10**i))
...     t2 = Timer('sum(data, 0)', setup % (10**50, 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00179290771484 0.00263810157776
0.00340414047241 0.00854396820068
0.0190401077271 0.0502791404724
0.155302047729 0.645124912262
0.794432878494 2.55748295784
7.97877693176 25.3812758923

Both scale roughly linearly, but the constant factor differs
significantly: arithmetic on very large longints is expensive.

Next, strings (with a trick to fool sum into accepting them) versus
lists. Note that I changed the number of loop iterations from 6 down
to 5. The reason why will become obvious:

>>> class EmptyStringStarter:
...     def __add__(self, other):
...         return other
...
>>> empty = EmptyStringStarter()
>>> setup = """from __main__ import empty; data = [%r]*%d"""
>>>
>>> for i in range(5):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('sum(data, [])', setup % ([1], 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00849103927612 0.00226998329163
0.0121459960938 0.0082700252533
0.0489149093628 0.186735153198
0.428920030594 5.28623914719
14.6552250385 589.102822065

Strings perform tolerably well, up to a point, but lists perform
terribly. And in fact, the relatively good performance of strings is
an artifact of recent versions of CPython. In Jython and IronPython,
and in older versions of CPython, they behave as poorly as lists.

> I wonder how severely sum(), without the restriction, would
> underperform join() on modern versions of Python, though.

Take note that, in order to get an answer in reasonable time, I've
reduced the number of timing iterations drastically:

>>> for i in range(6):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=10)),
...     print min(t2.repeat(number=10))
...
8.89301300049e-05 1.09672546387e-05
0.000131845474243 2.19345092773e-05
0.000591993331909 9.29832458496e-05
0.0101289749146 0.00082802772522
0.365957021713 0.00884819030762
24.2072279453 0.0421011447906

Note the performance degradation of sum. It gets worse. Much worse:

>>> for i in range(4, 7):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=1)),  # note fewer iterations
...     print min(t2.repeat(number=1))
...
0.031229019165 0.000817060470581
2.45445990562 0.00365781784058
1024.79705095 0.0398509502411

This is absolutely catastrophic performance degradation.
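For the record, the linear-time tools for these jobs already exist:
str.join for strings, and itertools.chain for flattening lists. Here's
a minimal sketch with toy data (made up for illustration, not the
timing data above):

>>> from itertools import chain
>>> data = [[1, 2], [3, 4], [5]]
>>> sum(data, [])  # re-copies the growing result every addition: O(N**2)
[1, 2, 3, 4, 5]
>>> list(chain.from_iterable(data))  # visits each item exactly once: O(N)
[1, 2, 3, 4, 5]
>>> data = ['spam', 'ham', 'eggs']
>>> ''.join(data)  # measures, allocates once, copies once: O(N)
'spamhameggs'

join and chain.from_iterable never re-copy the accumulated result,
which is exactly the cost that sends sum() quadratic above.
(chain.from_iterable is new in Python 2.6; on older versions,
itertools.chain(*data) does the same job.)

-- 
Steven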