On Thu, 15 Sep 2016 08:02:10 -0700
Raymond Hettinger <raymond.hettin...@gmail.com> wrote:
>
> Eric is correct on this one. The consecutive hashes make a huge difference
> for Python 3.5. While there is a full table scan, the check for NULL
> entries becomes a predictable branch when all the keys are in consecutive
> positions. There is an astonishingly well written Stack Overflow post that
> explains this effect clearly: http://stackoverflow.com/questions/11227809
>
> With normal randomized keys, the Python 3.6 loop is dramatically better than
> Python 3.5: [...]
You're jumping to conclusions. While there is a difference, there is no
evidence that the difference is due to better branch prediction. Let's do a
quick back-of-the-envelope calculation to show that it can't be attributed
mostly to branch prediction:

> ~/py36 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
> 100 loops, best of 3: 12.3 msec per loop
> ~/py35 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
> 10 loops, best of 3: 54.7 msec per loop

For 10**6 elements, this is a 42 ns difference per dict element. A 2.6 GHz
Haswell doesn't stall for 42 ns when there's a branch mispredict. According
to the Internet, the branch mispredict penalty for a Haswell CPU is 15
cycles, which is 5.8 ns at 2.6 GHz, far from the observed 42 ns.

42 ns, however, is consistent with another possible effect: a main memory
access following a last-level cache miss. And indeed, Serhiy showed that
this micro-benchmark is actually dependent on insertion order *on Python
3.6*:

$ ./python -m timeit -s "l = [str(i) for i in range(10**6)]; d=dict.fromkeys(l)" "list(d)"
-> 100 loops, best of 3: 20 msec per loop
$ ./python -m timeit -s "import random; l = [str(i) for i in range(10**6)]; random.shuffle(l); d=dict.fromkeys(l)" "list(d)"
-> 10 loops, best of 3: 55.8 msec per loop

The only way the table scan (without NULL checks, since this is Python 3.6)
can depend on insertion order is that iterating over the table elements
needs to INCREF each element: this benchmark doesn't only scan the table in
a nice prefetcher-friendly linear sequence, it also accesses object headers
at arbitrary places in Python's heap memory. Since this micro-benchmark
creates the keys in order just before filling the dict with them,
randomizing the insertion order destroys the temporal locality of the
object header accesses made while iterating over the dict keys. *This*
looks like the right explanation, not branch mispredicts due to NULL
checks.
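For what it's worth, the back-of-the-envelope arithmetic above can be
written out explicitly. This sketch just redoes the calculation from the
measured numbers; the 2.6 GHz clock and 15-cycle mispredict penalty are the
same assumed Haswell figures as above:

```python
# Per-element time difference between the two measured runs.
elements = 10**6
t35 = 54.7e-3   # seconds per list(d) call on Python 3.5
t36 = 12.3e-3   # seconds per list(d) call on Python 3.6
per_element_ns = (t35 - t36) / elements * 1e9
print(f"difference per element: {per_element_ns:.1f} ns")  # -> 42.4 ns

# Cost of one branch mispredict on the assumed machine.
cpu_hz = 2.6e9          # assumed 2.6 GHz Haswell
mispredict_cycles = 15  # commonly cited Haswell mispredict penalty
mispredict_ns = mispredict_cycles / cpu_hz * 1e9
print(f"mispredict penalty:     {mispredict_ns:.2f} ns")   # -> 5.77 ns
```

So even one full mispredict per element would explain only about a seventh
of the observed gap, which is why a memory-latency explanation fits better.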
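Serhiy's insertion-order experiment can also be reproduced from within
Python using the timeit module instead of the CLI. This is a sketch, not
the original commands: it shrinks the dict to 10**5 keys to keep runtime
modest, so absolute timings (and the size of the gap) will differ from the
figures above and vary by machine:

```python
import random
import timeit

def time_list_of_dict(keys, repeat=5, number=10):
    """Best time for one list(d) call, d built by inserting keys in order."""
    d = dict.fromkeys(keys)
    return min(timeit.repeat(lambda: list(d), repeat=repeat, number=number)) / number

# Keys created in order; object headers end up roughly sequential in the heap.
ordered = [str(i) for i in range(10**5)]
# Same keys, inserted in shuffled order: table scan order no longer matches
# object allocation order, destroying temporal locality of INCREF accesses.
shuffled = ordered[:]
random.shuffle(shuffled)

t_ordered = time_list_of_dict(ordered)
t_shuffled = time_list_of_dict(shuffled)
print(f"creation-order insertion: {t_ordered:.6f}s per list(d)")
print(f"shuffled insertion:       {t_shuffled:.6f}s per list(d)")
```

On a 3.6-style compact dict, the shuffled variant should come out slower,
since iteration order follows insertion order rather than key creation
order.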
This also shows that a micro-benchmark that merely looks OK can actually be
a terrible proxy for actual performance. As a further validation of this
theory, let's dramatically shrink the working set of the initial benchmark:

$ ./python -m timeit -s "d=dict.fromkeys(map(str,range(10**3)))" "list(d)"
-> Python 3.5: 100000 loops, best of 3: 10.9 usec per loop
-> Python 3.6: 100000 loops, best of 3: 9.72 usec per loop

When the working set fits in the cache, this micro-benchmark is only 12%
slower on 3.5 compared to 3.6. *This* much smaller difference (a mere
1.2 ns per dict element) could be attributed to eliminating the NULL
checks, or to any other streamlining of the core iteration logic.

Regards

Antoine.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com