Extrarius <filte...@psychosanity.com> added the comment:

If you are using rdtsc for "timing", are you also executing a serializing instruction immediately beforehand (cpuid being the most commonly used) to ensure that all prior operations have completed? If not, you can get negative "deltas" between two calls to rdtsc, because modern processors reorder instructions and spread them across multiple execution ports, so an rdtsc may execute before instructions that precede it in program order.
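For illustration, a cpuid-then-rdtsc read could look roughly like this in C. This is only a sketch, assuming x86-64 and GCC or Clang inline assembly; the helper name rdtsc_after_cpuid is made up for this example and is not taken from PyPy.

```c
#include <stdint.h>

/* Hypothetical helper (not from PyPy): run cpuid as a serializing
 * instruction, then read the time-stamp counter.
 * Assumes x86-64 and GCC/Clang extended inline assembly. */
static inline uint64_t rdtsc_after_cpuid(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"          /* serializes: earlier instructions finish first */
        "rdtsc"              /* counter is returned in edx:eax */
        : "=a"(lo), "=d"(hi)
        : "0"(0)             /* cpuid leaf 0, passed in eax */
        : "rbx", "rcx", "memory");  /* cpuid also overwrites ebx and ecx */
    return ((uint64_t)hi << 32) | lo;
}
```

Two consecutive calls on the same core should then give a non-negative difference, at the price of one cpuid per reading.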
A big downside of cpuid is that it can be extremely slow inside virtual machines, since it is one of the commonly virtualized instructions. An alternative that is almost as good is to put an lfence immediately before the rdtsc. It does not serialize memory stores, but it does serialize instruction execution: according to the Intel docs, all instructions preceding lfence must complete before lfence will execute. I've tried both in some profiling software I've written. Outside of VMs there was no difference, but inside them the lfence/rdtsc version executed far faster by wall-clock time. Both versions report similar cycle counts, but using cpuid/rdtsc inside a VM made a run take anywhere from 2 to 10 times as long, depending on the size of the profiled code.

----------
nosy: +Extrarius

________________________________________
PyPy bug tracker <trac...@bugs.pypy.org>
<https://bugs.pypy.org/issue900>
________________________________________
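The lfence/rdtsc pairing described in the comment above can be sketched as follows. This is a minimal example, assuming x86-64 with GCC or Clang, where <x86intrin.h> provides the _mm_lfence and __rdtsc intrinsics. The names rdtsc_after_lfence, cycles_for and sample_work are hypothetical and exist only for this illustration; they are not the profiling code mentioned in the comment.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: _mm_lfence (SSE2) and __rdtsc */

/* Hypothetical helper: wait for earlier instructions to complete,
 * then read the time-stamp counter. */
static inline uint64_t rdtsc_after_lfence(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical usage: TSC ticks elapsed while fn(arg) runs. */
static inline uint64_t cycles_for(void (*fn)(void *), void *arg)
{
    uint64_t start = rdtsc_after_lfence();
    fn(arg);
    uint64_t end = rdtsc_after_lfence();  /* fence again so fn has finished */
    return end - start;
}

/* Hypothetical workload: increment a counter 1000 times through a
 * volatile pointer so the compiler keeps the loop. */
static inline void sample_work(void *arg)
{
    volatile unsigned *counter = arg;
    for (unsigned i = 0; i < 1000; i++)
        (*counter)++;
}
```

A call such as cycles_for(sample_work, &n) with an unsigned n returns the tick count for the workload. As the comment notes, lfence does not order stores, so this gives a weaker guarantee than cpuid.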