Extrarius <filte...@psychosanity.com> added the comment:

If you are using rdtsc for "timing", are you also issuing a serializing
instruction immediately beforehand (cpuid being the most commonly used) to
ensure that all prior operations have completed? If not, you can get negative
"deltas" between two calls to rdtsc due to the complexity of modern
architectures (instruction reordering, multiple execution ports, etc.).
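
For illustration, a cpuid-then-rdtsc read might look roughly like the sketch
below. It assumes x86/x86-64 with GCC or Clang inline asm. The names
rdtsc_cpuid, tsc_delta, time_region_cpuid and empty_region are made up for
this example and are not taken from PyPy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: execute cpuid (a serializing instruction) so that
 * earlier instructions have completed, then read the time-stamp counter. */
static inline uint64_t rdtsc_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* cpuid leaf 0; the results are ignored, only the serialization matters. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;

    /* rdtsc returns the low 32 bits in eax and the high 32 bits in edx. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Signed difference, so a "negative delta" is visible as a negative number
 * instead of a huge unsigned one. */
static inline int64_t tsc_delta(uint64_t start, uint64_t end)
{
    return (int64_t)(end - start);
}

/* Placeholder workload for the example. */
static void empty_region(void)
{
}

/* Time one call of region() using serialized reads on both sides. */
static int64_t time_region_cpuid(void (*region)(void))
{
    assert(region != NULL);
    uint64_t start = rdtsc_cpuid();
    region();
    uint64_t end = rdtsc_cpuid();
    return tsc_delta(start, end);
}
```

Calling time_region_cpuid(empty_region) gives a rough idea of the fixed
overhead of the measurement itself.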

A big downside to using cpuid is that it can be extremely slow inside virtual
machines (being one of the commonly virtualized instructions). An alternative
that is almost as good is putting an lfence immediately before the rdtsc. It
doesn't serialize memory stores, but it does more or less serialize instruction
execution: according to the Intel docs, all instructions preceding lfence must
complete before lfence will execute. I've tried both in some profiling software
I've written. Outside of VMs there was no difference, but inside them the
lfence/rdtsc version executed far faster by wall-clock time. Both versions
report similar cycle counts, but using cpuid/rdtsc inside a VM made the run take
anywhere from 2 to 10 times as long, depending on the size of the profiled code.
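
A minimal sketch of the lfence variant, again assuming x86/x86-64 with GCC or
Clang inline asm. The names rdtsc_lfence and min_read_overhead are invented
for this example; this is not the code from my profiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: lfence waits for earlier instructions to complete
 * before the rdtsc that follows it reads the time-stamp counter. */
static inline uint64_t rdtsc_lfence(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("lfence\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest cycle delta seen between two back-to-back fenced reads.
 * Taking the minimum filters out samples disturbed by interrupts or
 * VM exits, so the result approximates the cost of the timer itself. */
static uint64_t min_read_overhead(int samples)
{
    assert(samples > 0);
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < samples; i++) {
        uint64_t start = rdtsc_lfence();
        uint64_t end = rdtsc_lfence();
        uint64_t delta = end - start;
        if (delta < best) {
            best = delta;
        }
    }
    return best;
}
```

Running min_read_overhead() on bare metal and inside a VM, and doing the same
with a cpuid-based read, is one way to check the difference on your own setup.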

----------
nosy: +Extrarius

________________________________________
PyPy bug tracker <trac...@bugs.pypy.org>
<https://bugs.pypy.org/issue900>
________________________________________
_______________________________________________
pypy-issue mailing list
pypy-issue@python.org
https://mail.python.org/mailman/listinfo/pypy-issue
