On Fri, Feb 9, 2018 at 11:25 AM, Joerg Roedel <jroe...@suse.de> wrote:
> Ugh, okay. So I switch to movsl, that should at least perform on-par
> with the chain of 'pushl' instructions I had before.
It should generally be roughly in the same ballpark.
I think the instruction scheduling ends up basically breaking around
microcoded instructions, which is why you'll get something like 12+n
cycles for "rep movs" on some uarchs, but at that point it's probably
mostly in the noise compared to all the other nasty PTI things.
You won't see any of the _real_ advantages (which are about moving
cachelines at a time), so with smallish copies you really only see the
downsides of "rep movs", which is mainly that instruction scheduling
hiccup you get with any microcode.
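
For reference, a small "rep movsl" copy of the kind discussed here
might look roughly like the sketch below. This is only an illustration:
the helper name copy_words() is made up, it is not taken from the patch,
and it assumes GCC/Clang extended inline asm on x86 (with a plain loop
as the fallback elsewhere). It also relies on the direction flag being
clear, which the usual calling conventions guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch under discussion):
 * copy 'count' 32-bit words from src to dst with a single
 * "rep movsl". For a handful of words the fixed microcode
 * startup cost dominates, which is the downside described above.
 */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * "rep movsl" takes the destination in (r|e)di, the source in
	 * (r|e)si and the word count in (r|e)cx, and updates all three,
	 * hence the "+" (read-write) constraints. The "memory" clobber
	 * tells the compiler that memory behind the pointers changes.
	 */
	__asm__ volatile("rep movsl"
			 : "+D"(dst), "+S"(src), "+c"(count)
			 :
			 : "memory");
#else
	/* Portable fallback so the sketch still builds on non-x86. */
	while (count--)
		*dst++ = *src++;
#endif
}
```

Calling copy_words(dst, src, 5) copies five 32-bit words, and a count of
zero copies nothing. Whether that beats an unrolled chain of pushes
depends on the uarch, so treat it as something to measure, not assume.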
But with the iret and the cr3 movement, you aren't going to have a
nice well-behaved pipeline anyway.