On Fri, Feb 09, 2018 at 11:17:35AM -0800, Linus Torvalds wrote:
> Yeah, it's only true on the very latest uarchs, and even there it's
> not perfect for small copies.
> On the older machines that are relevant for 32-bit code, it's often
> tens of cycles just for the ucode overhead, I think, and "rep movsb"
> actually does things literally a byte at a time.
Ugh, okay. So I'll switch to movsl; that should at least perform on par
with the chain of 'pushl' instructions I had before.
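
For illustration only, here is a rough sketch of what a word-at-a-time
"rep movsl" copy looks like, written as user-space C with GNU inline asm
rather than as the actual entry code. The helper names copy_words and
copy_words_selftest, the 5-word buffer, and its contents are all made up
for this example, and the non-x86 fallback loop is only there so the
sketch builds elsewhere. It assumes the direction flag is clear, which
the usual calling conventions guarantee on function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy nwords 32-bit words from src to dst.
 * On x86 this uses "rep movsl", which moves 4 bytes per iteration
 * (as opposed to "rep movsb", which moves 1 byte per iteration).
 */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
#if defined(__i386__) || defined(__x86_64__)
	/* D = destination index, S = source index, c = count register */
	__asm__ volatile("rep movsl"
			 : "+D" (dst), "+S" (src), "+c" (nwords)
			 :
			 : "memory");
#else
	/* Portable fallback so the sketch also builds on other arches */
	while (nwords--)
		*dst++ = *src++;
#endif
}

/* Copy a made-up 5-word buffer and check that it arrives intact. */
static void copy_words_selftest(void)
{
	const uint32_t frame[5] = { 0x10, 0x20, 0x30, 0x40, 0x50 };
	uint32_t out[5] = { 0 };
	size_t i;

	copy_words(out, frame, 5);
	for (i = 0; i < 5; i++)
		assert(out[i] == frame[i]);
}
```

Whether this actually matches the pushl chain on older parts would need
to be measured; the sketch only shows the shape of the copy.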