On Fri, Feb 09, 2018 at 11:17:35AM -0800, Linus Torvalds wrote:
> Yeah, it's only true on the very latest uarchs, and even there it's
> not perfect for small copies.
>
> On the older machines that are relevant for 32-bit code, it's often
> tens of cycles just for the ucode overhead, I think, and "rep movsb"
> actually does things literally a byte at a time.

Ugh, okay. So I'll switch to movsl, which should at least perform on par
with the chain of 'pushl' instructions I had before.

Thanks,

	Joerg