Hi Brian,

   Thanks for replying.

   Is there a way to simplify to rep movsb (or rep movsw, since the array is
uint16_t) without using assembly code?
The code is currently 100% C, and I would prefer to avoid mixing C and
assembly code.  Also, it is unclear to me that even rep movsw would be faster
than my bloated C code, which compiles to a loop that moves 8 words in 5
instructions:

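.L25:
    movdqu -16(%rax), %xmm1    # load the 16 bytes (8 words) ending at rax
    subq $16, %rax             # step back one 16-byte chunk
    movups %xmm1, 2(%rax)      # store them 2 bytes (one word) higher
    cmpq %rdx, %rax            # compare with the low bound in rdx
    jnb .L25                   # loop while rax >= rdx (unsigned)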
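
For reference, here is a minimal C sketch of a loop with the same shape as the
assembly above.  It is an illustration, not the actual source: the name
shift_up_chunks and its parameters are hypothetical, and it assumes that
first <= last and that a[last] is writable.  The fixed-size memcpy calls are a
portable way to express unaligned 16-byte loads and stores; a compiler may
lower them to movdqu/movups, but that depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift a[first..last) up by one element, working
 * backward in 8-element (16-byte) chunks.  Returns the number of leading
 * elements (0..7) that were too few to fill a chunk and were NOT moved. */
static size_t shift_up_chunks(uint16_t *a, size_t first, size_t last)
{
    size_t end = last;                 /* one past the unmoved region */

    while (end - first >= 8) {
        uint16_t tmp[8];
        memcpy(tmp, a + end - 8, sizeof tmp);   /* unaligned 16-byte load */
        memcpy(a + end - 7, tmp, sizeof tmp);   /* store one element higher */
        end -= 8;
    }
    return end - first;                /* a[first..end) still needs moving */
}
```

The return value is the part the rest of this mail is about: the 0 to 7
elements at the start that do not fill a 16-byte chunk.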

   I am currently evaluating the stash-and-move method for the uint16_t data
at the start that can't be moved in 16-byte chunks.
It uses two extra 64-bit registers, but that may be better than having the
compiler load addresses into registers for a memmove call that moves the
last 2 - 14 bytes (which is what the compiler does now).  I didn't think of
that option until I looked at the memmove assembly code.
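
One possible reading of the stash-and-move idea, as a rough C sketch and not
the code under evaluation: save the first 16 bytes of the range in two 64-bit
values before the chunk loop runs, then write them back one element higher
afterward.  The function name shift_up_one is hypothetical, the short-range
fallback loop is my assumption, and a[last] must be writable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift a[first..last) up by one element. */
static void shift_up_one(uint16_t *a, size_t first, size_t last)
{
    size_t n = last - first;

    if (n < 8) {                       /* too short to stash 16 bytes */
        for (size_t i = last; i > first; i--)
            a[i] = a[i - 1];
        return;
    }

    uint64_t stash[2];                 /* old a[first..first+8) */
    memcpy(stash, a + first, sizeof stash);

    size_t end = last;
    while (end - first >= 8) {         /* backward 16-byte chunk moves */
        uint16_t tmp[8];
        memcpy(tmp, a + end - 8, sizeof tmp);
        memcpy(a + end - 7, tmp, sizeof tmp);
        end -= 8;
    }

    /* Covers the 0..7 unmoved elements.  Any positions the chunk loop
     * already wrote receive the same values again, so this is harmless. */
    memcpy(a + first + 1, stash, sizeof stash);
}
```

The final store may overlap the lowest chunk, which trades a few redundant
bytes for not needing a separate call or a variable-length tail loop.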

   I thought alignment might be an issue, but noticed that the memmove
assembly code does not perform any alignment.  It first checks the number of
bytes to move.  If the number of bytes is less than 8, it jumps to the movsb
section.  If the length is 8 or more, it stashes the highest-address bytes
that need to be moved.  Then it moves the data with rep movsq, starting at the
beginning of the array or, for backward moves, near the end, and continues
until the remaining length is less than 8 bytes (or 0 bytes for backward
moves).  Then it uses the stashed data to finish the move.  The addresses for
the rep movsq can have any alignment that is consistent with the alignment of
the data being moved.  Since this code is moving uint16_t's, the rep movsq
addresses are only guaranteed 2-byte alignment.  At least if I am reading the
assembly code correctly....
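
To make sure I am describing the forward case clearly, here is the same
sequence as a plain C sketch.  This is my paraphrase, not the library source:
the name move_down_stash is hypothetical, an ordinary loop of 8-byte copies
stands in for rep movsq, and it assumes dst is at a lower address than src
whenever the two ranges overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: move len bytes from src down to dst (dst < src if
 * the ranges overlap), with no alignment handling at all. */
static void move_down_stash(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    if (len < 8) {                     /* short move: byte loop (movsb) */
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
        return;
    }

    uint64_t tail;                     /* stash the highest 8 source bytes */
    memcpy(&tail, src + len - 8, sizeof tail);

    for (size_t i = 0; i < len / 8; i++) {   /* stands in for rep movsq */
        uint64_t w;
        memcpy(&w, src + 8 * i, sizeof w);
        memcpy(dst + 8 * i, &w, sizeof w);
    }

    /* Finish with the stash: covers the last 0..7 bytes and is safe even
     * if the loop above has already overwritten the end of the source. */
    memcpy(dst + len - 8, &tail, sizeof tail);
}
```

Nothing here depends on the pointers being 8-byte aligned, which matches what
I see in the assembly.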

Best Regards,

Kennon



  
> On 02/27/2026 11:49 AM PST Brian Inglis via Cygwin <[email protected]> wrote:
> 
>  
> Hi Kennon,
> 
> Some perf reports and analysis imply that backward moves (with overlap?)
> are no faster than straight rep movsb on some CPUs, so it may be better to
> just simplify to that, unless you want to stash the final element(s) to be
> moved out of the way in register(s), and use multiple registers in unrolled
> wide moves for the aligned portion?
>

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
