On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.


rep movsd is very CPU-dependent and needs some preconditions to be fast. For relatively short memory blocks it sucks on many CPUs other than the latest Intel ones.

See what Agner Fog has to say about it:

16.10 String instructions (all processors)
String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors. REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI + WORDSIZE - EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met). Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors. While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases.
If the above conditions for fast move are not met then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS.
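
To make the quoted conditions a bit more concrete, here is a rough sketch in C of how one might test them before choosing a copy strategy. The function names and both thresholds (`CACHE_LINE_SIZE`, `MIN_FAST_COUNT`) are my own assumptions for illustration, not values from the manual; the real limits vary by processor, and the direction flag and memory type are not checked at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning constants; real values depend on the processor. */
#define CACHE_LINE_SIZE 64u   /* assumed cache line size in bytes */
#define MIN_FAST_COUNT  256u  /* assumed threshold for "count must be high" */

/* True if p is aligned to `alignment` bytes (must be a power of two). */
bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Rough test of the quoted preconditions for a forward copy of n bytes:
 * large count, both pointers aligned by the word size, and source and
 * destination at least one cache line apart. */
bool rep_movs_likely_fast(const void *dst, const void *src,
                          size_t n, size_t word_size)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t distance = d > s ? d - s : s - d;

    return n >= MIN_FAST_COUNT
        && is_aligned(dst, word_size)
        && is_aligned(src, word_size)
        && distance >= CACHE_LINE_SIZE;
}

/* P2/P3 cache bank conflict rule from the quote:
 * a penalty occurs if ESI + WORDSIZE - EDI is divisible by 32.
 * Unsigned wraparound is harmless here because 32 divides 2^N. */
bool has_bank_conflict(uintptr_t src, uintptr_t dst, size_t word_size)
{
    return ((src + word_size - dst) % 32u) == 0;
}
```

A memcpy implementation could call something like `rep_movs_likely_fast` once up front and fall back to an XMM-based loop when it returns false. Whether that check pays for itself on short blocks is something only a benchmark on the target CPU can tell.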

