Re: Replacing C's memcpy with a D implementation

Patrick Schluter via Digitalmars-d Sun, 10 Jun 2018 21:46:22 -0700

On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:

On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

rep movsd is very CPU dependent and needs some preconditions to be fast. For relatively short memory blocks it sucks on many CPUs other than the latest Intel ones.
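To make the point about short blocks concrete, here is a minimal sketch in C of dispatching on the copy size. The name copy_dispatch and the 64-byte cutoff are hypothetical, not something from the benchmark above. The standard memcpy stands in for whatever bulk path (rep movsd, XMM moves) turns out best on the target CPU, and the real threshold has to be measured per processor.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: the real value is CPU dependent and must be measured. */
#define SMALL_COPY_THRESHOLD 64

/* Copies n bytes from src to dst; the regions must not overlap. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < SMALL_COPY_THRESHOLD) {
        /* Short block: a plain loop avoids the startup cost of a string
           instruction or of a heavily tuned bulk routine. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }
    /* Long block: hand over to the bulk path (memcpy is a stand-in here). */
    return memcpy(dst, src, n);
}
```

Benchmarking such a function over a range of sizes, not just one large buffer, is what would show where rep movsd starts to pay off.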


See what Agner Fog has to say about it:

16.10 String instructions (all processors)

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors.

REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met). Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors.

While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS.
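
A rough sketch in C of the checks described in the excerpt. All names and constants here are my own assumptions, not Agner Fog's: the word size is taken as 4 (DWORD), the cache line as 64 bytes, "aligned" is read as aligned by the word size, and the "high count" cutoff of 512 is made up. The forward direction and the memory type are not checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_SIZE       4u    /* DWORD, as moved by REP MOVSD (assumption) */
#define CACHE_LINE_SIZE 64u   /* assumed; depends on the processor */
#define MIN_FAST_COUNT  512u  /* hypothetical cutoff for a "high" count */

/* P2/P3 cache bank conflict: one extra clock per iteration when
   ESI + WORDSIZE - EDI is divisible by 32. Unsigned wraparound is
   harmless because 32 divides the range of uintptr_t. */
bool has_bank_conflict(uintptr_t src, uintptr_t dst)
{
    return ((src + WORD_SIZE - dst) % 32u) == 0;
}

/* Checks the typical conditions for the fast string path that can be
   tested from the addresses and the byte count alone. */
bool may_use_fast_string(uintptr_t src, uintptr_t dst, size_t count)
{
    uintptr_t distance = src > dst ? src - dst : dst - src;
    return count >= MIN_FAST_COUNT
        && src % WORD_SIZE == 0
        && dst % WORD_SIZE == 0
        && distance >= CACHE_LINE_SIZE;
}
```

A copy routine could call may_use_fast_string first and fall back to XMM moves or a plain loop when it returns false.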
