Thanks for the feedback, everyone. I wasn't sure about internal functions because FillWord, for example, is wrapped in "{$ifndef FPC_SYSTEM_HAS_FILLWORD}", a symbol that isn't defined under Win64. FPC_SYSTEM_HAS_FILLCHAR, on the other hand, is defined, yet the implementation of FillChar is nowhere to be found when you search for it in Lazarus.
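To illustrate the guard pattern I mean, here is a minimal sketch. It is not the actual RTL source: the symbol DEMO_HAS_FILLWORD and the routine DemoFillWord are hypothetical stand-ins for FPC_SYSTEM_HAS_FILLWORD and FillWord, and the "CPU-specific" version simply delegates to the System unit's FillWord instead of being written in assembler.

```pascal
program GuardSketch;
{$mode objfpc}

{ Hypothetical symbol standing in for FPC_SYSTEM_HAS_FILLWORD. Remove the
  next line to compile the generic fallback instead. }
{$define DEMO_HAS_FILLWORD}

{$ifdef DEMO_HAS_FILLWORD}
{ Stand-in for a CPU-specific include file; a real one would be assembler. }
procedure DemoFillWord(var x; count: SizeInt; value: Word);
begin
  FillWord(x, count, value);
end;
{$endif}

{$ifndef DEMO_HAS_FILLWORD}
{ Generic fallback, compiled only when nothing above claimed the routine. }
procedure DemoFillWord(var x; count: SizeInt; value: Word);
var
  p: PWord;
  i: SizeInt;
begin
  p := PWord(@x);
  for i := 0 to count - 1 do
    p[i] := value;
end;
{$endif}

var
  buf: array[0..7] of Word;
begin
  FillChar(buf, SizeOf(buf), 0);
  { Fill only the first six words; the last two should stay untouched. }
  DemoFillWord(buf, 6, $ABCD);
  WriteLn(HexStr(buf[5], 4), ' ', HexStr(buf[6], 4));
end.
```

Whichever branch is compiled, the caller sees a single DemoFillWord, which is presumably why a search only turns up the guarded generic version.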
For the speed-optimised assembler routines, I have the following (which does borrow ideas from FillChar):

procedure SpeedOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8W = Value }
  PUSH  RDI
  MOVZX RAX, R8W
  MOV   R9, $0001000100010001
  MOV   RDI, RCX
  IMUL  RAX, R9

  { Do some memory alignment first (it should be at least aligned to a
    16-bit boundary already). Note: there is no check yet that Count is
    large enough to cover these alignment writes. }
  AND   CL, $6
  JZ    @Aligned8
  TEST  CL, $2
  JZ    @Aligned4

  MOV   [RDI], R8W
  DEC   RDX
  ADD   RDI, $2
  TEST  CL, $4
  JNZ   @Aligned8
  { Jump on NOT zero: if bit 2 of CL is set, the block started 2 bytes
    short of the 8-byte boundary and is now aligned; otherwise fall
    through and write a DWord as well }

@Aligned4:
  MOV   [RDI], EAX
  SUB   RDX, $2
  ADD   RDI, $4

@Aligned8:
  MOV   R10B, DL
  SHR   RDX, 2
  AND   R10B, $3
  MOV   RCX, RDX
  CMP   RDX, $80000
  JB    @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }

  SHR   RDX, 2
  AND   RCX, $3

  { Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
  ADD    RDI, $20
  MOVNTI [RDI-$20], RAX
  MOVNTI [RDI-$18], RAX
  DEC    RDX
  MOVNTI [RDI-$10], RAX
  MOVNTI [RDI-$8], RAX
  JNZ    @BlockLoop

  MFENCE

@NoBlocks:
  SHR   R10B, 1
  REP   STOSQ
  JNC   @NoLooseWord
  MOV   [RDI], R8W
  LEA   RDI, [RDI+2]

@NoLooseWord:
  JZ    @NoLooseDWord
  MOV   [RDI], EAX

@NoLooseDWord:
  POP   RDI
end;

procedure SpeedOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8D = Value }
  PUSH  RDI
  { Assumes the upper 32 bits of R8 are zero }
  MOV   RAX, R8
  MOV   RDI, RCX
  SHL   RAX, 32
  OR    RAX, R8

  { Do some memory alignment first (it should be at least aligned to a
    32-bit boundary already). Note: there is no check yet that Count is
    large enough to cover this alignment write. }
  AND   CL, $4
  JZ    @Aligned8

  MOV   [RDI], R8D
  DEC   RDX
  ADD   RDI, $4

@Aligned8:
  SHR   RDX, 1
  SETC  R10B
  MOV   RCX, RDX
  CMP   RDX, $80000
  JB    @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }

  SHR   RDX, 2
  AND   RCX, $3

  { Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
  ADD    RDI, $20
  MOVNTI [RDI-$20], RAX
  MOVNTI [RDI-$18], RAX
  DEC    RDX
  MOVNTI [RDI-$10], RAX
  MOVNTI [RDI-$8], RAX
  JNZ    @BlockLoop

  MFENCE

@NoBlocks:
  TEST  R10B, R10B
  REP   STOSQ
  JZ    @NoLooseDWord
  MOV   [RDI], EAX

@NoLooseDWord:
  POP   RDI
end;

procedure SpeedOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8  = Value }
  PUSH  RDI
  CMP   RDX, $80000
  MOV   RDI, RCX
  MOV   RCX, RDX
  JB    @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }

  AND   RCX, $3
  SHR   RDX, 2
  JZ    @NoBlocks

  { Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
  ADD    RDI, $20
  MOVNTI [RDI-$20], R8
  MOVNTI [RDI-$18], R8
  DEC    RDX
  MOVNTI [RDI-$10], R8
  MOVNTI [RDI-$8], R8
  JNZ    @BlockLoop

  MFENCE

@NoBlocks:
  MOV   RAX, R8
  REP   STOSQ
  POP   RDI
end;

Regarding the CFI annotations, these functions are actually even better under Linux x64 because RDI is volatile there and doesn't need to be pushed and popped. Those two operations were the only things that modified the stack pointer, and since the above routines don't call any other procedures, we can use "nostackframe" safely.

I am tempted to experiment a little further, because one thing that's guaranteed to be present under x64 is SSE2, so it may be possible to increase the speed even more. At the same time, there may be a performance penalty if the rest of the application uses AVX or floating-point SSE.

J. Gareth "Kit" Moreton

On Wed 01/11/17 11:03 , Sergei Gorelkin via fpc-devel fpc-devel@lists.freepascal.org sent:
> 01.11.2017 10:46, Sven Barth via fpc-devel wrote:
> > On 01.11.2017 05:58, "J. Gareth Moreton" e...@moreton-family.com wrote:
> >
> > > Would it be worth opening up a bug report for this, with the attached assembler routines as suggestions? I
> > > haven't worked out how to implement internal functions into the compiler yet, and I'd rather clear it with you
> > > guys first before I make such an addition.
> > > I had a thought that the simple routines above could be used for
> > > when compiling for small code size, while larger, more advanced ones are used for when compiling for speed.
> >
> > Improvements like these are always welcome. Two points however:
> >
> > The Fill* routines are not part of the compiler, but of the RTL (the Pascal routines are in
> > rtl/inc/generic.inc, the assembly ones reside in rtl/CPU/CPU.inc) and they aren't handled
> > differently depending on the current optimization flags, so a one-size-fits-all is needed (look at
> > e.g. the i386 ones).
> >
> > I also think that you might need to handle memory that isn't correctly aligned for the assembler
> > instructions (I didn't look at your routines in detail so I don't know whether they'd need to be
> > adjusted for that). A check of the i386 routines will probably help here as well.
>
> Another important thing to note is that all modifications to stack pointer and nonvolatile registers
> on x86_64 need SEH annotations in win64 and CFI annotations on linux/bsd. The former is available
> only in AT&T syntax, the latter is not supported.
>
> This requirement, together with different parameter locations, makes writing assembler routines for
> x86_64 much more complicated than for i386. For this reason, existing assembler routines in RTL
> avoid using nonvolatile registers as much as possible.
>
> Regards,
> Sergei
>
> _______________________________________________
> fpc-devel maillist - fpc-devel@lists.freepascal.org
> http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel