Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
01.11.2017 10:46, Sven Barth via fpc-devel wrote:
> On 01.11.2017 05:58, "J. Gareth Moreton" wrote:
>> Would it be worth opening up a bug report for this, with the attached
>> assembler routines as suggestions? I haven't worked out how to implement
>> internal functions into the compiler yet, and I'd rather clear it with you
>> guys first before I make such an addition. I had a thought that the simple
>> routines above could be used when compiling for small code size, while
>> larger, more advanced ones are used when compiling for speed.
>
> Improvements like these are always welcome. Two points however:
>
> The Fill* routines are not part of the compiler, but of the RTL (the Pascal
> routines are in rtl/inc/generic.inc, the assembly ones reside in
> rtl/CPU/CPU.inc) and they aren't handled differently depending on the
> current optimization flags, so a one-size-fits-all is needed (look at e.g.
> the i386 ones).
>
> I also think that you might need to handle memory that isn't correctly
> aligned for the assembler instructions (I didn't look at your routines in
> detail so I don't know whether they'd need to be adjusted for that). A check
> of the i386 routines will probably help here as well.

Another important thing to note is that all modifications to the stack pointer and to nonvolatile registers on x86_64 need SEH annotations on win64 and CFI annotations on linux/bsd. The former are available only in AT&T syntax; the latter are not supported. This requirement, together with the different parameter locations, makes writing assembler routines for x86_64 much more complicated than for i386. For this reason, the existing assembler routines in the RTL avoid using nonvolatile registers as much as possible.

Regards,
Sergei

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
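To illustrate the point about nonvolatile registers: one way to stay clear of SEH annotations is to use only volatile registers and leave the stack pointer alone. The following is a minimal sketch, assuming the Win64 calling convention (RCX = destination, RDX = count, R8D = value). The name SketchFillDWordVolatileOnly is hypothetical, the plain store loop is not one of the routines proposed in this thread, and no claim is made about its speed.

```pascal
program VolatileOnlyFill;
{$mode objfpc}
{$asmmode intel}

{ Hypothetical example, Win64 only. It uses RCX, RDX and R8 exclusively,
  all of which are volatile in the Win64 ABI, and never modifies RSP,
  so nothing in it would require an SEH annotation. }
procedure SketchFillDWordVolatileOnly(var x; count: SizeInt; value: DWord);
  assembler; nostackframe;
asm
  TEST RDX, RDX                 { count <= 0: nothing to do }
  JLE  @Done
@Next:
  MOV  DWORD PTR [RCX], R8D     { store one DWord }
  ADD  RCX, 4                   { advance the destination pointer }
  DEC  RDX
  JNZ  @Next
@Done:
end;

var
  buf: array[0..7] of DWord;
  i: Integer;
begin
  for i := 0 to 7 do
    buf[i] := 0;
  SketchFillDWordVolatileOnly(buf[1], 6, $12345678);
  { Only elements 1..6 should have been written. }
  for i := 0 to 7 do
    if (buf[i] = $12345678) <> ((i >= 1) and (i <= 6)) then
      Halt(1);
  WriteLn('ok');
end.
```

The trade-off is that REP STOS is not available this way, since it needs RDI, which is nonvolatile on Win64.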
Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
On 01.11.2017 at 05:58, J. Gareth Moreton wrote:
> So I've been doing some playing around recently, and noticed that while
> FillChar has some very fast internal code for initialising a block of
> memory, making use of non-temporal hints and memory fences, the versions
> for the larger types fall back to slow Pascal code.

It might be worth looking at the Pascal versions in generic.inc first, to see whether it is possible to come up with versions that generate faster code.

I'm actually surprised "REP STOSD" should be that much faster. I remember it being slower on modern platforms than it used to be?

--
Regards,
Martok

Ceterum censeo b32079 esse sanandam.
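As a reference point for what a portable Pascal version can look like, here is a minimal sketch of a pointer-walking fill loop. It is not the code from generic.inc; the name SketchFillDWord is hypothetical, and whether the compiler generates fast code for it depends on the target and optimisation settings.

```pascal
program PascalFillSketch;
{$mode objfpc}

{ Hypothetical example: fill count DWords starting at x with value,
  using only portable Pascal. A count of zero or less does nothing. }
procedure SketchFillDWord(var x; count: SizeInt; value: DWord);
var
  p: PDWord;
begin
  p := @x;
  while count > 0 do
  begin
    p^ := value;   { store one element }
    Inc(p);        { advances by SizeOf(DWord) }
    Dec(count);
  end;
end;

var
  buf: array[0..15] of DWord;
  i: Integer;
begin
  for i := 0 to 15 do
    buf[i] := 0;
  SketchFillDWord(buf[2], 10, $DEADBEEF);
  { Only elements 2..11 should have been written. }
  for i := 0 to 15 do
    if (buf[i] = $DEADBEEF) <> ((i >= 2) and (i < 12)) then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('ok');
end.
```

A routine of this shape makes a convenient baseline when timing any assembler replacement.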
Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
On 01.11.2017 at 05:58, J. Gareth Moreton wrote:
> I also made versions that use memory fences and other checks such as memory
> alignment in order to gain speed - I've converted them to use the System V
> ABI of Linux as well, but they are currently completely untested as I don't
> yet have the facilities to compile on Linux (they are also even smaller in
> code size because you don't need to push and pop RDI, and the destination
> (var x) is already stored in RDI, thereby collapsing each routine to just 3
> instructions (not including the REP prefix)).
>
> Would it be worth opening up a bug report for this, with the attached
> assembler routines as suggestions?

Yes, for sure.

> I haven't worked out how to implement internal functions into the compiler
> yet,

Fill* are not internal functions, so you just have to adapt the system unit.

> and I'd rather clear it with you guys first before I make such an addition.
> I had a thought that the simple routines above could be used when compiling
> for small code size, while larger, more advanced ones are used when
> compiling for speed.

I would provide only one version. After all, Fill* is only a very small part of the RTL, so shaving off a few bytes here does not matter, and we are not in a 1k contest :)
Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
On 01.11.2017 05:58, "J. Gareth Moreton" wrote:
> Would it be worth opening up a bug report for this, with the attached
> assembler routines as suggestions? I haven't worked out how to implement
> internal functions into the compiler yet, and I'd rather clear it with you
> guys first before I make such an addition. I had a thought that the simple
> routines above could be used when compiling for small code size, while
> larger, more advanced ones are used when compiling for speed.

Improvements like these are always welcome. Two points however:

The Fill* routines are not part of the compiler, but of the RTL (the Pascal routines are in rtl/inc/generic.inc, the assembly ones reside in rtl/CPU/CPU.inc) and they aren't handled differently depending on the current optimization flags, so a one-size-fits-all is needed (look at e.g. the i386 ones).

I also think that you might need to handle memory that isn't correctly aligned for the assembler instructions (I didn't look at your routines in detail so I don't know whether they'd need to be adjusted for that). A check of the i386 routines will probably help here as well.

Regards,
Sven
[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
So I've been doing some playing around recently, and noticed that while FillChar has some very fast internal code for initialising a block of memory, making use of non-temporal hints and memory fences, the versions for the larger types fall back to slow Pascal code.

To showcase this, I ran a test on my 6-year-old laptop that compared a small and slightly basic assembler routine against the internal functions (times are averaged over 100 iterations):

  Routine    Elements    Value  Internal (µs)  Assembler (µs)
  FillWord   16,777,216  0           8177.209        4234.131
  FillWord    1,048,576  $            153.221          86.496
  FillWord        1,229  $              0.267           0.135
  FillDWord  16,777,216  0          15552.032       10945.809
  FillDWord   1,048,576  $            902.060         470.788
  FillDWord       1,229  $              0.357           0.275
  FillQWord  16,777,216  0          33397.248       17488.901
  FillQWord   1,048,576  $           2130.116        1258.130
  FillQWord       1,229  $              0.739           0.402

The assembler functions were as follows:

  {$ASMMODE INTEL}

  procedure SizeOptimisedFillWord(var x; count: SizeInt; Value: Word);
    assembler; nostackframe;
  asm
    { RCX = Pointer to x
      RDX = Count
      R8W = Value }
    PUSH RDI
    MOV  AX, R8W
    MOV  RDI, RCX
    MOV  RCX, RDX
    REP STOSW
    POP  RDI
  end;

  procedure SizeOptimisedFillDWord(var x; count: SizeInt; Value: DWord);
    assembler; nostackframe;
  asm
    { RCX = Pointer to x
      RDX = Count
      R8D = Value }
    PUSH RDI
    MOV  EAX, R8D
    MOV  RDI, RCX
    MOV  RCX, RDX
    REP STOSD
    POP  RDI
  end;

  procedure SizeOptimisedFillQWord(var x; count: SizeInt; Value: QWord);
    assembler; nostackframe;
  asm
    { RCX = Pointer to x
      RDX = Count
      R8 = Value }
    PUSH RDI
    MOV  RAX, R8
    MOV  RDI, RCX
    MOV  RCX, RDX
    REP STOSQ
    POP  RDI
  end;

I also made versions that use memory fences and other checks such as memory alignment in order to gain speed. I've converted them to use the System V ABI of Linux as well, but they are currently completely untested, as I don't yet have the facilities to compile on Linux. (They are also even smaller in code size because you don't need to push and pop RDI, and the destination (var x) is already stored in RDI, thereby collapsing each routine to just 3 instructions, not including the REP prefix.)

Would it be worth opening up a bug report for this, with the attached assembler routines as suggestions? I haven't worked out how to implement internal functions into the compiler yet, and I'd rather clear it with you guys first before I make such an addition. I had a thought that the simple routines above could be used when compiling for small code size, while larger, more advanced ones are used when compiling for speed.

Yours faithfully,
J. Gareth "Kit" Moreton
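The System V versions mentioned above were not included in the message. As a rough illustration of how the QWord routine might collapse to three instructions, here is a minimal sketch, assuming the System V AMD64 calling convention (RDI = destination, RSI = count, RDX = value). The name SketchSysVFillQWord is hypothetical and this is not the author's code. Like the Win64 routines above, it does not guard against a negative count.

```pascal
program SysVFillSketch;
{$mode objfpc}
{$asmmode intel}

{ Hypothetical reconstruction, System V AMD64 ABI only (e.g. Linux):
  RDI = pointer to x, RSI = count, RDX = value. RDI already holds the
  destination and is volatile in this ABI, so no PUSH/POP is needed. }
procedure SketchSysVFillQWord(var x; count: SizeInt; value: QWord);
  assembler; nostackframe;
asm
  MOV RAX, RDX     { value to store }
  MOV RCX, RSI     { element count for REP }
  REP STOSQ        { store RAX at [RDI], RCX times }
end;

var
  buf: array[0..7] of QWord;
  i: Integer;
begin
  for i := 0 to 7 do
    buf[i] := 0;
  SketchSysVFillQWord(buf[1], 6, $0123456789ABCDEF);
  { Only elements 1..6 should have been written. }
  for i := 0 to 7 do
    if (buf[i] = $0123456789ABCDEF) <> ((i >= 1) and (i <= 6)) then
      Halt(1);
  WriteLn('ok');
end.
```

The Word and DWord variants would differ only in taking the value from DX or EDX and using REP STOSW or REP STOSD.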