Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

2017-11-01 Thread Sergei Gorelkin via fpc-devel



01.11.2017 10:46, Sven Barth via fpc-devel wrote:
> On 01.11.2017 05:58, "J. Gareth Moreton" wrote:
>
> > Would it be worth opening up a bug report for this, with the attached
> > assembler routines as suggestions? I haven't worked out how to implement
> > internal functions into the compiler yet, and I'd rather clear it with you
> > guys first before I make such an addition.  I had a thought that the simple
> > routines above could be used when compiling for small code size, while
> > larger, more advanced ones are used when compiling for speed.
>
> Improvements like these are always welcome. Two points however:
> The Fill* routines are not part of the compiler, but of the RTL (the Pascal
> routines are in rtl/inc/generic.inc, the assembly ones reside in
> rtl/CPU/CPU.inc) and they aren't handled differently depending on the
> current optimization flags, so a one-size-fits-all is needed (look at e.g.
> the i386 ones).
> I also think that you might need to handle memory that isn't correctly
> aligned for the assembler instructions (I didn't look at your routines in
> detail so I don't know whether they'd need to be adjusted for that). A
> check of the i386 routines will probably help here as well.




Another important thing to note is that all modifications to the stack pointer and nonvolatile
registers on x86_64 need SEH annotations on win64 and CFI annotations on linux/bsd. The former are
available only in AT&T syntax, the latter are not supported.
This requirement, together with the different parameter locations, makes writing assembler routines
for x86_64 much more complicated than for i386. For this reason, existing assembler routines in the
RTL avoid using nonvolatile registers as much as possible.
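
A minimal sketch of what avoiding nonvolatile registers can look like on Win64, assuming the usual Win64 parameter locations (RCX, RDX, R8). The routine name is hypothetical, and the plain store loop is only an illustration of the idea, not code taken from the RTL:

```pascal
{$ASMMODE INTEL}

{ Hypothetical example, Win64 only.
  RCX = pointer to x, RDX = count, R8D = value.
  Only volatile registers are written, so nothing has to be saved
  and no SEH annotation is needed. }
procedure VolatileOnlyFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  TEST RDX, RDX
  JLE  @Done                    { nothing to do for count <= 0 }
@Next:
  MOV  DWORD PTR [RCX], R8D     { store one DWord }
  ADD  RCX, 4                   { advance the destination pointer }
  DEC  RDX
  JNZ  @Next
@Done:
end;
```

Because RDI is never touched, the PUSH/POP pair disappears along with the need for an annotation. The trade-off is that REP STOS, which requires RDI, is no longer available.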

Regards,
Sergei
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

2017-11-01 Thread Martok
On 01.11.2017 at 05:58, J. Gareth Moreton wrote:
> So I've been doing some playing around recently, and noticed that while
> FillChar has some very fast internal code for initialising a block of memory,
> making use of non-temporal hints and memory fences, the versions for the
> larger types fall back to slow Pascal code.
It might be worth it to look at the Pascal versions from generic.inc first, and
see if it is possible to come up with versions that generate faster code.
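
One possible shape for such a Pascal version, as a rough sketch only: it writes two DWords per store by widening the pattern to a QWord. The routine name is hypothetical, it assumes unaligned 64-bit stores are acceptable on the target, and whether it actually generates faster code than the generic.inc version would have to be measured:

```pascal
{ Hypothetical pure-Pascal sketch, not the generic.inc implementation. }
procedure SketchFillDWord(var x; count: SizeInt; Value: DWord);
var
  P: PDWord;
  Pattern: QWord;
begin
  if count <= 0 then
    Exit;
  P := @x;
  { Both halves hold Value, so the result does not depend on endianness. }
  Pattern := QWord(Value) or (QWord(Value) shl 32);
  while count >= 2 do
  begin
    PQWord(P)^ := Pattern;  { two DWords per store; may be unaligned }
    Inc(P, 2);              { PDWord arithmetic: advances 8 bytes }
    Dec(count, 2);
  end;
  if count = 1 then
    P^ := Value;            { odd element left over }
end;
```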

I'm actually surprised "REP STOSD" should be that much faster. I remember it
being slower on modern platforms than it used to be?

-- 
Regards,
Martok

Ceterum censeo b32079 esse sanandam.



Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

2017-11-01 Thread Florian Klämpfl
On 01.11.2017 at 05:58, J. Gareth Moreton wrote:

> I also made versions that use memory fences and other checks such as memory
> alignment in order to gain speed - I've converted them to use the System V
> ABI of Linux as well, but they are currently completely untested, as I don't
> yet have the facilities to compile on Linux (they are also even smaller in
> code size because you don't need to push and pop RDI, and the destination
> (var x) is already stored in RDI, thereby collapsing each routine to just 3
> instructions (not including the REP prefix)).
> 
> Would it be worth opening up a bug report for this, with the attached 
> assembler routines as suggestions? 

Yes, for sure.

> I haven't worked out how to implement internal functions into the compiler yet,

Fill* are not internal functions, so you just have to adapt the system unit.

> and I'd rather clear it with you guys first before I make such an addition.
> I had a thought that the simple routines above could be used when compiling
> for small code size, while larger, more advanced ones are used when compiling
> for speed.

I would provide only one version. After all, Fill* is only a very small part of
the RTL, so shaving off a few bytes here does not matter, and we are not in a
1k contest :)


Re: [fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

2017-11-01 Thread Sven Barth via fpc-devel
On 01.11.2017 05:58, "J. Gareth Moreton" wrote:

> Would it be worth opening up a bug report for this, with the attached
> assembler routines as suggestions? I haven't worked out how to implement
> internal functions into the compiler yet, and I'd rather clear it with you
> guys first before I make such an addition.  I had a thought that the simple
> routines above could be used when compiling for small code size, while
> larger, more advanced ones are used when compiling for speed.


Improvements like these are always welcome. Two points however:
The Fill* routines are not part of the compiler, but of the RTL (the Pascal
routines are in rtl/inc/generic.inc, the assembly ones reside in
rtl/CPU/CPU.inc) and they aren't handled differently depending on the
current optimization flags, so a one-size-fits-all is needed (look at e.g.
the i386 ones).
I also think that you might need to handle memory that isn't correctly
aligned for the assembler instructions (I didn't look at your routines in
detail so I don't know whether they'd need to be adjusted for that). A
check of the i386 routines will probably help here as well.

Regards,
Sven


[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

2017-10-31 Thread J. Gareth Moreton
So I've been doing some playing around recently, and noticed that while
FillChar has some very fast internal code for initialising a block of memory,
making use of non-temporal hints and memory fences, the versions for the
larger types fall back to slow Pascal code.  To showcase this, I ran a test on
my 6-year-old laptop that compared a small and slightly basic assembler routine
against the internal functions (times are averaged over 100 iterations):

FillWord - initialise 16,777,216 words to 0

- Internal: 8177.209 µs
- Assembler: 4234.131 µs

FillWord - initialise 1,048,576 words to $

- Internal: 153.221 µs
- Assembler: 86.496 µs

FillWord - initialise 1,229 words to $

- Internal: 0.267 µs
- Assembler: 0.135 µs

FillDWord - initialise 16,777,216 DWords to 0

- Internal: 15552.032 µs
- Assembler: 10945.809 µs

FillDWord - initialise 1,048,576 DWords to $

- Internal: 902.060 µs
- Assembler: 470.788 µs

FillDWord - initialise 1,229 DWords to $

- Internal: 0.357 µs
- Assembler: 0.275 µs

FillQWord - initialise 16,777,216 QWords to 0

- Internal: 33397.248 µs
- Assembler: 17488.901 µs

FillQWord - initialise 1,048,576 QWords to $

- Internal: 2130.116 µs
- Assembler: 1258.130 µs

FillQWord - initialise 1,229 QWords to $

- Internal: 0.739 µs
- Assembler: 0.402 µs
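
For reference, a hedged sketch of how an average like the ones above could be taken. This is not the harness used for these figures: the helper name and the procedural type are hypothetical, and GetTickCount64 from SysUtils only has millisecond resolution, so the very short runs would need a higher-resolution timer or far more iterations:

```pascal
{$mode objfpc}
uses
  SysUtils;

type
  { Hypothetical type matching the FillWord signature. }
  TFillWordProc = procedure(var x; count: SizeInt; Value: Word);

{ Hypothetical helper: average duration of one call, in microseconds. }
function AverageMicroseconds(Fill: TFillWordProc; var Buffer;
  Count: SizeInt; Value: Word; Iterations: Integer): Double;
var
  StartTick: QWord;
  I: Integer;
begin
  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    Fill(Buffer, Count, Value);
  { Milliseconds elapsed, converted to microseconds per iteration. }
  Result := (GetTickCount64 - StartTick) * 1000.0 / Iterations;
end;
```

Calling it once with the routine under test and once with a thin wrapper around the internal FillWord, on the same buffer, gives two comparable averages.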


The assembler functions were as follows:
{$ASMMODE INTEL}

procedure SizeOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8W = Value }
  PUSH RDI
  MOV  AX,  R8W
  MOV  RDI, RCX
  MOV  RCX, RDX
  REP  STOSW
  POP  RDI
end;

procedure SizeOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8D = Value }
  PUSH RDI
  MOV  EAX, R8D
  MOV  RDI, RCX
  MOV  RCX, RDX
  REP  STOSD
  POP  RDI
end;

procedure SizeOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8  = Value }
  PUSH RDI
  MOV  RAX, R8
  MOV  RDI, RCX
  MOV  RCX, RDX
  REP  STOSQ
  POP  RDI
end;


I also made versions that use memory fences and other checks such as memory
alignment in order to gain speed - I've converted them to use the System V ABI
of Linux as well, but they are currently completely untested, as I don't yet
have the facilities to compile on Linux (they are also even smaller in code
size because you don't need to push and pop RDI, and the destination (var x)
is already stored in RDI, thereby collapsing each routine to just 3
instructions (not including the REP prefix)).
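
The three-instruction System V form described above might look like the following sketch. It assumes the usual System V AMD64 parameter locations (RDI, RSI, RDX); the routine names are hypothetical, and, like the Win64 versions, there is no alignment handling and no guard against a negative count:

```pascal
{$ASMMODE INTEL}

{ Hypothetical sketches for System V AMD64 targets (e.g. x86-64 Linux).
  RDI = pointer to x (already where STOS expects it)
  RSI = count
  RDX = value (DX / EDX / RDX depending on width) }
procedure SysVFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
  MOV  AX,  DX
  MOV  RCX, RSI
  REP  STOSW
end;

procedure SysVFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  MOV  EAX, EDX
  MOV  RCX, RSI
  REP  STOSD
end;

procedure SysVFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  MOV  RAX, RDX
  MOV  RCX, RSI
  REP  STOSQ
end;
```

RDI, RCX and RAX are all volatile under this ABI, so nothing needs to be saved or restored.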

Would it be worth opening up a bug report for this, with the attached assembler
routines as suggestions? I haven't worked out how to implement internal
functions into the compiler yet, and I'd rather clear it with you guys first
before I make such an addition.  I had a thought that the simple routines above
could be used when compiling for small code size, while larger, more advanced
ones are used when compiling for speed.

Yours faithfully,

J. Gareth "Kit" Moreton