from:"J. Gareth Moreton via fpc-devel"

Re: [fpc-devel] Peephole issue in 3.2.3

2024-06-02 Thread J. Gareth Moreton via fpc-devel

I made a post on that topic, as I remember that bug.  It was a peephole 
optimisation that I introduced, but before anyone tries to lynch me for 
breaking the compiler, the reason things broke is that the optimisation 
assumes the FLAGS register is not in use.  However, another part of the 
compiler, which generated the comparison instructions, neglected to mark 
FLAGS as being in use, so they got scrambled at times.
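
A toy sketch of that failure mode follows. The thread does not say which optimisation was involved, so the classic "mov reg, 0" to "xor reg, reg" rewrite is used here purely as an assumed stand-in, and the tuple layout and the zero_with_xor name are hypothetical rather than anything from the FPC source. Each instruction carries a flag saying whether FLAGS must survive it; the pass trusts that annotation.

```python
# Toy model only: instructions are (mnemonic, operands, flags_live_after).
# flags_live_after is a hypothetical annotation meaning "FLAGS must survive
# this instruction"; the peephole pass relies on it being set correctly.

def zero_with_xor(instructions):
    """Replace 'mov reg, 0' with the flag-clobbering 'xor reg, reg' when FLAGS look free."""
    result = []
    for mnemonic, operands, flags_live in instructions:
        if mnemonic == "mov" and operands[1] == "0" and not flags_live:
            # Safe only if nothing later reads the FLAGS set earlier
            result.append(("xor", (operands[0], operands[0]), flags_live))
        else:
            result.append((mnemonic, operands, flags_live))
    return result


# The comparison correctly marks FLAGS as in use up to the jump
marked = [
    ("cmp", ("eax", "ebx"), True),
    ("mov", ("ecx", "0"), True),
    ("jne", ("label",), False),
]

# The comparison forgot to mark FLAGS, so the mov looks safe to rewrite
unmarked = [
    ("cmp", ("eax", "ebx"), False),
    ("mov", ("ecx", "0"), False),
    ("jne", ("label",), False),
]

print(zero_with_xor(marked)[1][0])    # mov
print(zero_with_xor(unmarked)[1][0])  # xor
```

In the second case the pass itself behaves as designed, yet the emitted xor overwrites the result of the cmp before the jne reads it, which is the kind of scrambling described above.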


Thanks for listing the commit Martin!

6f24c8b4efccea67d092062009f413cc789a052c

This should be cherry-picked into the fixes branch because it fixes a 
bug that caused bad code to be generated.


Kit

On 02/06/2024 11:17, Martin Frb via fpc-devel wrote:
While I'm not sure how the 3.2.4 preparations are going, nor what is still 
outstanding for merging, I just wanted to quickly check whether the 
following is known (and hopefully making the list)

https://forum.lazarus.freepascal.org/index.php/topic,67448.0.html

From what I can tell
- present in today's 3.2.3
- probably peephole

Merely from the commit message, possibly
https://gitlab.com/freepascal.org/fpc/source/-/commit/6f24c8b4efccea67d092062009f413cc789a052c 


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] x86 assembler improvements, patch

2024-05-29 Thread J. Gareth Moreton via fpc-devel


Heh, that's fair.  It was essentially how we did it in the past anyway!

Kit

On 29/05/2024 22:08, Florian Klämpfl via fpc-devel wrote:

On 29.05.24 13:09, J. Gareth Moreton via fpc-devel wrote:

Shouldn't this be made as a merge request?


I am sure we can also handle such patches. Merge requests are easier, 
but patches are perfectly fine as well.




On 28/05/2024 07:12, Marģers . via fpc-devel wrote:

Some compiler x86 assembler improvements

1) patch for fpc 3.3.1 (attachment: mkx86ins_version_bump.patch)
compiler/utils/mkx86ins.pp
Version bumped from 1.6.1 to 1.6.2
There have been changes to the code, so the version has to reflect that.

2) Patch to enable ENTER asm instruction (attachment: 
enable_asm_instr_enter.patch)

same for fpc 3.3.1 and fixes 3.2

3) patch for fpc 3.3.1 compiler/x86/x86ins.dat (attachment: 
x86ins_4_fpc331.patch)

3.1)
Rename 3DNow instruction (fixes a long-standing typo in the mnemonic).
PMULHRWA  --> PMULHRW
3.2)
Add vpclmullqlqdq, vpclmulhqlqdq, vpclmullqhqdq, vpclmulhqhqdq.
3.3)
Fix "typo" for SHA1MSG2


4) patch asm instructions for fixes 3.2 (attachment: 
x86ins_4_fixes32.patch)

add missing instructions of BMI1, BMI2, ADX, CMUL, SHA, XSAVE, MOVBE
no "code" changes, only x86ins.dat and generated files with mkx86ins
Some instructions deliberately have wrong tags in order to make no 
changes beside x86ins.dat.



5) proof-of-concept patch back-porting asm instructions to fpc 3.0.4 
(attachment: x86ins_4_fpc304.patch)
add missing instructions of BMI1, BMI2, ADX, CMUL, SHA, XSAVE, 
MOVBE, RAND

no "code" changes, but const maxinfolen = 8; to maxinfolen = 9;
x86ins.dat and generated files with mkx86ins
I did this to argue that it's safe to add asm 
instructions to fpc 3.2.3.
The engine that supports those instructions has been in production for a 
while now.




Re: [fpc-devel] x86 assembler improvements, patch

2024-05-29 Thread J. Gareth Moreton via fpc-devel


Shouldn't this be made as a merge request?

On 28/05/2024 07:12, Marģers . via fpc-devel wrote:

Some compiler x86 assembler improvements

1) patch for fpc 3.3.1 (attachment: mkx86ins_version_bump.patch)
compiler/utils/mkx86ins.pp
Version bumped from 1.6.1 to 1.6.2
There have been changes to the code, so the version has to reflect that.

2) Patch to enable ENTER asm instruction (attachment: 
enable_asm_instr_enter.patch)

same for fpc 3.3.1 and fixes 3.2

3) patch for fpc 3.3.1 compiler/x86/x86ins.dat (attachment: 
x86ins_4_fpc331.patch)

3.1)
Rename 3DNow instruction (fixes a long-standing typo in the mnemonic).
PMULHRWA  --> PMULHRW
3.2)
Add vpclmullqlqdq, vpclmulhqlqdq, vpclmullqhqdq, vpclmulhqhqdq.
3.3)
Fix "typo" for SHA1MSG2


4) patch asm instructions for fixes 3.2 (attachment: 
x86ins_4_fixes32.patch)

add missing instructions of BMI1, BMI2, ADX, CMUL, SHA, XSAVE, MOVBE
no "code" changes, only x86ins.dat and generated files with mkx86ins
Some instructions deliberately have wrong tags in order to make no 
changes beside x86ins.dat.



5) proof-of-concept patch back-porting asm instructions to fpc 3.0.4 
(attachment: x86ins_4_fpc304.patch)
add missing instructions of BMI1, BMI2, ADX, CMUL, SHA, XSAVE, MOVBE, 
RAND

no "code" changes, but const maxinfolen = 8; to maxinfolen = 9;
x86ins.dat and generated files with mkx86ins
I did this to argue that it's safe to add asm instructions 
to fpc 3.2.3.
The engine that supports those instructions has been in production for a while 
now.




Re: [fpc-devel] Windows for AArch64

2024-05-27 Thread J. Gareth Moreton via fpc-devel

Hopefully the first issue has now been resolved, although it may require 
refactoring later.


https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/691

The main issue is that, for some reason, trying to dereference from a 
code section instead of a data section raises an access violation, 
although I'm not sure if it's a permissions issue or a subtle fault with 
the pointers.  This may require refactoring later on or if bugs manifest 
in larger projects.


Kit

On 26/05/2024 21:33, J. Gareth Moreton via fpc-devel wrote:


Thank you for all your assistance with this Sven.

One trick I have been doing is writing equivalent programs in C/C++ to 
see how Clang and MSVC convert them into equivalent assembly 
language.  It's providing some insights at least in regards to what 
works.  My first attempted fix (putting the jump table in the same 
section as the actual code) unfortunately didn't work, so it's 
something more subtle.


Kit

On 26/05/2024 11:55, Sven Barth via fpc-devel wrote:
J. Gareth Moreton via fpc-devel wrote on Sat, 25 May 2024, 22:18:


Indeed - I'm not giving up!  I installed Clang via LLVM.  Which
of the EXE files is actually the assembler?  It's not entirely
clear (no "clang-as", for example).  (Although I trust it works!)


Simply check what FPC calls. ;)

I've got some ideas as to how to start debugging. I will solve
this puzzle!

There is a tool that converts DWARF to CodeView, you can use that to 
debug with WinDBG.


Regards,
Sven



Re: [fpc-devel] Windows for AArch64

2024-05-26 Thread J. Gareth Moreton via fpc-devel


Thank you for all your assistance with this Sven.

One trick I have been doing is writing equivalent programs in C/C++ to 
see how Clang and MSVC convert them into equivalent assembly language.  
It's providing some insights at least in regards to what works.  My 
first attempted fix (putting the jump table in the same section as the 
actual code) unfortunately didn't work, so it's something more subtle.


Kit

On 26/05/2024 11:55, Sven Barth via fpc-devel wrote:
J. Gareth Moreton via fpc-devel wrote on Sat, 25 May 2024, 22:18:


Indeed - I'm not giving up!  I installed Clang via LLVM.  Which of
the EXE files is actually the assembler?  It's not entirely clear
(no "clang-as", for example).  (Although I trust it works!)


Simply check what FPC calls. ;)

I've got some ideas as to how to start debugging.  I will solve
this puzzle!

There is a tool that converts DWARF to CodeView, you can use that to 
debug with WinDBG.


Regards,
Sven



Re: [fpc-devel] Windows for AArch64

2024-05-25 Thread J. Gareth Moreton via fpc-devel

Indeed - I'm not giving up!  I installed Clang via LLVM.  Which of the 
EXE files is actually the assembler?  It's not entirely clear (no 
"clang-as", for example).  (Although I trust it works!)


I've got some ideas as to how to start debugging.  I will solve this puzzle!

Kit

On 25/05/2024 16:42, Sven Barth via fpc-devel wrote:
J. Gareth Moreton via fpc-devel wrote on Sat, 25 May 2024, 10:49:


Thought I'd give a small update.

I was distracted over the past month with work, the arm-linux blocking
bug and a couple of merge requests which were much easier to develop!
I'm now having a solid bash at getting Windows on ARM64 working.  It's
proving harder than anticipated because I can't install common helper
tools like Cygwin because there isn't a native AArch64 version available
(and x64 is not supported for emulation, it seems... only x86), and
Microsoft Visual Studio (which contains a working assembler) absolutely
refuses to install because Windows 10 on ARM64 is not supported, only
Windows 11 (and my Raspberry Pi is not "ready" to upgrade to Windows 11).


Emulation of x86_64 requires Windows 11 ;)

You should be able to install clang natively, then you can use that 
assembler. After all that's the one we need to cooperate with anyway...



I did manage to get the make process to complete with the options that
Sven listed, but despite all of the packages building, the resultant
"ppca64" executable immediately exited with no messages or anything,
even if I specified "ppca64 -i" to display supported information.  I'm
not sure if this is due to the bugs regarding exceptions and case
blocks, or some other reason.


It's very likely the case blocks, cause the compiler contains quite a 
lot of them.


Regards,
Sven


Re: [fpc-devel] Windows for AArch64

2024-05-25 Thread J. Gareth Moreton via fpc-devel


Thought I'd give a small update.

I was distracted over the past month with work, the arm-linux blocking 
bug and a couple of merge requests which were much easier to develop!  
I'm now having a solid bash at getting Windows on ARM64 working.  It's 
proving harder than anticipated because I can't install common helper 
tools like Cygwin because there isn't a native AArch64 version available 
(and x64 is not supported for emulation, it seems... only x86), and 
Microsoft Visual Studio (which contains a working assembler) absolutely 
refuses to install because Windows 10 on ARM64 is not supported, only 
Windows 11 (and my Raspberry Pi is not "ready" to upgrade to Windows 11).


I did manage to get the make process to complete with the options that 
Sven listed, but despite all of the packages building, the resultant 
"ppca64" executable immediately exited with no messages or anything, 
even if I specified "ppca64 -i" to display supported information.  I'm 
not sure if this is due to the bugs regarding exceptions and case 
blocks, or some other reason.


Nevertheless, it is a problem I am determined to solve.  I hope I'm not 
acting too greedy with the bounty, but one of my contracts is extremely 
late in paying me and it's left me in a dire financial position... 
hunger can be a good motivator sometimes! Also I figured since I have 
some experience with AArch64, this is something I should be able to handle.


Gareth aka. Kit

On 29/04/2024 21:31, Sven Barth via fpc-devel wrote:

On 29.04.2024 at 08:42, J. Gareth Moreton via fpc-devel wrote:
Aah, partially answered.  It's not supported in 3.2.2, but there is 
better support for it in the trunk.


You had me worried there for a moment that someone regenerated the 
makefiles with an older version of fpcmake... ^^'


Anyway, aside from using main you need to make sure that you have a 
current version of clang installed and preferably in PATH, because 
FPC uses it as assembler.


You then call make like this (adding parallelisation options as 
desired of course):


make all OS_TARGET=win64 CPU_TARGET=aarch64 BINUTILSPREFIX=

(note that there is a space between the last "=" and the end of the 
command line)


Regards,
Sven

Re: [fpc-devel] Windows for AArch64

2024-04-29 Thread J. Gareth Moreton via fpc-devel

Thanks Sven.  I'm predicting a future of Windows on AArch64, since 
Windows is not going anywhere but Arm processors are starting to really 
take off beyond mobile devices.


Kit

On 29/04/2024 21:31, Sven Barth via fpc-devel wrote:

On 29.04.2024 at 08:42, J. Gareth Moreton via fpc-devel wrote:
Aah, partially answered.  It's not supported in 3.2.2, but there is 
better support for it in the trunk.


You had me worried there for a moment that someone regenerated the 
makefiles with an older version of fpcmake... ^^'


Anyway, aside from using main you need to make sure that you have a 
current version of clang installed and preferably in PATH, because 
FPC uses it as assembler.


You then call make like this (adding parallelisation options as 
desired of course):


make all OS_TARGET=win64 CPU_TARGET=aarch64 BINUTILSPREFIX=

(note that there is a space between the last "=" and the end of the 
command line)


Regards,
Sven

Re: [fpc-devel] Windows for AArch64

2024-04-29 Thread J. Gareth Moreton via fpc-devel

Aah, partially answered.  It's not supported in 3.2.2, but there is 
better support for it in the trunk.


Kit

On 29/04/2024 06:42, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

I may need some help with this one.  Is there a tried and tested way 
of getting FPC to build and install on aarch64-win64? (I assume that's 
the correct OS for Windows for ARM64).  The make script doesn't seem 
to accept the combination of CPU_TARGET=aarch64 OS_TARGET=win64 and 
it's a struggle to even find the right tools for this platform.


Kit


[fpc-devel] Windows for AArch64

2024-04-28 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I may need some help with this one.  Is there a tried and tested way of 
getting FPC to build and install on aarch64-win64? (I assume that's the 
correct OS for Windows for ARM64).  The make script doesn't seem to 
accept the combination of CPU_TARGET=aarch64 OS_TARGET=win64 and it's a 
struggle to even find the right tools for this platform.


Kit


Re: [fpc-devel] Free Pascal for Windows aarch64 Bug Bounties

2024-04-27 Thread J. Gareth Moreton via fpc-devel


I figured that would be the case with the PE format.  Fun times ahead!

Kit

On 27/04/2024 13:06, Sven Barth via fpc-devel wrote:
J. Gareth Moreton via fpc-devel wrote on Sat, 27 Apr 2024, 10:00:


You've piqued my interest.  I currently only have the ability to
develop on aarch64-linux (Raspberry Pi 400), but I'm curious to
know if I can get a version of Windows to run on it, even if the
performance will be very bad.


On a Pi 4 it runs relatively well. I've done it both natively and in a 
KVM. Check the WoRProject ( https://worproject.com/ ).
A Pi 5 would be even better though there you currently might need to 
use KVM as the native drivers are still a work in progress.


So far I've tried to reproduce the issues on aarch64-linux without
any success.  40203 makes sense because it may be a specific issue
with Windows exception handling, but 40198 is a mystery because
it's more closely tied with code generation.  Can you verify that
the problem illustrated on 40198 still occurs on the trunk? (I've
confirmed that if there are more than 10 case statements, the code
generator attempts to use a jump table unless optimisations are
turned off)

It's a Windows specific issue, cause the PE format has slightly 
different relocation requirements than the ELF format.


Regards,
Sven



Re: [fpc-devel] error target i386 -Cp80486

2024-04-23 Thread J. Gareth Moreton via fpc-devel

Got it!  I would have solved this much sooner if I had just dived into 
the code instead of getting lost with git bisect!


https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/655

It was an infinite loop in the "var13", "var14" and "var15" 
optimisations that don't trigger on the Pentium II (default option) and 
later.  They set the Result to True even if no changes were actually 
made and caused the compiler to check for the same optimisation over and 
over again without progressing.
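
A minimal sketch of why a falsely reported change hangs the optimizer, assuming a driver that keeps re-running its passes until none of them reports a change. All names below (run_until_stable, normalise_buggy, normalise_fixed) are hypothetical and the lower-casing "optimisation" is only a stand-in; none of it is taken from the FPC source or from the merge request.

```python
# Toy model of a peephole driver; not FPC code.

def run_until_stable(instructions, passes, max_rounds=1000):
    """Re-run every pass until a whole round reports no change."""
    for round_number in range(1, max_rounds + 1):
        changed = False
        for apply_pass in passes:
            # A pass edits the list in place and returns True if it changed something
            if apply_pass(instructions):
                changed = True
        if not changed:
            return round_number
    # The real compiler has no such guard, hence the hang
    raise RuntimeError("no fixed point: a pass keeps reporting changes")


def normalise_buggy(instructions):
    """Reports success whenever the pattern matches, even if nothing was rewritten."""
    for index, ins in enumerate(instructions):
        if ins.lower().startswith("mov"):
            instructions[index] = ins.lower()
            return True  # bug: True even when ins was already lower case
    return False


def normalise_fixed(instructions):
    """Reports success only when the instruction text really changed."""
    for index, ins in enumerate(instructions):
        lowered = ins.lower()
        if lowered.startswith("mov") and lowered != ins:
            instructions[index] = lowered
            return True
    return False


code = ["MOV eax, 1", "ret"]
print(run_until_stable(code, [normalise_fixed]))  # 2
print(code)                                       # ['mov eax, 1', 'ret']
```

With normalise_buggy in place of normalise_fixed, the same call never reaches a fixed point and only stops because of the max_rounds guard, which is there solely to keep the sketch from hanging.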


Kit

P.S. Admittedly those optimisations could use some maintenance since I 
shouldn't need to set the operand sizes, but I'll save that for another day.


On 24/04/2024 05:08, J. Gareth Moreton via fpc-devel wrote:

I've gotten back as far as here, and it's still bad:

commit 7fbda0e0e8b1d071e72ccbc5e487dbb1c2173c63 (HEAD)
Author: Jonas Maebe 
Date:   Wed Mar 24 14:33:09 2021 +

  * support building with FPC 3.2.2

    git-svn-id: trunk@49045 -

I can't check earlier without difficulty because 3.2.2 doesn't work.  
However the hang (and also an eventual out of memory error) doesn't 
occur if -OoNOPEEPHOLE is specified, meaning the problem is located in 
the peephole optimizer.  So it looks like I'll have to try to debug 
this the hard way!


Kit

On 23/04/2024 17:27, J. Gareth Moreton via fpc-devel wrote:
I've reproduced the hang doing "make clean all CPU_TARGET=i386 
OS_TARGET=win32 OPT="-Cp80486 -Op80486"" on my x86_64-win64 machine.


So far I haven't found the bad commit - this problem has been here a 
while.


Kit

I still haven't found the bad commit!

On 23/04/2024 12:46, J. Gareth Moreton via fpc-devel wrote:

Absolutely I can.  I'll see what I can find.

Gareth aka. Kit

On 23/04/2024 12:09, Tomas Hajny via fpc-devel wrote:

On 2024-04-23 11:50, Marģers . via fpc-devel wrote:

1) does not work
make clean singlezipinstall OS_TARGET=win32 CPU_TARGET=i386
ALLOW_WARNINGS=1 OPT="  -O2 -vxitl -Cp80486 -Op80486"

hangs on
system.inc(421,2) Start reading includefile
C:\Users\Lietotajs\Downloads\fora\a\486\gh\rtl\inc\generic.inc

 .
 .

Indeed. This is clearly an issue related to optimizations. The 
important point missing here is that it happens when compiling with 
ppc1.exe. I just tried building a trunk compiler (make rtl_all 
compiler_all) with "OPT=-O- -Cp80486 -Op80486" and then started 
another compilation round with compiler built this way, but using 
-O2 this time, in order to make sure that it wasn't an optimization 
bug in the released 3.2.2 compiler. The new compilation round 
(using a non-optimized compiler) hangs at exactly the same 
place, i.e. it's a bug (probably endless loop?) in the optimization 
code specific to the trunk compiler. @Gareth, could you please have 
a look at it?


Tomas

Re: [fpc-devel] error target i386 -Cp80486

2024-04-23 Thread J. Gareth Moreton via fpc-devel


I've gotten back as far as here, and it's still bad:

commit 7fbda0e0e8b1d071e72ccbc5e487dbb1c2173c63 (HEAD)
Author: Jonas Maebe 
Date:   Wed Mar 24 14:33:09 2021 +

  * support building with FPC 3.2.2

    git-svn-id: trunk@49045 -

I can't check earlier without difficulty because 3.2.2 doesn't work.  
However the hang (and also an eventual out of memory error) doesn't 
occur if -OoNOPEEPHOLE is specified, meaning the problem is located in 
the peephole optimizer.  So it looks like I'll have to try to debug this 
the hard way!


Kit

On 23/04/2024 17:27, J. Gareth Moreton via fpc-devel wrote:
I've reproduced the hang doing "make clean all CPU_TARGET=i386 
OS_TARGET=win32 OPT="-Cp80486 -Op80486"" on my x86_64-win64 machine.


So far I haven't found the bad commit - this problem has been here a 
while.


Kit

I still haven't found the bad commit!

On 23/04/2024 12:46, J. Gareth Moreton via fpc-devel wrote:

Absolutely I can.  I'll see what I can find.

Gareth aka. Kit

On 23/04/2024 12:09, Tomas Hajny via fpc-devel wrote:

On 2024-04-23 11:50, Marģers . via fpc-devel wrote:

1) does not work
make clean singlezipinstall OS_TARGET=win32 CPU_TARGET=i386
ALLOW_WARNINGS=1 OPT="  -O2 -vxitl -Cp80486 -Op80486"

hangs on
system.inc(421,2) Start reading includefile
C:\Users\Lietotajs\Downloads\fora\a\486\gh\rtl\inc\generic.inc

 .
 .

Indeed. This is clearly an issue related to optimizations. The 
important point missing here is that it happens when compiling with 
ppc1.exe. I just tried building a trunk compiler (make rtl_all 
compiler_all) with "OPT=-O- -Cp80486 -Op80486" and then started 
another compilation round with compiler built this way, but using 
-O2 this time, in order to make sure that it wasn't an optimization 
bug in the released 3.2.2 compiler. The new compilation round (using 
a non-optimized compiler) hangs at exactly the same place, i.e. 
it's a bug (probably endless loop?) in the optimization code 
specific to the trunk compiler. @Gareth, could you please have a 
look at it?


Tomas

Re: [fpc-devel] error target i386 -Cp80486

2024-04-23 Thread J. Gareth Moreton via fpc-devel

I've reproduced the hang doing "make clean all CPU_TARGET=i386 
OS_TARGET=win32 OPT="-Cp80486 -Op80486"" on my x86_64-win64 machine.


So far I haven't found the bad commit - this problem has been here a while.

Kit

I still haven't found the bad commit!

On 23/04/2024 12:46, J. Gareth Moreton via fpc-devel wrote:

Absolutely I can.  I'll see what I can find.

Gareth aka. Kit

On 23/04/2024 12:09, Tomas Hajny via fpc-devel wrote:

On 2024-04-23 11:50, Marģers . via fpc-devel wrote:

1) does not work
make clean singlezipinstall OS_TARGET=win32 CPU_TARGET=i386
ALLOW_WARNINGS=1 OPT="  -O2 -vxitl -Cp80486 -Op80486"

hangs on
system.inc(421,2) Start reading includefile
C:\Users\Lietotajs\Downloads\fora\a\486\gh\rtl\inc\generic.inc

 .
 .

Indeed. This is clearly an issue related to optimizations. The 
important point missing here is that it happens when compiling with 
ppc1.exe. I just tried building a trunk compiler (make rtl_all 
compiler_all) with "OPT=-O- -Cp80486 -Op80486" and then started 
another compilation round with compiler built this way, but using -O2 
this time, in order to make sure that it wasn't an optimization bug 
in the released 3.2.2 compiler. The new compilation round (using a 
non-optimized compiler) hangs at exactly the same place, i.e. it's 
a bug (probably endless loop?) in the optimization code specific to 
the trunk compiler. @Gareth, could you please have a look at it?


Tomas

Re: [fpc-devel] error target i386 -Cp80486

2024-04-23 Thread J. Gareth Moreton via fpc-devel


Absolutely I can.  I'll see what I can find.

Gareth aka. Kit

On 23/04/2024 12:09, Tomas Hajny via fpc-devel wrote:

On 2024-04-23 11:50, Marģers . via fpc-devel wrote:

1) does not work
make clean singlezipinstall OS_TARGET=win32 CPU_TARGET=i386
ALLOW_WARNINGS=1 OPT="  -O2 -vxitl -Cp80486 -Op80486"

hangs on
system.inc(421,2) Start reading includefile
C:\Users\Lietotajs\Downloads\fora\a\486\gh\rtl\inc\generic.inc

 .
 .

Indeed. This is clearly an issue related to optimizations. The 
important point missing here is that it happens when compiling with 
ppc1.exe. I just tried building a trunk compiler (make rtl_all 
compiler_all) with "OPT=-O- -Cp80486 -Op80486" and then started 
another compilation round with compiler built this way, but using -O2 
this time, in order to make sure that it wasn't an optimization bug in 
the released 3.2.2 compiler. The new compilation round (using a 
non-optimized compiler) hangs at exactly the same place, i.e. it's 
a bug (probably endless loop?) in the optimization code specific to 
the trunk compiler. @Gareth, could you please have a look at it?


Tomas

[fpc-devel] Planning refactor of x86 OptPass1MOV

2024-04-15 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

While I'm starting to focus more on node-level optimisations, since they 
benefit more platforms and can produce better optimisations in some 
situations due to cross-platform support and things like register 
allocation, I'm thinking about refactoring and improving x86's 
OptPass1MOV routine for the following main reasons:


 * It has become a massive, monolithic Frankenstein's monster of an
   optimisation routine with lots of different optimisations and styles
   of optimisation that are a little difficult to track sometimes (this
   is partly a given though since about 25% of all x86 assembly
   language consists of a MOV instruction).  Some refactoring will make
   it more maintainable and potentially faster and smaller.
 * It's not as neat and far-reaching as, say, the equivalent routine
   for the AArch64 peephole optimizer, which utilises
   GetNextInstructionUsingReg that can produce more long-range
   optimisations under -O3.

In other news, pure functions are still waiting for some third-party 
testing although I've fixed a couple of bugs in the interim.  Currently, 
the main sticking point is that floating-point values are not being 
propagated (pure function analysis utilises constant propagation, 
deadstore removal and in-depth node optimisation to calculate output 
values).  Making a patch to enable this has revealed some shortcomings, 
notably how programs and tests behave when floating-point exceptions are 
enabled (operations where deterministic values are divided by zero are 
being replaced with constants equalling infinity instead of raising an 
exception) and also some inconsistent behaviour across platforms where 
the Currency type is concerned.  Nevertheless, things are coming together.


Kit

Re: [fpc-devel] wrong result for abs(low(int64))

2024-04-04 Thread J. Gareth Moreton via fpc-devel

Essentially, an arithmetic overflow is happening.  Since the largest 
Int64 possible is 9,223,372,036,854,775,807, going one above that (the 
true result of abs(low(int64))) wraps back around to -9,223,372,036,854,775,808.


Internally, you can think about negating (positing?) a negative number 
as inverting the bits and then adding one (aka. two's complement), so 
low(int64) is 1000...000, which inverted becomes 0111...111, and then 
adding one results in 1000...000 again.
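
The invert-and-add-one step can be checked with a small sketch. Python integers are unbounded, so a 64-bit mask is used here to imitate Int64 wrap-around; the helper names (to_signed64, negate64, abs64) are made up for illustration and this is not how FPC implements abs.

```python
# Emulate 64-bit two's complement arithmetic with a mask.
MASK64 = (1 << 64) - 1
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def to_signed64(bits):
    """Interpret the low 64 bits as a two's complement signed value."""
    bits &= MASK64
    return bits - (1 << 64) if bits & (1 << 63) else bits


def negate64(value):
    """Negate by inverting all 64 bits and adding one, wrapping on overflow."""
    inverted = ~(value & MASK64) & MASK64
    return to_signed64(inverted + 1)


def abs64(value):
    """Absolute value built on the wrapping negation above."""
    return negate64(value) if value < 0 else value


print(abs64(-5))                      # 5
print(negate64(INT64_MAX))            # -9223372036854775807
print(abs64(INT64_MIN))               # -9223372036854775808
print(abs64(INT64_MIN) == INT64_MIN)  # True
```

Every other negative Int64 has a positive counterpart; only the lowest value maps back onto itself, because +9,223,372,036,854,775,808 does not fit in 63 value bits.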


Kit

On 04/04/2024 14:14, Martin Frb via fpc-devel wrote:

The below writes:  -9223372036854775808

Shouldn't absolute return a positive number?

program Project1;
begin
  writeln( abs(low(int64)) );
end.



Seems
writeln( abs(low(longint)) );

also returns negative...


Re: [fpc-devel] Unaligned access on Cortex-M0 in Initialization code

2024-04-01 Thread J. Gareth Moreton via fpc-devel


Oops - I'm sorry for my introduced bug!

Gareth aka. Kit

On 31/03/2024 21:48, Michael Ring via fpc-devel wrote:


Works, thank you!

Michael

On 31.03.24 at 22:18, Florian Klämpfl via fpc-devel wrote:



On 31.03.2024 at 21:58, Michael Ring via fpc-devel wrote:


This is what I see (guess the same thing):

New Compiler:

FPC_INITIALIZE:
.Lc3882:
# path: /Users/ring/devel/fpc/rtl/inc/
# file: rtti.inc
# indx: 19
.Ll10741:
    push    {r4,r5,r14}
...

    ldr r0,[r0, r1]
    mov r15,r0
.La5:
    .long   .Lj13323



My last commit should fix it.



Re: [fpc-devel] i386-win32 -CriotR fails to build

2024-03-01 Thread J. Gareth Moreton via fpc-devel


Excellent, thank you Michael.

Kit

On 01/03/2024 20:56, Michael Van Canneyt via fpc-devel wrote:



On Fri, 1 Mar 2024, J. Gareth Moreton via fpc-devel wrote:

Just want to confirm that the failure also occurs on x86_64-win64 
under -CriotR rules.


On all platforms. I fixed compilation with these flags.

Michael.

Re: [fpc-devel] i386-win32 -CriotR fails to build

2024-03-01 Thread J. Gareth Moreton via fpc-devel

Just want to confirm that the failure also occurs on x86_64-win64 under 
-CriotR rules.


Kit

On 01/03/2024 18:18, J. Gareth Moreton via fpc-devel wrote:

Hi everyone.

As part of my automated tests I try to build the compiler and packages 
on i386-win32 under the options "-O4 -CriotR".  Doing so gives a 
failure with the vcl_compat package (the failure also occurs with just 
"-CriotR").  Can others confirm?


External command 
"C:/Users/garet/Documents/programming/fpc-opts/compiler/ppc386.exe 
-Twin32 -FUvcl-compat\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\rtl\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-base\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-objpas\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-xml\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-web\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-db\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-extra\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\ibase\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\mysql\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\odbc\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\oracle\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\postgres\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\sqlite\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\dblib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\pxlib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-json\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-fpcunit\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\paszlib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\hash\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\libtar\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-net\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-passrc\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-process\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-hash\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-registry\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\openssl\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fastcgi\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\httpd22\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\httpd24\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\winunits-base\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\libmicrohttpd\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-generics\units\i386-win32\ 
-Fuvcl-compat\src -Fivcl-compat\src -Ur -Xs -O2 -n -O4 -CriotR -di386 
-dRELEASE -XX -CX -Sc -viq vcl-compat\BuildUnit_vcl_compat.pp" failed 
with exit code 1. Console output:

Target OS: Win32 for i386
Compiling vcl-compat\BuildUnit_vcl_compat.pp
Compiling .\vcl-compat\src\system.permissions.pp
Compiling .\vcl-compat\src\system.messaging.pp
Compiling .\vcl-compat\src\system.netencoding.pp
Writing Resource String Table file: system.netencoding.rsj
Compiling .\vcl-compat\src\system.ioutils.pp
Writing Resource String Table file: system.ioutils.rsj
Compiling .\vcl-compat\src\system.devices.pp
Compiling .\vcl-compat\src\system.analytics.pp
Compiling .\vcl-compat\src\system.ansistrings.pp
Compiling .\vcl-compat\src\system.imagelist.pp
Compiling .\vcl-compat\src\system.diagnostics.pp
Compiling .\vcl-compat\src\system.notification.pp
Compiling .\vcl-compat\src\system.json.pp
Writing Resource String Table file: system.json.rsj
Compiling .\vcl-compat\src\system.pushnotifications.pp
Writing Resource String Table file: system.pushnotifications.rsj
Compiling .\vcl-compat\src\system.hash.pp
Writing Resource String Table file: system.hash.rsj
Compiling .\vcl-compat\src\system.credentials.pp
Writing Resource String Table file: system.credentials.rsj
Compiling .\vcl-compat\src\system.threading.pp
system.threading.pp(3953,51) Error: Incompatible type for arg no. 2: 
Got "Class Of TAbstractTask.IInternalTask", expected "TClass"
system.threading.pp(5030) Fatal: There were 1 errors compiling module, 
stopping

Fatal: Compilation aborted

The installer encountered the following error:
Compilation of "BuildUnit_vcl_compat.pp" failed
  $00495903
  $0049D1E2

[fpc-devel] i386-win32 -CriotR fails to build

2024-03-01 Thread J. Gareth Moreton via fpc-devel


Hi everyone.

As part of my automated tests I try to build the compiler and packages 
on i386-win32 under the options "-O4 -CriotR".  Doing so gives a failure 
with the vcl_compat package (the failure also occurs with just 
"-CriotR").  Can others confirm?


External command 
"C:/Users/garet/Documents/programming/fpc-opts/compiler/ppc386.exe 
-Twin32 -FUvcl-compat\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\rtl\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-base\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-objpas\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-xml\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-web\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-db\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-extra\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\ibase\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\mysql\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\odbc\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\oracle\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\postgres\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\sqlite\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\dblib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\pxlib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-json\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-fpcunit\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\paszlib\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\hash\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\libtar\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-net\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-passrc\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-process\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-hash\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fcl-registry\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\openssl\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\fastcgi\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\httpd22\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\httpd24\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\winunits-base\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\libmicrohttpd\units\i386-win32\ 
-FuC:\Users\garet\Documents\programming\fpc-opts\packages\rtl-generics\units\i386-win32\ 
-Fuvcl-compat\src -Fivcl-compat\src -Ur -Xs -O2 -n -O4 -CriotR -di386 
-dRELEASE -XX -CX -Sc -viq vcl-compat\BuildUnit_vcl_compat.pp" failed 
with exit code 1. Console output:

Target OS: Win32 for i386
Compiling vcl-compat\BuildUnit_vcl_compat.pp
Compiling .\vcl-compat\src\system.permissions.pp
Compiling .\vcl-compat\src\system.messaging.pp
Compiling .\vcl-compat\src\system.netencoding.pp
Writing Resource String Table file: system.netencoding.rsj
Compiling .\vcl-compat\src\system.ioutils.pp
Writing Resource String Table file: system.ioutils.rsj
Compiling .\vcl-compat\src\system.devices.pp
Compiling .\vcl-compat\src\system.analytics.pp
Compiling .\vcl-compat\src\system.ansistrings.pp
Compiling .\vcl-compat\src\system.imagelist.pp
Compiling .\vcl-compat\src\system.diagnostics.pp
Compiling .\vcl-compat\src\system.notification.pp
Compiling .\vcl-compat\src\system.json.pp
Writing Resource String Table file: system.json.rsj
Compiling .\vcl-compat\src\system.pushnotifications.pp
Writing Resource String Table file: system.pushnotifications.rsj
Compiling .\vcl-compat\src\system.hash.pp
Writing Resource String Table file: system.hash.rsj
Compiling .\vcl-compat\src\system.credentials.pp
Writing Resource String Table file: system.credentials.rsj
Compiling .\vcl-compat\src\system.threading.pp
system.threading.pp(3953,51) Error: Incompatible type for arg no. 2: Got 
"Class Of TAbstractTask.IInternalTask", expected "TClass"
system.threading.pp(5030) Fatal: There were 1 errors compiling module, 
stopping

Fatal: Compilation aborted

The installer encountered the following error:
Compilation of "BuildUnit_vcl_compat.pp" failed
  $00495903
  $0049D1E2
  $0049CADC
  $0049D878
  $00494A39
make[2]: *** [smart] Error 1
make[2]: Leaving directory 
`C:/Users/garet/Documents/programming/fpc-opts/packages'

make[1]: *** [packages_smart] Error 2
make[1]: Leaving directory

Re: [fpc-devel] ARM: AND/CMP -> TST optimisation produces incorrect results

2024-02-28 Thread J. Gareth Moreton via fpc-devel


Hi Garry,

Hopefully I have fixed this issue now, which is also causing problems 
elsewhere.


https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/598 - just 
waiting on it to be verified, approved and merged.


Gareth aka. Kit

On 20/02/2024 06:32, J. Gareth Moreton via fpc-devel wrote:


Thanks for the report and especially your investigative work. I'll 
take a look to see what's going on.


Gareth aka. Kit

On 20/02/2024 01:30, Garry Wood via fpc-devel wrote:


Hello,

Commit 6b2e4fa4 (main) entitled “* arm: "OpCmp2OpS" moved to Pass 2 
so it doesn't conflict with AND; CMP -> TST optimisation” by Gareth 
from Feb 11 2024 produces incorrect assembler in certain cases.


https://gitlab.com/freepascal.org/fpc/source/-/commit/6b2e4fa4133a496c1c3f89e3c71fffbdd7c192fb

This piece of code:

function CPUMaskCount(CPUMask:LongWord):LongWord;
var
 Count:LongWord;
begin
 {}
 Result:=0;
 for Count:=CPU_ID_0 to CPU_ID_MAX do
  begin
   if (CPUMask and (1 shl Count)) <> 0 then
    begin
     Inc(Result);
    end;
  end;
end;

when compiled with FPC prior to commit 6b2e4fa4 produces the 
following working assembler:


00020528 :
   20528: e1a01000    mov   r1, r0
   2052c: e3a00000    mov   r0, #0
   20530: e3a02000    mov   r2, #0
   20534: e3a03001    mov   r3, #1
   20538: e0113213    ands  r3, r1, r3, lsl r2
   2053c: 12800001    addne r0, r0, #1
   20540: e2822001    add   r2, r2, #1
   20544: e352001f    cmp   r2, #31
   20548: 9afffff9    bls   20534
   2054c: e12fff1e    bx    lr

But when compiled with FPC after commit 6b2e4fa4 it produces this 
assembler which doesn’t work:


00020528 :
   20528: e1a01000    mov   r1, r0
   2052c: e3a00000    mov   r0, #0
   20530: e3a02000    mov   r2, #0
   20534: e3a03001    mov   r3, #1
   20538: e1110003    tst   r1, r3
   2053c: 12800001    addne r0, r0, #1
   20540: e2822001    add   r2, r2, #1
   20544: e352001f    cmp   r2, #31
   20548: 9afffff9    bls   20534
   2054c: e12fff1e    bx    lr

You can see that the difference is the lack of lsl r2 on the end of 
the TST instruction which means that the shl on the original code is 
not being performed and the test is therefore invalid.


Similar code sequences in multiple other places produce the same 
result with the lsl suffix missing from the TST instruction.
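
A small self-check along these lines could catch this kind of miscompilation on the target. It is only a sketch: the values of CPU_ID_0 and CPU_ID_MAX are assumptions (0 and 31, inferred from the "cmp r2, #31" in the listings), and the expected counts are simply the number of set bits in each mask.

```pascal
program MaskCountCheck;
{$mode objfpc}

const
  CPU_ID_0   = 0;   // assumed value, not taken from the original source
  CPU_ID_MAX = 31;  // assumed from the "cmp r2, #31" in the listings

function CPUMaskCount(CPUMask: LongWord): LongWord;
var
  Count: LongWord;
begin
  Result := 0;
  for Count := CPU_ID_0 to CPU_ID_MAX do
    if (CPUMask and (LongWord(1) shl Count)) <> 0 then
      Inc(Result);
end;

begin
  // Masks with a known number of set bits: 4 and 2 respectively.
  WriteLn(CPUMaskCount($0000000F));
  WriteLn(CPUMaskCount($80000001));
  // If the shift were dropped, every iteration would test bit 0 only,
  // so the counts would no longer match the number of set bits.
  if (CPUMaskCount($0000000F) <> 4) or (CPUMaskCount($80000001) <> 2) then
    Halt(1);
end.
```

A non-zero exit code then signals that the loop body was compiled incorrectly.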


Please let me know if you need any further information.

Garry Wood.



[fpc-devel] Possible bug in "chmreader"

2024-02-21 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

While evaluating a new peephole optimisation, I came across a null 
pointer dereference in the assembly language.  After looking at the 
original Pascal code, I came across this starting at line 525 of 
packages/chm/src/chmreader.pas:


procedure TChmReader.ReadWindows(mem:TMemoryStream);

var
  i,cnt,
  version   : integer;
  x : TChmWindow;
begin
 if not assigned(fwindowslist) then
 fWindowsList.Clear;
 mem.Position:=0;
 ...

This code looks very suspicious to me because it calls 
fWindowsList.Clear only if fWindowsList is a null pointer.  This will 
instantly cause an access violation (Clear is not a class method).
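
For comparison, a minimal sketch of the guard as it was presumably intended, with the condition inverted. This is an assumption about the intent rather than the actual fix, and it uses a plain TList in place of the reader's fWindowsList field.

```pascal
program AssignedGuardDemo;
{$mode objfpc}
uses
  Classes;

var
  L: TList;  // stands in for fWindowsList; the real field type may differ
begin
  L := nil;
  // With the inverted test, Clear is skipped while the list is nil.
  if Assigned(L) then
    L.Clear;

  L := TList.Create;
  try
    L.Add(nil);
    // Once the list exists, the same guard lets Clear run safely.
    if Assigned(L) then
      L.Clear;
    WriteLn(L.Count);
  finally
    L.Free;
  end;
end.
```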


Without the new optimisation, this is what the x86_64-win64 assembly 
language looks like:


    cmpq    $0,280(%rbx)
    jne    .Lj189
    movq    280(%rbx),%rcx
    movq    (%rcx),%rax
    call    *216(%rax)
.Lj189:

If JNE doesn't branch, then the value at 280(%rbx) is zero, and this is 
then copied into %rcx, then the value referenced by %rcx is stored in 
%rax, however because the value at 280(%rbx) is zero, then %rcx is also 
zero and (%rcx) is a null pointer dereference.


Kit


Re: [fpc-devel] ARM: AND/CMP -> TST optimisation produces incorrect results

2024-02-19 Thread J. Gareth Moreton via fpc-devel

Thanks for the report and especially your investigative work. I'll take 
a look to see what's going on.


Gareth aka. Kit

On 20/02/2024 01:30, Garry Wood via fpc-devel wrote:


Hello,

Commit 6b2e4fa4 (main) entitled “* arm: "OpCmp2OpS" moved to Pass 2 so 
it doesn't conflict with AND; CMP -> TST optimisation” by Gareth from 
Feb 11 2024 produces incorrect assembler in certain cases.


https://gitlab.com/freepascal.org/fpc/source/-/commit/6b2e4fa4133a496c1c3f89e3c71fffbdd7c192fb

This piece of code:

function CPUMaskCount(CPUMask:LongWord):LongWord;
var
 Count:LongWord;
begin
 {}
 Result:=0;
 for Count:=CPU_ID_0 to CPU_ID_MAX do
  begin
   if (CPUMask and (1 shl Count)) <> 0 then
    begin
     Inc(Result);
    end;
  end;
end;

when compiled with FPC prior to commit 6b2e4fa4 produces the following 
working assembler:


00020528 :
   20528: e1a01000    mov   r1, r0
   2052c: e3a00000    mov   r0, #0
   20530: e3a02000    mov   r2, #0
   20534: e3a03001    mov   r3, #1
   20538: e0113213    ands  r3, r1, r3, lsl r2
   2053c: 12800001    addne r0, r0, #1
   20540: e2822001    add   r2, r2, #1
   20544: e352001f    cmp   r2, #31
   20548: 9afffff9    bls   20534
   2054c: e12fff1e    bx    lr

But when compiled with FPC after commit 6b2e4fa4 it produces this 
assembler which doesn’t work:


00020528 :
   20528: e1a01000    mov   r1, r0
   2052c: e3a00000    mov   r0, #0
   20530: e3a02000    mov   r2, #0
   20534: e3a03001    mov   r3, #1
   20538: e1110003    tst   r1, r3
   2053c: 12800001    addne r0, r0, #1
   20540: e2822001    add   r2, r2, #1
   20544: e352001f    cmp   r2, #31
   20548: 9afffff9    bls   20534
   2054c: e12fff1e    bx    lr

You can see that the difference is the lack of lsl r2 on the end of 
the TST instruction which means that the shl on the original code is 
not being performed and the test is therefore invalid.


Similar code sequences in multiple other places produce the same 
result with the lsl suffix missing from the TST instruction.


Please let me know if you need any further information.

Garry Wood.



[fpc-devel] Compiler warning when built with -dDEBUG_NODE_XML

2024-02-14 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

After some recent updates to the trunk, the compiler no longer 
successfully builds when -dDEBUG_NODE_XML is specified:


symsym.pas(2885,9) Warning: (treated as error) Case statement does not 
handle all possible cases


This is located within "procedure TConstSym.XMLPrintConstData(var T: 
Text);", which outputs information on defined constants into the node 
dump.  I assume this is because of commit 
fe62b3ace8c237d8bd1800beb5969e5cb540723f (Michael Van Canneyt's work) 
which introduces the "constwresourcestring" constant type, which isn't 
handled by the identified case block.
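
As a simplified illustration of the situation, the sketch below uses a hypothetical enumeration (TConstKind) and function (KindName); neither is taken from symsym.pas. A case statement over an enumeration needs a branch, or an else clause, for every value; adding a new enumeration value without touching the case block is what triggers this kind of warning.

```pascal
program CaseCoverageDemo;
{$mode objfpc}

type
  { Hypothetical stand-in for the compiler's constant kinds. }
  TConstKind = (ckOrdinal, ckString, ckWideResourceString);

function KindName(K: TConstKind): string;
begin
  Result := '';
  case K of
    ckOrdinal:
      Result := 'ordinal';
    ckString:
      Result := 'string';
    // Removing this branch leaves one enumeration value unhandled.
    ckWideResourceString:
      Result := 'wide resource string';
  end;
end;

begin
  WriteLn(KindName(ckWideResourceString));
end.
```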


Kit


Re: [fpc-devel] Modifiers...

2024-01-24 Thread J. Gareth Moreton via fpc-devel

Note that for 1), it should be "end.", not "end;" - make sure that isn't 
causing your error.


Subroutine directives like "vectorcall" I think are usable as variable 
names.


Kit

On 24/01/2024 22:29, Martin Frb via fpc-devel wrote:

https://www.freepascal.org/docs-html/ref/refsu3.html

Is this list complete/correct?

1)
It lists bitpacked, but
    program foo; var  bitpacked: integer;  begin end;
gives an error.

I thought modifiers can be used as var names?

2)
Is there, or has there once been?
(found in the synedit highlighter)
  final
  automated
  optional


3)
Not on the list, but I suspect they should be?
Some are on https://www.freepascal.org/docs-html/ref/refse100.html
  sealed
  inline
  mwpascal
  noinline
  weakexternal
  compilerproc
  vectorcall





[fpc-devel] Internal "Signed wrap" function

2023-12-01 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I'm just fixing a bugged optimisation, and I came across a situation 
where I need to essentially do a "signed wrap" of a constant.  At one 
point I'm optimising "subl $-12,%eax", "addl $1,%eax" into a single 
instruction, but to make sure (untrapped) overflows are handled 
correctly, the calculated constant has to be wrapped modulo 2^32 (since 
%eax is a 32-bit register).  Normally this is as simple as performing a 
bitwise AND against $FFFFFFFF, but it produces the number 4294967285 
rather than the more compact -11 (this is because the operands are 
stored as 64-bit integers within the compiler).  Storing the smaller 
negative number will most likely result in a smaller number of bytes in 
the final machine code as well as avoid any potential assembler errors.


Currently I'm just using a local function to perform the sign extension 
of arbitrary length (since I have to extend either an 8-bit, 16-bit or 
32-bit value depending on the operand size). Does the compiler contain 
such a function already in a utility unit, and if not, would one be 
welcomed?
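
A minimal sketch of such a helper, under the assumption that operands are held as Int64 and that the width is 8, 16 or 32 bits. The name SignedWrap is hypothetical and does not refer to an existing compiler routine.

```pascal
program SignedWrapDemo;
{$mode objfpc}
{$Q-}{$R-}  // the bit manipulation below relies on unchecked arithmetic

{ Hypothetical helper: keeps the low Bits bits of Value and sign-extends
  the result back to 64 bits.  Bits is assumed to be 8, 16 or 32. }
function SignedWrap(Value: Int64; Bits: Byte): Int64;
var
  Mask, SignBit, U: QWord;
begin
  if Bits >= 64 then
    Exit(Value);
  Mask := (QWord(1) shl Bits) - 1;
  SignBit := QWord(1) shl (Bits - 1);
  U := QWord(Value) and Mask;   // wrap modulo 2^Bits
  if (U and SignBit) <> 0 then
    U := U or not Mask;         // fill the upper bits with the sign
  Result := Int64(U);
end;

begin
  WriteLn(SignedWrap(4294967285, 32));  // -11
  WriteLn(SignedWrap($1FF, 8));         // -1
  WriteLn(SignedWrap(127, 8));          // 127
end.
```

Working on the value as a QWord avoids depending on how right shifts treat the sign bit.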


Kit


[fpc-devel] Accidental file inclusion in repository

2023-11-29 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I hate to point fingers, but there's a 0-byte file named "HEAD" in the 
repository, which causes git to throw a tantrum sometimes - it was 
introduced in the following commit:


commit a4c324ee237674950e4675894df386519b75a130
Author: Rika Ichinose 
Date:   Fri Apr 14 09:24:55 2023 +0300

    Fill* for x64, physically sharing half of the code with FillChar.

Kit


[fpc-devel] Interesting short article about optimisation

2023-11-25 Thread J. Gareth Moreton via fpc-devel

I just stumbled across this article about micro-architecture-specific 
optimisations in ARM: 
https://www.phoronix.com/news/ARM64-Linux-No-Uarch-Opts


They briefly mention x86_64, and I agree it's good to avoid 
micro-architecture-specific optimisations and now it makes me wonder 
where the line is drawn, like the recent introduction of the optimizer 
hint regarding LEA instructions being slow on some middle-aged CPUs.


Kit


[fpc-devel] Arm compiler limitation

2023-11-23 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

So one of my recent merge requests 
(https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/516) has 
been having test failures on arm-linux, and this has confused me for a 
while because they don't occur on aarch64-linux (the two platforms share 
the same code in this particular case).  Well, after a lot of debugging 
and random shots in the dark, I've found the problem.


RegModifiedByInstruction and RegInInstruction don't handle 
NR_DEFAULTFLAGS on the Arm implementation - they always return False, so 
calling RegUsedBetween or RegModifiedBetween for the flags register 
always returns a false negative.  The other fault is that they don't 
check if the input object is an instruction (there's no "if p.typ <> 
ait_instruction then Exit;" check).


I guess taking a step back and taking a breath can help find the 
solution, so I'll be fixing this with the merge request.


Kit

P.S. The above compiler methods are only used by the peephole optimizer, 
and Arm's peephole optimizer is relatively basic currently.



[fpc-devel] Quirk is "IsJumpToLabel"

2023-11-10 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I've been developing a new optimisation for x86, and in one situation a 
JMP becomes a Jcc.  To make sure it's valid, I ensure that 
"IsJumpToLabel" returns True before the change is made.  All was well in 
x86_64-win64 and x86_64-linux, but on i386-linux, I came across a bit of 
an anomaly:


    jmp _$RTTI$_Ld3(,%eax,4)

It turns out that "IsJumpToLabel" returns true for this construct, which 
is not valid for Jcc.  _$RTTI$_Ld3 is a jump table stored as a data 
structure.  The question is though... should this be treated as a jump 
to a label?


Currently, my optimisation fails on i386-linux because of 
"IsJumpToLabel" returning True on this.  I can modify my code so it 
makes sure there's no index register, but this feels a bit hacky and 
there may be other, unrelated blocks of code that could fall foul of a 
similar situation, and I personally feel that "IsJumpToLabel" should 
return True only for pure labels.  However, such a change will affect 
other platforms and I don't yet know what effect that will have.
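
The stricter check can be sketched on a simplified model. TMockRef and IsPureLabelRef below are hypothetical stand-ins and not the compiler's own reference type or the existing "IsJumpToLabel" routine; the point is only that a reference counts as a pure label when it has a symbol and neither a base nor an index register.

```pascal
program PureLabelCheck;
{$mode objfpc}

type
  { Hypothetical, simplified model of a jump operand's memory reference. }
  TMockRef = record
    Symbol: string;  // e.g. '.Lj9' or '_$RTTI$_Ld3'
    Base: string;    // base register name, '' when absent
    Index: string;   // index register name, '' when absent
  end;

{ True only for a bare symbol with no registers involved. }
function IsPureLabelRef(const R: TMockRef): Boolean;
begin
  Result := (R.Symbol <> '') and (R.Base = '') and (R.Index = '');
end;

var
  Plain, Table: TMockRef;
begin
  Plain.Symbol := '.Lj9';
  Plain.Base := '';
  Plain.Index := '';

  // Models "jmp _$RTTI$_Ld3(,%eax,4)": a symbol plus an index register.
  Table.Symbol := '_$RTTI$_Ld3';
  Table.Base := '';
  Table.Index := 'eax';

  WriteLn(IsPureLabelRef(Plain));  // TRUE
  WriteLn(IsPureLabelRef(Table));  // FALSE
end.
```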


As a side note, because of the principle of relocation under x86_64, 
jump table access is more complex.  The equivalent code for "jmp 
_$RTTI$_Ld3(,%eax,4)" on x86_64-win64 is:


    leaq    .Ld3(%rip),%rdx
    movslq    (%rdx,%rax,4),%rax
    addq    %rdx,%rax
    jmp    *%rax

(The jump table contains relative addresses rather than absolute 
addresses, hence the need for "addq %rdx,%rax")


And of course, the JMP instruction is not considered a jump to a label.

Kit


Re: [fpc-devel] Kit's current work

2023-11-08 Thread J. Gareth Moreton via fpc-devel



On 08/11/2023 21:11, Florian Klämpfl via fpc-devel wrote:

On 08.11.2023 at 21:22, J. Gareth Moreton via fpc-devel wrote:

- I don't know what the eventual support for intrinsics will be for FPC, if it 
will ever get implemented, but I at the very least hope the internal nodes will 
be implemented some day,

Aren’t they already? Not working well though …

I'll have to double-check.  Last I recall they weren't available, but I 
haven't looked in that area for a while.



since compiler developers can then use them directly in the node pass for 
vectorisation and the like.  For one thing, once they are implemented for 
x86_64 (and maybe AArch64 too), I would like to see if I can adapt the uComplex 
unit to support vectorisation.

I would pretty much see the uComplex unit being autovectorized instead.


That was my intention... using uComplex and similar units as a test case 
to detect auto-vectorisation.


Kit

[fpc-devel] Kit's current work

2023-11-08 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I just thought I'd give a heads up on what I'm currently doing for the 
Free Pascal Compiler.


- Pure functions are still my main target.  There are a few sticking 
points that I'm trying to resolve, like handling certain internal 
functions and how to deal with out variables that get passed into nested 
pure functions (the sym lists don't permit easy optimisation in this case).


- I'm overhauling the CMOV generation code for x86 since it's quite a 
big convoluted mess.  Besides "outlining" the code, I want to refactor 
it so it's cleaner and more portable, as I intend to reuse it for the 
AArch64 peephole optimizer (using CSEL as opposed to CMOV).  I've found 
some new related optimisations as well but there are still some bugs I'm 
trying to fix.


- Speaking of AArch64, I'm developing more optimisations for it, both at 
the node level and the peephole level.


- I don't know what the eventual support for intrinsics will be for FPC, 
if it will ever get implemented, but I at the very least hope the 
internal nodes will be implemented some day, since compiler developers 
can then use them directly in the node pass for vectorisation and the 
like.  For one thing, once they are implemented for x86_64 (and maybe 
AArch64 too), I would like to see if I can adapt the uComplex unit to 
support vectorisation.


Kit


[fpc-devel] Optimisation question

2023-10-30 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I'm still exploring optimisations in generated x86 code, something which 
has become my speciality, and I found one new potential optimisation 
sequence that aims to reduce unnecessary calls to CMP and TEST when the 
result is already known.  However there are some situations where I'm 
not sure if the end result is better or not.  For example (taken from 
the cgiprotocol unit):


.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
.Lj9:
    movl    $-1,%eax
    testl    %esi,%esi
    cmovlel    %eax,%esi
    movl    %esi,%eax

In this instance, if the first TEST instruction results in "jng .Lj9" 
branching, it soon calls another TEST instruction with the same 
register, which will not have changed value, so CMOVLE will definitely 
set %esi to %eax (equal to -1) because "le" is equal to "ng" ("less than 
or equal" versus "not greater"), and then %esi is written back to %eax.  
All in all, this is a dependency chain of length 3.  When my new 
optimisation takes place, the following is generated instead:


.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
    testl    %esi,%esi
    jnle    .Lj12
.Lj9:
    movl    $-1,%esi
.Lj12:
    movl    %esi,%eax

Here, a number of other optimisations don't take place (hence why the 
CMOV instruction no longer exists), but now if "jng .Lj9" branches, it 
sets %esi to -1 directly before writing the result to %eax... a 
dependency chain of just 2.  However, there is also an extra conditional 
jump, which can no longer be optimised into CMOV because of the position 
of the .Lj9 label.


The question is... is the new code better, about the same or worse?  On 
most modern processors, TEST/Jcc can be macro-fused, but jumps always 
cause some penalty.


There are cases where the optimisation gives clear benefits - for 
example, in the cclasses unit - before:


    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
.Lj382:
    testq    %rcx,%rcx
    jne    .Lj383
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

After:

    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
    jmp    .Lj383
.Lj382:
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

In this case, if "je .Lj382" branches, then %rcx is definitely zero and 
so control flow can safely skip straight to the LEA instruction, since 
the "jne .Lj383" instruction will definitely not branch.  On the other 
hand, if "je .Lj382" does not branch, then %rcx is definitely non-zero 
and so the second "testq %rcx,%rcx" is once again deterministic, meaning 
"jne .Lj383" will definitely branch, therefore the second TEST 
instruction can be removed completely and the conditional jump turned 
into an unconditional jump.  The end result is that the number of jumps 
hasn't changed (exactly one jump will be taken regardless of the value 
of %rcx), but the instruction count has been reduced (note that %rdx 
can't be optimised out because its value is used as an actual parameter 
for the CALL).


Kit


Re: [fpc-devel] LEA instruction speed

2023-10-27 Thread J. Gareth Moreton via fpc-devel


I should have figured.  Thank you!

Kit

On 27/10/2023 01:51, Nikolay Nikolov via fpc-devel wrote:


On 10/11/23 11:21, Tomas Hajny via fpc-devel wrote:

On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you.  Would you be willing to share your modified test's
source? I was worried that if CPUID wasn't present it would cause a
SIGILL.


Sure, attached, but I didn't do anything special - I modified it in a 
way allowing easy disabling of this detection for x86 by disabling 
definition of a conditional symbol added to the source and I was 
prepared to recompile with the functionality disabled on the old AMD 
DX4 if needed. However, I didn't need to do so - the AMD DX4 machine 
simply ignored it and chose the branch used in case of missing 
support for the particular CPUID function. I have no idea if this 
might be due to some protection in OS/2 Warp 4 (used for compiling 
and running the test on that machine) potentially masking that 
exception, or what was the reason. Apparently, it should be possible 
to detect CPUID availability (albeit not 100% reliably), see 
https://wiki.osdev.org/CPUID, but I didn't use that.


There's CPUID support detection code in the Free Pascal RTL for i8086 
and i386. It's in unit cpu:


function cpuid_support: boolean;
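
A short usage sketch, assuming the function is only available on the i386 and i8086 targets as stated above. The conditional compilation guard and the messages are illustrative; other targets fall through to the else branch without referencing the cpu unit.

```pascal
program CpuidProbe;
{$mode objfpc}

{$if defined(CPUI386) or defined(CPUI8086)}
uses
  cpu;
{$endif}

begin
{$if defined(CPUI386) or defined(CPUI8086)}
  // Check for CPUID before running any CPUID-based detection code.
  if cpuid_support then
    WriteLn('CPUID is available')
  else
    WriteLn('CPUID is not available; skipping CPUID-based detection');
{$else}
  // Assumption: cpuid_support is not relied upon for this target.
  WriteLn('cpuid_support is not used on this target');
{$endif}
end.
```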

Nikolay



Tomas




On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps
me to make more informed choices on flags as well as confirming that
Agner Fog's instruction tables are correct. Also, results for older
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default
"Athlon" option lines up with this.  Of the Intel architectures, the
speed slows down on COREAVX onwards (COREI is fine), so I added a new
COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
the point where LEA is fast again (its 16-bit version is also fast,
unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan
provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.


Alright, fine (I modified your test to include the CPU name as well 
if possible and added an IFDEFed distinction of 32-bits versus 
64-bits):


32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
--------------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
--------------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call


32-bits:
CPU = AMD Athlon(tm) Processor
------------------------------
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call


32-bits:
(AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas



Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

It was a thought that crossed my mind when Stefan pointed out the 
translated Google Benchmark, but given that it hasn't yet been adapted 
to work outside of i386 and x86_64, you are right that it probably 
shouldn't be used for the time being.  The framework uses CPU timings to 
decide how many iterations to run, so obtaining such metrics is 
essential for its operation.  While I can probably get it to work for 
aarch64-linux, I don't know the first thing about polling CPUs on 
platforms I don't have access to!  It was a nice experiment in the meantime.


In regards to "blea" being in the test suite, I haven't yet put it into 
the normal test suite (using an include wrapper like I did with 
"tests/bench/bcase.pp") since it's primarily to evaluate LEA timings 
rather than testing compiler efficiency.  It's more of a 'utility' 
test.  The feedback from others has proven useful in determining the 
correctness of the new optimisation hint, which I intend to use to make 
the i386/x86_64 peephole optimizer smarter in regards to using LEA 
statements.


Kit

On 13/10/2023 16:36, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 17:08, J. Gareth Moreton via fpc-devel wrote:

Interesting!  That's a bug report to send to the maintainers of the
framework.  I'll need to have them fix it before I'd be willing to try
again with its use in FPC.

Removed the reference.  Apologies - I'm rushing a bit.


BTW, it's IMHO questionable whether a benchmark framework restricted 
to just a subset of targets supported for the given architecture 
should be really used within FPC source codes (if that's your 
potential intention anyway). If it was intended for the testsuite, it 
would immediately fail at compile time when checking the testsuite 
under some other target, because the test doesn't specify that its use 
should be restricted to certain targets (it's only restricted for i386 
and x86_64 regardless of the operating system - that's fine for the 
original version, but not for the benchmark framework).


Tomas

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel


This one's for you Stefan!

https://github.com/spring4d/benchmark/issues/4

Kit

On 13/10/2023 16:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error
occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation:
' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd.

Nevertheless, because of such teething problems with the framework,
I'm going to remove it from "blea" for now.  As it stands, please find
attached the test that appears in the merge request:
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


The attached version still contained reference to the framework.

The problem with 32-bit compilation of the framework was due to a 
missing stdcall calling convention in the 
GetLogicalProcessorInformation declaration.


Tomas
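
To make the calling-convention point concrete: on i386 Free Pascal
defaults to the register convention, whereas Windows API functions
expect stdcall, so an import declared without stdcall passes its
arguments in the wrong place on 32-bit targets. The sketch below shows
an explicit import with stdcall. The program name and the empty-buffer
probe are illustrative choices, not the declaration used by
Spring.Benchmark, which is not shown in this thread.

```pascal
program glpi_sketch;

{$MODE OBJFPC}

{$ifdef WINDOWS}
uses
  Windows;

{ Explicit import. On x86_64 there is only one calling convention, so
  stdcall is ignored there; on i386 it is what makes the call correct. }
function GetLogicalProcessorInformation(Buffer: Pointer;
  var ReturnedLength: DWORD): BOOL; stdcall;
  external 'kernel32.dll' name 'GetLogicalProcessorInformation';

var
  Needed: DWORD;
{$endif WINDOWS}

begin
{$ifdef WINDOWS}
  Needed := 0;
  { Probe with an empty buffer: the call is expected to fail and to
    report the buffer size it needs in Needed. }
  if not GetLogicalProcessorInformation(nil, Needed) then
    WriteLn('GetLastError = ', GetLastError, ', bytes needed = ', Needed)
  else
    WriteLn('Unexpected success with an empty buffer');
{$else}
  WriteLn('This sketch only applies to Windows targets.');
{$endif}
end.
```

With the convention declared, a failing call should report a meaningful
error code rather than whatever happens to be left in the registers.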

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

Interesting!  That's a bug report to send to the maintainers of the 
framework.  I'll need to have them fix it before I'd be willing to try 
again with its use in FPC.


Removed the reference.  Apologies - I'm rushing a bit.

Kit

On 13/10/2023 16:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error
occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation:
' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd.

Nevertheless, because of such teething problems with the framework,
I'm going to remove it from "blea" for now.  As it stands, please find
attached the test that appears in the merge request:
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


The attached version still contained reference to the framework.

The problem with 32-bit compilation of the framework was due to a 
missing stdcall calling convention in the 
GetLogicalProcessorInformation declaration.


Tomas
{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
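  MOV  EAX, $80000000        { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004        { brand string uses leaves $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002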
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): 
LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 100000);
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$ifdef CPUX86_64}
  Write('64-bit');
{$else CPUX86_64}
  Write('32-bit');
{$endif CPUX86_64}
  if FillBrandName then
    begin
      WriteLn(' CPU = ', CpuName);
      X := 0;
      while CpuName[X] <> #0 do
        begin
          CpuName[X] := '-';
          Inc(X);
        end;
      WriteLn('------', CpuName);
    end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 
1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 
1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 
1000);
  
  FailureCode := 0;

  if (Resul

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

GetLogicalProcessorInformation returns a Boolean - if false, an error 
occurred, and is handled as follows:


DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' 
+ GetLastError.ToString);


GetLastError = 8 indicates "out of memory", which I will say is odd.

Nevertheless, because of such teething problems with the framework, I'm 
going to remove it from "blea" for now.  As it stands, please find 
attached the test that appears in the merge request: 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


Kit

On 13/10/2023 08:34, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update.

 .
 .

The latest version of blea.pp doesn't compile with a 32-bit compiler -
line 76 contains an unconditional reference to R8 register, which
obviously doesn't exist in 32-bit mode.


BTW, the line shouldn't be necessary at all, because global variables 
should be initialized to 0 on program start anyway as far as I know.


When fixing the problem above, compiling to 32-bit mode and running 
it, the test fails with an error in GetLogicalProcessorInformation (it 
states "8" in place of the error information; I wonder if it isn't 
misinterpreted, because 8 is number of logical CPUs on the machine 
used for running the test).


Tomas
{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
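  MOV  EAX, $80000000        { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004        { brand string uses leaves $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002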
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): 
LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 100000);
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
    begin
      WriteLn('CPU = ', CpuName);
      X := 0;
      while CpuName[X] <> #0 do
        begin
          CpuName[X] := '-';
          Inc(X);
        end;
      WriteLn('------', CpuName);
    end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500,

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

Oops - that was a silly mistake of mine with R8.  As for the other 
error, that sounds like it's in the third party benchmark suite.  I'll 
do some investigating on my virtual machine.


In the meantime, here's the fixed test with the stray R8 call properly 
filtered out on i386 (it's replaced with "CPUName" on 32-bit).  I wasn't 
sure if global variables were initialised or not, hence me playing safe.


Kit

On 13/10/2023 08:34, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update.

 .
 .

The latest version of blea.pp doesn't compile with a 32-bit compiler -
line 76 contains an unconditional reference to R8 register, which
obviously doesn't exist in 32-bit mode.


BTW, the line shouldn't be necessary at all, because global variables 
should be initialized to 0 on program start anyway as far as I know.


When fixing the problem above, compiling to 32-bit mode and running 
it, the test fails with an error in GetLogicalProcessorInformation (it 
states "8" in place of the error information; I wonder if it isn't 
misinterpreted, because 8 is number of logical CPUs on the machine 
used for running the test).


Tomas
{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
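  MOV  EAX, $80000000        { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004        { brand string uses leaves $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002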
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X -2023406815] { -2023406815 = $87654321 }
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

const
  internal_reps = 1000;

procedure BM_Checksum_PAS(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_PAS(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_LEA(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_LEA(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_ADD(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_ADD(Z, X, internal_reps);
  end;
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$IFDEF CPUX86}
  WriteLn ('32 bits:');
{$ENDIF CPUX86}
{$IFDEF CPUX86_64}
  WriteLn ('64 bits:');
{$ENDIF CPUX86_64}
  if FillBrandName then
    begin
      WriteLn('CPU = ', CpuName);
      X := 0;
      while CpuName[X] <> #0 do
        begin
          CpuName[X] := '-';
          Inc(X

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel


So an update.

I've added Spring.Benchmark to "tests/bench/spring" on my local branch, 
along with its readme and licence file.  It seems to work quite well 
even if it feels a bit like overkill for this small a benchmark.  Still, 
I've attached the version with Stefan's translated Google Benchmark unit 
to see what people think.  A couple of things to note:


 * Time metrics are now in thousands of nanoseconds because the 1,000
   repetitions of the internal loop (used to drown out the overhead of
   the function call) are no longer divided out.
 * Requires the fcl-base, rtl-objpas and regexpr packages.

I also made a mistake with the compiler flags.  I had added 
CPUX86_HINT_FAST_3COMP_ADDR_16 to indicate that a LEA instruction with 
16-bit operands is fast, since the timing is often different to the 
32/64-bit versions.  However, under i386 and x86_64, the assembler 
doesn't accept 16-bit operands!  I have therefore removed it for i386 
and x86_64, although I left it in for i8086 (even though it probably 
won't be used) because the Pentium 4 has a slow 16-bit LEA instruction.


However, the proposed COREX CPU option now has the exact same flags as 
ZEN3.  Should it be removed, or kept for clarity and future expansion?


Kit

P.S. Sorry for the size of the ZIP.

[fpc-devel] 47k attachment

2023-10-12 Thread J. Gareth Moreton via fpc-devel


To whom it may concern,

I have a new message for the "LEA instruction speed" chain, but it is 
currently in holding as it contains a 47k ZIP file (source code only, 
and a third-party licence agreement).  Can the mailing list maintainer 
confirm (or deny) that it's okay?


Kit


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

The LEA and ADD times are close enough that I can consider them 
identical.  And Braswell (the architecture behind that brand of Celeron) 
doesn't support AVX, I don't think, so that lines up with COREI having a 
fast LEA instruction but not COREAVX.


Given the many different x86-compatible CPUs, I wonder if we need to 
document the best compiler parameters for end users in some way (e.g. so 
it can be coded in a device driver installer so the most optimised 
binary can be installed for a given CPU architecture).


Kit

On 11/10/2023 05:56, Christo Crause wrote:

On Tue, Oct 10, 2023 at 11:13 AM J. Gareth Moreton via fpc-devel wrote:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I
had reduced it to 10,000 because the higher count, coupled with the 1,000 iterations in
the subroutines themselves, would have led to 1,000,000,000 passes and
hence would take in the region of five to ten minutes to complete for a
16 MHz 386, for example.  Rika's suggestion of running as many
iterations as needed until, say, 5 seconds elapses, would help but the
timing measurements would cause a lot of latency and will be imprecise
on very slow routines.  Still, let's see if 100,000 gives better results
for you.

Kit
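
The time-bounded approach attributed to Rika above can be sketched
roughly as follows. To keep the cost of reading the clock from
dominating, the timer is polled once per batch of calls rather than
after every call. The names Work, BatchSize and TargetMs are
hypothetical and not taken from blea; the arithmetic in Work merely
resembles one step of the checksum loop.

```pascal
program timedbench;

{$MODE OBJFPC}

uses
  SysUtils;

const
  BatchSize = 1000;  { calls made between two clock reads (assumed value) }
  TargetMs  = 5000;  { stop once roughly 5 seconds have elapsed }

{ Stand-in workload for the routine being measured. }
function Work(Input: LongWord): LongWord;
begin
  Result := (Input + $87654321) xor 1;
end;

var
  StartTick, Elapsed, Calls: QWord;
  Acc: LongWord;
  I: Integer;
begin
  Acc := 500;
  Calls := 0;
  StartTick := GetTickCount64;
  repeat
    for I := 1 to BatchSize do
      Acc := Work(Acc);
    Inc(Calls, BatchSize);
    Elapsed := GetTickCount64 - StartTick;
  until Elapsed >= TargetMs;

  { Print the accumulator so the work cannot be optimised away. }
  WriteLn('Checksum: ', Acc);
  { milliseconds * 1e6 = nanoseconds }
  WriteLn((Elapsed * 1e6) / Calls:0:2, ' ns/call');
end.
```

The trade-off raised above still applies: on a very slow machine a
single batch may itself take a noticeable fraction of the time budget,
so the batch size would need tuning per target.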

Results on a modest CPU:

CPU =   Intel(R) Celeron(R) CPU  N3050  @ 1.60GHz
-------------------------------------------------
   Pascal control case: 6.71 ns/call
 Using LEA instruction: 2.09 ns/call
Using ADD instructions: 2.05 ns/call

32 bits:
   Pascal control case: 6.78 ns/call
 Using LEA instruction: 2.16 ns/call
Using ADD instructions: 2.09 ns/call

Results show a bit of variance, above numbers are more or less typical.

Christo



Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

Sweet, thank you.  Would you be willing to share your modified test's 
source? I was worried that if CPUID wasn't present it would cause a SIGILL.


Kit

On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps
me to make more informed choices on flags as well as confirming that
Agner Fog's instruction tables are correct. Also, results for older
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default
"Athlon" option lines up with this.  Of the Intel architectures, the
speed slows down on COREAVX onwards (COREI is fine), so I added a new
COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
the point where LEA is fast again (its 16-bit version is also fast,
unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan
provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.


Alright, fine (I modified your test to include the CPU name as well if 
possible and added an IFDEFed distinction of 32-bits versus 64-bits):


32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
--------------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
--------------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call


32-bits:
CPU = AMD Athlon(tm) Processor
------------------------------
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call


32-bits:
(AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas





On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:

Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), 
but I had reduced it to 10,000 because the higher count, coupled with the 1,000 
iterations in the subroutines themselves, would have led to 
1,000,000,000 passes and hence would take in the region of five to 
ten minutes to complete for a 16 MHz 386, for example.  Rika's 
suggestion of running as many iterations as needed until, say, 5 
seconds elapses, would help but the timing measurements would 
cause a lot of latency and will be imprecise on very slow 
routines.  Still, let's see if 100,000 gives better results for you.



I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):

   Pascal control case: 0.7 ns/call
 Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call


Indeed, it's much more consistent now, attached a new log for both 
32-bit and 64-bit versions from the Intel machine with Windows. 
Apparently, ADD is still somewhat faster on such "newer" Intel 
machines (at least if not considering the potential parallelism of 
LEA discussed previously). I can try this version on my AMD machines 
later tonight if considered useful - please, let me know which 
results would be relevant for you in that case (out of the ancient 
AMD DX4, only slightly less ancient AMD Athlon 1 GHz and the still 
rather reasonable AMD A9).


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

I'm all for receiving results for all kinds of processor, as it helps me 
to make more informed choices on flags as well as confirming that Agner 
Fog's instruction tables are correct. Also, results for older 
processors can be hard to come by sometimes.


Currently, most architectures have a fast LEA, and the default "Athlon" 
option lines up with this.  Of the Intel architectures, the speed slows 
down on COREAVX onwards (COREI is fine), so I added a new COREX (for 
10th generation Core) option between ZEN2 and ZEN3 to mark the point 
where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).


In the meantime I'll be looking at the benchmarking code that Stefan 
provided to see if it can and should be integrated.


Thanks again everyone for the results you're giving.

Kit

P.S. In regards to parallelisation in having LEA instructions running 
alongside other arithmetic/logical operations, that will be an 
interesting field of research.  At the very least, the post-peephole 
stage can change ADD or SUB into a LEA if using an AGU over an ALU 
appears to give a micro-optimisation.  It also benefits hyperthreading, 
as the ALUs tend to be very heavily used, while AGUs tend to be used one 
at a time.
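
As a small worked check of why an ADD pair can be folded into one LEA,
the sketch below reproduces the arithmetic in plain Pascal: two
dependent additions give the same 32-bit result as a single
base + index + displacement sum. The values 500 and 1000 match the
inputs used by blea; the variable names are chosen for this example
only, and the sketch says nothing about timing.

```pascal
program lea_equiv;

{$MODE OBJFPC}
{$Q-}{$R-}  { 32-bit wrap-around arithmetic is intended }

var
  Input, X, ViaAdd, ViaLea: LongWord;
  Disp: LongInt;
begin
  Input := 500;
  X := 1000;

  { Two dependent ALU additions, as in Checksum_ADD }
  ViaAdd := Input + $87654321;
  ViaAdd := ViaAdd + X;

  { One three-component address: base + index + signed 32-bit
    displacement, as in Checksum_LEA }
  Disp := -2023406815;  { same bit pattern as $87654321 }
  ViaLea := Input + X + LongWord(Disp);

  WriteLn('ADD/ADD: ', ViaAdd);
  WriteLn('LEA:     ', ViaLea);
  WriteLn('Equal:   ', ViaAdd = ViaLea);
end.
```

Because the displacement of a LEA is a signed 32-bit field, the
constant has to be written as its negative two's-complement equivalent,
which is why the benchmark source uses -2023406815.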


On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:

Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but 
I had reduced it to 10,000 because the higher count, coupled with the 1,000 
iterations in the subroutines themselves, would have led to 
1,000,000,000 passes and hence would take in the region of five to 
ten minutes to complete for a 16 MHz 386, for example.  Rika's 
suggestion of running as many iterations as needed until, say, 5 
seconds elapses, would help but the timing measurements would cause 
a lot of latency and will be imprecise on very slow routines.  
Still, let's see if 100,000 gives better results for you.



I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):

   Pascal control case: 0.7 ns/call
 Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call


Indeed, it's much more consistent now, attached a new log for both 
32-bit and 64-bit versions from the Intel machine with Windows. 
Apparently, ADD is still somewhat faster on such "newer" Intel 
machines (at least if not considering the potential parallelism of LEA 
discussed previously). I can try this version on my AMD machines later 
tonight if considered useful - please, let me know which results would 
be relevant for you in that case (out of the ancient AMD DX4, only 
slightly less ancient AMD Athlon 1 GHz and the still rather reasonable 
AMD A9).


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel


Ooo, that might be just what we need.  Thank you Stefan.

Kit

On 10/10/2023 10:57, Stefan Glienke via fpc-devel wrote:

Be my guest making https://github.com/spring4d/benchmark compatible for all 
platforms you need it for.


On 10/10/2023 11:13 CEST J. Gareth Moreton via fpc-devel wrote:

  
Thanks Tomas,


Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I
had reduced it to 10,000 because, coupled with the 1,000 iterations in
the subroutines themselves, would have led to 1,000,000,000 passes and
hence would take in the region of five to ten minutes to complete for a
16 MHz 386, for example.  Rika's suggestion of running as many
iterations as needed until, say, 5 seconds elapses, would help but the
timing measurements would cause a lot of latency and will be imprecise
on very slow routines.  Still, let's see if 100,000 gives better results
for you.

Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors. I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.

I don't know what's broken, but the results vary so much on a fast
machine that they are unusable for any measurement from my point of
view (standard 3.2.2 compiler, compiled with -O4 and running under MS
Windows this time). Sometimes the ADD version shows 0.0 ns/call,
sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call
(64-bits). See the attached results (the CPU is only displayed for the
64-bit compilation, but it's obviously the same CPU).

Tomas



On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:

Thank you very much!  That processor is built on the Excavator
architecture and lines up with the flag I put in the merge request
(i.e. it has the "fast LEA" hint).

I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the
CPU automatically.

On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
    Pascal control case: 5.1 ns/call
  Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

Looking at the text log, the results are a bit strange and I can't 
easily explain them.  Normally a system interrupt would increase the 
time taken.


Let me know if increasing the iteration count fixes it or not.

Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors. I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.


I don't know what's broken, but the results vary so much on a fast 
machine that they are unusable for any measurement from my point of 
view (standard 3.2.2 compiler, compiled with -O4 and running under MS 
Windows this time). Sometimes the ADD version shows 0.0 ns/call, 
sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call 
(64-bits). See the attached results (the CPU is only displayed for the 
64-bit compilation, but it's obviously the same CPU).


Tomas




On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the 
CPU automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel


Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I 
had reduced it to 10,000 because the higher count, coupled with the 
1,000 iterations in the subroutines themselves, would have led to 
1,000,000,000 passes and hence would take in the region of five to ten 
minutes to complete on a 16 MHz 386, for example.  Rika's suggestion of 
running as many iterations as needed until, say, 5 seconds have elapsed 
would help, but the timing measurements would add a lot of latency and 
would still be imprecise on very slow routines.  Still, let's see if 
100,000 gives better results for you.
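
For illustration, the time-boxed approach mentioned above could look 
roughly like the sketch below.  This is not the attached test: the names 
TimedBenchmark, BatchSize and BudgetMs are hypothetical, and it assumes 
that GetTickCount64 from SysUtils is precise enough for the chosen time 
budget.  The clock is only read once per batch of calls, so the 
measurement itself adds little to the timed loop.

```pascal
program timeboxed;

{$MODE OBJFPC}
{$Q-}{$R-}  { the checksum relies on wrap-around LongWord arithmetic }

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

{ Same arithmetic as the Pascal control case in the attached test }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while Counter > 0 do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

{ Hypothetical helper: keeps calling Proc until at least BudgetMs
  milliseconds have elapsed, reading the clock once per batch only. }
function TimedBenchmark(const Name: string; Proc: TBenchmarkProc;
  Z, X: LongWord; BudgetMs: QWord): LongWord;
const
  InternalReps = 1000;  { inner loop count, as in the original test }
  BatchSize    = 1000;  { assumed number of calls between clock reads }
var
  StartTick, Elapsed, Calls: QWord;
  I: Integer;
begin
  Result := Z;
  Calls := 0;
  StartTick := GetTickCount64;
  repeat
    for I := 1 to BatchSize do
      Result := Proc(Result, X, InternalReps);
    Inc(Calls, BatchSize);
    Elapsed := GetTickCount64 - StartTick;
  until Elapsed >= BudgetMs;
  { milliseconds -> nanoseconds, divided by the total inner iterations }
  WriteLn(Name, ': ', (Elapsed * 1e6) / (Calls * InternalReps):0:2,
    ' ns/call (', Calls, ' outer calls)');
end;

begin
  TimedBenchmark('Pascal control case', @Checksum_PAS, 500, 1000, 500);
end.
```

One consequence of this design is that the number of calls depends on the 
machine, so the returned checksum is no longer comparable between the 
three variants; a correctness check would need a separate pass with a 
fixed iteration count.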


Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors. I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.


I don't know what's broken, but the results vary so much on a fast 
machine that they are unusable for any measurement from my point of 
view (standard 3.2.2 compiler, compiled with -O4 and running under MS 
Windows this time). Sometimes the ADD version shows 0.0 ns/call, 
sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call 
(64-bits). See the attached results (the CPU is only displayed for the 
64-bit compilation, but it's obviously the same CPU).


Tomas




On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the 
CPU automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX
  MOV  EAX, $80000000       { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004       { brand string needs leaves $80000002..$80000004 }
  JB   @Unavailable
  LEA  R8,  [RIP + CPUName]
  MOV  EAX, $80000002
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0  { null terminator }
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
  POP  RBX
end;
{$else CPUX86_64}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): 
LongWord;
const
  internal_re

Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel

I updated the "blea" test in the merge request so it now displays the 
processor brand name on x86_64; however, it is not fetched under i386 
because CPUID was not introduced until later 486 processors.  I've 
attached it to this e-mail if anyone wants to take a look to ensure I 
haven't broken something.
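
As an aside on the i386 case: the usual way to find out at run time 
whether CPUID exists is to try toggling the ID flag (bit 21) of EFLAGS.  
The sketch below is not part of the attached test; HasCPUID is a 
hypothetical name, and the code assumes FPC's default register calling 
convention on i386 (Boolean result in AL, ECX free to clobber).

```pascal
program cpuidcheck;

{$MODE OBJFPC}
{$ASMMODE Intel}

{$if defined(CPUI386)}
{ Returns True if bit 21 (ID) of EFLAGS can be toggled, which
  indicates that the CPUID instruction is available. }
function HasCPUID: Boolean; assembler; nostackframe;
asm
  PUSHFD
  POP  EAX              { EAX := original EFLAGS }
  MOV  ECX, EAX         { keep a copy to compare against and restore }
  XOR  EAX, $200000     { flip the ID bit }
  PUSH EAX
  POPFD                 { try to write the modified flags back }
  PUSHFD
  POP  EAX              { read EFLAGS again }
  PUSH ECX
  POPFD                 { restore the original EFLAGS }
  XOR  EAX, ECX         { bit 21 is set only if the flip stuck }
  SHR  EAX, 21
  AND  EAX, 1           { AL = 1 if CPUID is available }
end;
{$elseif defined(CPUX86_64)}
function HasCPUID: Boolean;
begin
  Result := True;  { CPUID is always present on x86_64 }
end;
{$else}
  {$FATAL This sketch requires an x86 target }
{$endif}

begin
  if HasCPUID then
    WriteLn('CPUID is available')
  else
    WriteLn('CPUID is not available; skipping the brand name');
end.
```

With a check like this in front, the brand-name query could in principle 
be attempted on 32-bit targets as well, at the cost of a second 
assembler routine to maintain.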


Kit

On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU 
automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX
  MOV  EAX, $80000000       { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004       { brand string needs leaves $80000002..$80000004 }
  JB   @Unavailable
  LEA  R8,  [RIP + CPUName]
  MOV  EAX, $80000002
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0  { null terminator }
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
  POP  RBX
end;
{$else CPUX86_64}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10000); { 10,000 outer repetitions, as described in the thread }
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
    begin
      WriteLn('CPU = ', CpuName);
      X := 0;
      while CpuName[X] <> #0 do
        begin
          CpuName[X] := '-';
          Inc(X);
        end;
      WriteLn('--', CpuName);
    end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.

Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel

Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request (i.e. 
it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU 
automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$


Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel


Thank you for the report.

According to Agner Fog's table, complex LEA instructions should have a 
3-cycle latency on that architecture (Haswell). Optimisations with this 
instruction are proving interesting because there's such a variety 
between processor architectures. There are some that are fine with 3 
components, but slow right down if a scale factor is used.


Kit

On 09/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi Gareth

model name : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

Regards

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sun, Oct 8, 2023 at 6:40 PM J. Gareth Moreton via fpc-devel 
 wrote:


Hi Nataraj

Which processor is that run on? (although too close to call, it
implies LEA has a latency of 2 in that case)

Kit

On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning:
/usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o:
missing .note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and
will be removed in a future version of the linker
121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
   Pascal control case: 6.7 ns/call
 Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call


Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel
 wrote:

That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

    Pascal control case: 2.0 ns/call
  Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
> On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>
>
> Hi Kit,
>
>> Do you think this should suffice? Originally it ran for 1,000,000
>> repetitions but I fear that will take way too long on a 486, so I
>> reduced it to 10,000.
>
> OK, I tried it now. First of all, after turning on the old machine, I
> realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad
> memory. :-( I compiled and ran the test under OS/2 there (I was too
> lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any
> substantial difference. The ADD and LEA results were basically the
> same there, both around 100 ns / call. The Pascal result was around
> twice as long. Interestingly, the Pascal result for FPC 3.2.2 was
> around 10% longer than the same source compiled with FPC 2.0.3 (the
> assembler versions were obviously the same for both FPC versions; I
> tried compiling it also with FPC 1.0.10 and the assembler versions
> were more than three times slower due to missing support for the
> nostackframe directive).
>
> I tested it under the AMD Athlon 1 GHz machine as well and again, the
> results for LEA and ADD are basically equal (both 3.1 ns/call) and the
> result for Pascal slightly more than twice (7.3 ns/call). However,
> rather surprisingly for me, the overall test run was _much_ longer
> there?! Finally, I tried compiling the test on a 64-bit machine (AMD
> A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from
> a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the
> assembler version runs forever - well, certainly much longer than my
> patience lasts. I haven't tried to analyze the reasons, but that's
> what I get.
>
> Tomas
>
>
>
>>
>> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel"

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel

Did some checking of the test I copied the code from, and I forgot that 
Rika's original code only exited once a certain time period had elapsed 
(e.g. 0.5 seconds).  I had changed it to a standard iteration count 
since I was concerned about fairness and accuracy, but I only changed 
the loop condition and nothing else.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to 
the execution time per call.

2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, linux 
64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; 
assembler; nostackframe;

asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around LEA without the results changing, but only 2 around 
ADD.



Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel

In the meantime, here's the merge request for the feature, based on user 
tests and a study of Agner Fog's instruction tables: 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


Kit


Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel


Hi Nataraj

Which processor was that run on?  (Although it's too close to call, it 
implies LEA has a latency of 2 in that case.)


Kit

On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning: 
/usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing 
.note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be 
removed in a future version of the linker

121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
   Pascal control case: 6.7 ns/call
 Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call


Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel 
 wrote:


That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

    Pascal control case: 2.0 ns/call
  Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
> On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>
>
> Hi Kit,
>
>> Do you think this should suffice? Originally it ran for 1,000,000
>> repetitions but I fear that will take way too long on a 486, so I
>> reduced it to 10,000.
>
> OK, I tried it now. First of all, after turning on the old machine, I
> realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad
> memory. :-( I compiled and ran the test under OS/2 there (I was too
> lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any
> substantial difference. The ADD and LEA results were basically the
> same there, both around 100 ns / call. The Pascal result was around
> twice as long. Interestingly, the Pascal result for FPC 3.2.2 was
> around 10% longer than the same source compiled with FPC 2.0.3 (the
> assembler versions were obviously the same for both FPC versions; I
> tried compiling it also with FPC 1.0.10 and the assembler versions
> were more than three times slower due to missing support for the
> nostackframe directive).
>
> I tested it under the AMD Athlon 1 GHz machine as well and again, the
> results for LEA and ADD are basically equal (both 3.1 ns/call) and the
> result for Pascal slightly more than twice (7.3 ns/call). However,
> rather surprisingly for me, the overall test run was _much_ longer
> there?! Finally, I tried compiling the test on a 64-bit machine (AMD
> A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from
> a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the
> assembler version runs forever - well, certainly much longer than my
> patience lasts. I haven't tried to analyze the reasons, but that's
> what I get.
>
> Tomas
>
>
>
>>
>> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel"
>>>  wrote:
>>>
>>>
>>> Hi Kit,
>>>
>>>> This is mainly to Florian, but also to anyone else who can answer
>>>> the question - at which point did a complex LEA instruction (using
>>>> all three input operands and some other specific circumstances) get
>>>> slow? Preliminary research suggests the 486 was when it gained
>>>> extra latency, and then Sandy Bridge when it got particularly bad.
>>>> Ice Lake seems to be the architecture where faster LEA instructions
>>>> are reintroduced, but I'm not sure about AMD processors.
>>> I cannot answer your question, but if you prepare a test program, I
>>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines
>>> if it helps you in any way (at least I hope the 486 DX2 machine
>>> should be still able to start ;-) ).
>>>
>>> Tomas
>>>

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel

Sorry, ignore last attachment - I forgot to change a line of assembly 
(it was correct for x86_64-win64!!). Here is the corrected version.


Kit

On 08/10/2023 12:38, J. Gareth Moreton via fpc-devel wrote:
Sorry, I got careless and was in a rush, as both the Pascal code is 
wrong and I didn't store the result of the benchmark test, hence the 
error check at the end returned a false negative.


The benchmark code was from Rika's SHA-1 test code, which I didn't 
properly check, although I assumed the logic was to avoid counting the 
time of the internal loop as much as possible.  I should have gone 
with my gut instinct and realised that wasn't the best method.


I've attached the updated test (now called "blea" as it's a benchmark 
test) with your suggestions implemented, and an improved benchmarking 
system.  I'm not used to specifying parameters in place of registers - 
I'm too used to needing total control!


Your results from experiments with adding additional ADD instructions 
is expected, as LEA uses an AGU for computation, leaving the ALUs free 
for other tasks (like ADD), so LEA is better even if speed is equal.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to 
the execution time per call.

2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, 
linux 64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; 
assembler; nostackframe;

asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around LEA without the results changing, but only 2 
around ADD.


{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10000); { 10,000 outer repetitions, as described in the thread }
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel

Sorry, I got careless and was in a rush: the Pascal code is wrong, and 
I didn't store the result of the benchmark test, hence the error check 
at the end returned a false negative.


The benchmark code was from Rika's SHA-1 test code, which I didn't 
properly check, although I assumed the logic was to avoid counting the 
time of the internal loop as much as possible.  I should have gone with 
my gut instinct and realised that wasn't the best method.


I've attached the updated test (now called "blea" as it's a benchmark 
test) with your suggestions implemented, and an improved benchmarking 
system.  I'm not used to specifying parameters in place of registers - 
I'm too used to needing total control!
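
As a rough illustration of the timing pattern discussed here (read the clock only outside the measured loop, and keep the returned checksum so the work cannot be skipped), a minimal sketch follows. The names Work and TimeIt and the repetition counts are assumptions for illustration and are not taken from the attached test; GetTickCount64 only has millisecond resolution, so the outer repetition count has to be large enough to give a measurable interval.

```pascal
program timingsketch;
{$MODE OBJFPC}

uses
  SysUtils;

type
  TWorkFunc = function(Input, X, Y: LongWord): LongWord;

{$PUSH}{$Q-}{$R-}
{ Hypothetical workload standing in for the Checksum_* routines. }
function Work(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
    begin
      { 32-bit wrap-around addition followed by XOR with the counter }
      Result := LongWord(Result + X + $87654321) xor Y;
      Dec(Y);
    end;
end;
{$POP}

{ Returns nanoseconds per inner iteration; the final checksum goes to Sum. }
function TimeIt(Func: TWorkFunc; Reps, InnerReps: LongWord; out Sum: LongWord): Double;
var
  Start, Elapsed: QWord;
  I: LongWord;
begin
  Sum := 500;
  Start := GetTickCount64;            { clock read once, before the loop }
  for I := 1 to Reps do
    Sum := Func(Sum, 1000, InnerReps);
  Elapsed := GetTickCount64 - Start;  { and once after it }
  { milliseconds -> nanoseconds, then divide by the total iteration count }
  Result := Elapsed * 1e6 / Reps / InnerReps;
end;

var
  Sum: LongWord;
  Ns: Double;
begin
  Ns := TimeIt(@Work, 10000, 1000, Sum);
  WriteLn('ns/iteration: ', Ns:0:2, '  checksum: ', Sum);
end.
```

Printing the checksum alongside the timing also makes it easy to compare results between variants.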


Your results from experiments with adding additional ADD instructions are 
expected, as LEA uses an AGU for computation, leaving the ALUs free for 
other tasks (like ADD), so LEA is better even if speed is equal.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to the 
execution time per call.

2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, linux 
64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; 
assembler; nostackframe;

asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around LEA without the results changing, but only 2 around 
ADD.


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  { Use the parameter name: a hard-coded ECX only holds Input under the Win64 ABI }
  MOV EAX, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread J. Gareth Moreton via fpc-devel

I'm still slightly curious, but if full optimisations make better code, 
then indeed it's probably not worth the effort.


Your timings are incredibly helpful - thank you!  If I understand, AMD 
A9 is the Excavator architecture, which implies that AMD processors 
don't suffer from the same latency with complex LEA instructions as 
Intel processors do.


Looking at Agner Fog's tables, it looks like the slow LEA instructions 
only came about at Sandy Bridge, which for Free Pascal I think lines up 
with COREAVX.  Even the Pentium-era processors have a 1-cycle LEA, and 
your testing on an AMD 486 shows it is at least as fast as two ADDs in a 
dependency chain. That should be all the information I need - thanks again!


Kit

On 07/10/2023 19:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:

That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call


OK. My results for the AMD A9 CPU mentioned previously and 32-bit 
trunk compiler (Linux) are:


   Pascal control case: 2.3 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 1.5 ns/call


The same machine, the same operating environment, but a 64-bit trunk 
compiler:


   Pascal control case: 3.6 ns/call
 Using LEA instruction: 0.9 ns/call
Using ADD instructions: 1.3 ns/call


I tried compiling and running the test with all of FPC 2.0.4, 2.2.4, 
2.4.4, 2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and realized that 
all results (for both the assembler and Pascal versions) compiled with 
anything older than 3.2.2 are an order of magnitude faster than with 
3.2.2 (i.e. less than 1 ns/call for the older versions compared to 8 
ns/call with Pascal / 4 ns/call with assembler versions). This means 
that the comparison is obviously spoiled with something unrelated. 
Moreover, I noticed that when compiling with the highest level of 
optimizations, the Pascal version compiled for i386 is as fast as or even 
a little bit faster than the assembler version. I didn't do that 
previously, thus the longer time for the older compiler version 
probably isn't relevant. From this point of view, it probably doesn't 
make sense to spend time on comparing the generated code.


Tomas




On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, 
I realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad 
memory. :-( I compiled and ran the test under OS/2 there (I was too 
lazy to boot it to DOS ;-) ), but I assume that it shouldn't make 
any substantial difference. The ADD and LEA results were basically 
the same there, both around 100 ns / call. The Pascal result was 
around twice as long. Interestingly, the Pascal result for FPC 3.2.2 
was around 10% longer than the same source compiled with FPC 2.0.3 
(the assembler versions were obviously the same for both FPC 
versions; I tried compiling it also with FPC 1.0.10 and the 
assembler versions were more than three times slower due to missing 
support for the nostackframe directive).


I tested it under the AMD Athlon 1 GHz machine as well and again, 
the results for LEA and ADD are basically equal (both 3.1 ns/call) 
and the result for Pascal slightly more than twice (7.3 ns/call). 
However, rather surprisingly for me, the overall test run was _much_ 
longer there?! Finally, I tried compiling the test on a 64-bit 
machine (AMD A9-9425) with Linux (compiled for 64-bits with FPC 
3.2.3 compiled from a fresh 3.2 branch). The Pascal version shows 
about 4 ns/call, but the assembler version runs forever - well, 
certainly much longer than my patience lasts. I haven't tried to 
analyze the reasons, but that's what I get.


Tomas





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via 
fpc-devel"  wrote:



Hi Kit,

This is mainly to Florian, but also to anyone else who can answer 
the question - at which point did a complex LEA instruction 
(using all three input operands and some other specific 
circumstances) get slow? Preliminary research suggests the 486 
was when it gained extra latency, and then Sandy Bridge when it 
got particularly bad.  Icy Lake seems to be the architecture 
where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.
I cannot answer your question, but if you prep

Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread J. Gareth Moreton via fpc-devel

That's interesting; I am interested to see the assembly output for the 
Pascal control cases.  As for the 64-bit version, that was my fault 
since the assembly language is for Microsoft's ABI rather than the 
System V ABI, so it was checking a register with an undefined value.  
Find attached the fixed test.


Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, I 
realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad 
memory. :-( I compiled and ran the test under OS/2 there (I was too 
lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any 
substantial difference. The ADD and LEA results were basically the 
same there, both around 100 ns / call. The Pascal result was around 
twice as long. Interestingly, the Pascal result for FPC 3.2.2 was 
around 10% longer than the same source compiled with FPC 2.0.3 (the 
assembler versions were obviously the same for both FPC versions; I 
tried compiling it also with FPC 1.0.10 and the assembler versions 
were more than three times slower due to missing support for the 
nostackframe directive).


I tested it under the AMD Athlon 1 GHz machine as well and again, the 
results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
result for Pascal slightly more than twice (7.3 ns/call). However, 
rather surprisingly for me, the overall test run was _much_ longer 
there?! Finally, I tried compiling the test on a 64-bit machine (AMD 
A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from 
a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the 
assembler version runs forever - well, certainly much longer than my 
patience lasts. I haven't tried to analyze the reasons, but that's 
what I get.


Tomas





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
 wrote:



Hi Kit,

This is mainly to Florian, but also to anyone else who can answer 
the question - at which point did a complex LEA instruction (using 
all three input operands and some other specific circumstances) get 
slow? Preliminary research suggests the 486 was when it gained 
extra latency, and then Sandy Bridge when it got particularly bad.  
Icy Lake seems to be the architecture where faster LEA instructions 
are reintroduced, but I'm not sure about AMD processors.
I cannot answer your question, but if you prepare a test program, I 
can run it on an Intel 486 DX2 100 Mhz and AMD Athlon 1 GHz machines 
if it helps you in any way (at least I hope the 486 DX2 machine 
should be still able to start ;-) ).


Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


program leatest;
{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
  {$else MSWINDOWS}
  ADD EDI, $87654321
  ADD EDI, ESI
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop1
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
  {$else MSWINDOWS}
  LEA EDI, [EDI + ESI + $87654321]
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop2
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const

Re: [fpc-devel] LEA instruction speed

2023-10-06 Thread J. Gareth Moreton via fpc-devel


Hi Tomas,

Do you think this should suffice? Originally it ran for 1,000,000 
repetitions but I fear that will take way too long on a 486, so I 
reduced it to 10,000.


Kit

On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:

On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
 wrote:


Hi Kit,


This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?  Preliminary research suggests 
the 486 was when it gained extra latency, and then Sandy Bridge when it got 
particularly bad.  Icy Lake seems to be the architecture where faster LEA 
instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it 
on an Intel 486 DX2 100 Mhz and AMD Athlon 1 GHz machines if it helps you in 
any way (at least I hope the 486 DX2 machine should be still able to start ;-) 
).

Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
program leatest;
{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    proc(Result, X, internal_reps);
    time := (Now - start) * SecsPerDay;
  until (reps >= 1);
  time := time / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel

What should I call a new sub-CPU option?  Should it be "ICELAKE" or is 
there a better name like "CORE10" or "COREX" (X being the Roman numeral 
for 10, standing in for the 10th generation of Intel Core)?


Kit

On 03/10/2023 08:02, Florian Klämpfl via fpc-devel wrote:



On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?

Maybe check Agner’s list?


Preliminary research suggests the 486 was when it gained extra latency, and 
then Sandy Bridge when it got particularly bad.  Icy Lake seems to be the 
architecture where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.

Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better 
name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel

I don't think any of them currently fit, although Zen 3 is later than 
Ice Lake, but I'm not sure if it has a faster LEA or not. I'll do some 
investigation.  I'll take up Tomas' offer on the 486 test though.  
Personally I think the best test might actually be one of the 
recently-optimised cryptographic functions.


Kit

On 03/10/2023 08:02, Florian Klämpfl via fpc-devel wrote:



On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?

Maybe check Agner’s list?


Preliminary research suggests the 486 was when it gained extra latency, and 
then Sandy Bridge when it got particularly bad.  Icy Lake seems to be the 
architecture where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.

Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better 
name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel


Hmmm, could be fun to attempt to test - I'll see what I can set up.

Kit

On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:

On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
 wrote:


Hi Kit,


This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?  Preliminary research suggests 
the 486 was when it gained extra latency, and then Sandy Bridge when it got 
particularly bad.  Icy Lake seems to be the architecture where faster LEA 
instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it 
on an Intel 486 DX2 100 Mhz and AMD Athlon 1 GHz machines if it helps you in 
any way (at least I hope the 486 DX2 machine should be still able to start ;-) 
).

Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] LEA instruction speed

2023-10-02 Thread J. Gareth Moreton via fpc-devel


(And I meant "Ice Lake", not "Icy Lake")

On 03/10/2023 02:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the 
question - at which point did a complex LEA instruction (using all 
three input operands and some other specific circumstances) get slow?  
Preliminary research suggests the 486 was when it gained extra 
latency, and then Sandy Bridge when it got particularly bad.  Icy Lake 
seems to be the architecture where faster LEA instructions are 
reintroduced, but I'm not sure about AMD processors.


Should I introduce a new x86 subprocessor named "ICYLAKE" or is there 
a better name or does it fall under one of our categories already 
(CORE_AVX2 or ZEN3)?


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



[fpc-devel] LEA instruction speed

2023-10-02 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the 
question - at which point did a complex LEA instruction (using all three 
input operands and some other specific circumstances) get slow?  
Preliminary research suggests the 486 was when it gained extra latency, 
and then Sandy Bridge when it got particularly bad.  Icy Lake seems to 
be the architecture where faster LEA instructions are reintroduced, but 
I'm not sure about AMD processors.


Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a 
better name or does it fall under one of our categories already 
(CORE_AVX2 or ZEN3)?


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] A call to help test pure functions

2023-10-02 Thread J. Gareth Moreton via fpc-devel

As an additional note - apologies to those who responded to me directly, 
but for some reason, GMail doesn't like e-mails coming from my domain 
name, so I have to use my own GMail account, watercran...@gmail.com to 
respond.


Kit

On 02/10/2023 18:21, J. Gareth Moreton via fpc-devel wrote:
No mode switch - just apply the changes from my branch over at 
https://gitlab.com/CuriousKit/optimisations/-/tree/pure?ref_type=heads. 
To mark a function as pure, append the directive "pure" after the 
function definition, like you would with "virtual" or "inline", say.


Given it's a Free Pascal construct, it probably should be disabled in 
Delphi mode etc, but currently it isn't.


Kit

On 02/10/2023 12:43, Mattias Gaertner via fpc-devel wrote:



On 29.09.23 21:28, J. Gareth Moreton via fpc-devel wrote:

[...]  As the examples imply, to mark a function as pure, simply
use the new "pure" directive.


When is it available? Is there a modeswitch?

Mattias
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] A call to help test pure functions

2023-10-02 Thread J. Gareth Moreton via fpc-devel

No mode switch - just apply the changes from my branch over at 
https://gitlab.com/CuriousKit/optimisations/-/tree/pure?ref_type=heads. 
To mark a function as pure, append the directive "pure" after the 
function definition, like you would with "virtual" or "inline", say.


Given it's a Free Pascal construct, it probably should be disabled in 
Delphi mode etc, but currently it isn't.


Kit

On 02/10/2023 12:43, Mattias Gaertner via fpc-devel wrote:



On 29.09.23 21:28, J. Gareth Moreton via fpc-devel wrote:

[...]  As the examples imply, to mark a function as pure, simply
use the new "pure" directive.


When is it available? Is there a modeswitch?

Mattias
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



[fpc-devel] A call to help test pure functions

2023-09-29 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

This has been something that's been in development for some time now, 
and I invite other Free Pascal users and developers to test the new 
feature... pure functions.  Its closest equivalent in C++ would be 
"constexpr".


A pure function has no side-effects - it doesn't affect the machine 
state outside of its scope.  More broadly, it is analogous to a 
mathematical function, so for a given input(s), will always return the 
same output.  The idea is that if a function is marked as pure, and 
confirmed to be by the compiler, then in certain situations, its result 
can be computed at compile time and the call replaced with the 
calculated result as a literal.  I've attached two examples showing pure 
functions in action, one that unrolls a for-loop, and one that handles 
recursion.  As the examples imply, to mark a function as pure, simply 
use the new "pure" directive.  By default, subroutines are not 
considered pure because, while the compiler would easily be able to 
determine if a function is actually pure or not, this would cause 
significant slowdown to the compilation process.


You can access my repository here: 
https://gitlab.com/CuriousKit/optimisations/-/tree/pure?ref_type=heads


Note that pure functions are subtly different to inline functions, and 
while there is a lot of overlap, they have different use cases.  Some 
pure functions can theoretically be extremely complex and would not be 
something you'd want to inline, but are guaranteed to produce a 
deterministic result for a given input.  A theoretical example would be 
a hash function (although my pure functions do not support pointers 
because the data they point to is not deterministic until you 
dereference it). Similarly, a small inline function that accesses a 
global variable (e.g. to read or modify a reference count) cannot be a 
pure function because the compiler has no way of knowing what that 
variable will be set to at compile time.
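
To make the distinction concrete, here is a minimal sketch that compiles with a stock compiler. The names SumTo, NextRef and RefCount are hypothetical, and the "pure" directive itself is left out of the code because it is only available on the experimental branch; the comments mark where it would go.

```pascal
program puresketch;
{$MODE OBJFPC}

var
  RefCount: Integer = 0;  { global state whose value is unknown at compile time }

{ Candidate for "pure": the result depends only on N and nothing outside
  the function is touched.  On the experimental branch the directive would
  be appended after the header, e.g. "...: Cardinal; pure;". }
function SumTo(N: Cardinal): Cardinal;
var
  I: Cardinal;
begin
  Result := 0;
  for I := 1 to N do
    Result := Result + I;
end;

{ Fine to inline, but not a candidate for "pure": it reads and modifies
  a global variable, so two identical calls return different results. }
function NextRef: Integer; inline;
begin
  Inc(RefCount);
  Result := RefCount;
end;

begin
  WriteLn(SumTo(10));  { prints 55; a pure call like this could be folded to a literal }
  WriteLn(NextRef);    { prints 1 }
  WriteLn(NextRef);    { prints 2: same call, different result }
end.
```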


Currently there's no support to assign the result of a pure function to 
a constant, although I intend to find a means to support it one day.  
Additionally, procedures with "out" parameters unfortunately tend to not 
be evaluated correctly and will error out (the compiler might consider 
them impure even though they are actually pure).  If the "out" parameter is 
passed into another function through a call, even if it's to another 
pure function or itself (recursion), the parameter is considered 
"escaped" and so can't be optimised.  So far I haven't worked out how to 
get around this because the load nodes are marked as "address taken" as 
well as the definition of the formal parameter itself.  Temporarily 
modifying these definitions during pure function analysis is somewhat 
dangerous and error-prone.  I'll solve this eventually though.


One of my current goals is to make "IntToStr", which calls "Str" 
internally, a pure function.  One might question the point of this, 
since who in their right mind would write something like "Output := 
IntToStr(5);"?  The reason is that another pure function (one that may 
generate a string for a notification message, for example) may call 
IntToStr with an actual parameter that's a variable, but said variable 
is deterministic within the function and so would be replaced with an 
integer literal by the time "IntToStr" is evaluated.  If "IntToStr" can 
be successfully made a pure function, then that can be considered a 
milestone and other similar functions can follow.


Other possibilities:

for N := 1 to 10 do
  Y := N * SomePureFunc(X);

Here, X is an unknown local variable.  However, if it is not modified 
within the for-loop, SomePureFunc(X) will always return the same value 
(if it is actually a pure function), so the compiler could theoretically 
optimise it to the equivalent code:


Z := SomePureFunc(X);
for N := 1 to 10 do
  Y := N * Z;

This could be achieved through the tempcreate / tempref / tempdelete 
nodes that are often used when expanding inline functions.


Any suggestions, requests, bug reports or, heck, even a merge request, 
please send them my way!


Yours faithfully,

J. Gareth "Kit" Moreton
program pure1a;
{$MODE OBJFPC}
{$COPERATORS ON}

function Factorial(N: Cardinal): Cardinal; pure;
  var
    X: Integer;
  begin
    Result := 1;
    for X := N downto 2 do
      Result *= X;
  end;

begin
  WriteLn(Factorial(5));
end.

program pure1b;
{$MODE OBJFPC}

function Factorial(N: Cardinal): Cardinal; pure;
  begin
    if N < 2 then
      Result := 1
    else
      Result := N * Factorial(N - 1);
  end;

begin
  WriteLn(Factorial(5));
end.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

[fpc-devel] Request to add SHA-2 and Keccak (SHA-3) to hash package

2023-09-28 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

Given the current work on optimising and fixing the hash package, I 
would like to propose adding two additional families of hash functions 
to the package:


 * SHA-2: namely SHA-224 and SHA-256 (they essentially share the
   algorithm with some changes to the initial constants and final
   output), and SHA-384 and SHA-512 (also share an algorithm). Possibly
   also SHA-512/224 and SHA-512/256.
 * SHA-3, aka. Keccak.  Initially just SHA3-224, SHA3-256, SHA3-384 and
   SHA3-512.  There's quite a lot available here, like the SHAKE
   algorithms, and will probably take more time to implement.
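
As a small illustration of the "shared algorithm, different constants" point for SHA-224 and SHA-256, the sketch below stores the two variants as parameter sets for one compression routine (not shown). The record layout and names are assumptions for illustration, not the hash package's API; the initial hash values are those given in FIPS 180-4.

```pascal
program sha2params;
{$MODE OBJFPC}

type
  TSHA2State = array[0..7] of LongWord;

  { Hypothetical parameter record: everything that differs between the variants }
  TSHA2Variant = record
    Name: string;
    InitialState: TSHA2State;
    DigestBytes: Integer;  { SHA-224 truncates the final state to 28 bytes }
  end;

const
  Variants: array[0..1] of TSHA2Variant = (
    (Name: 'SHA-224';
     InitialState: ($c1059ed8, $367cd507, $3070dd17, $f70e5939,
                    $ffc00b31, $68581511, $64f98fa7, $befa4fa4);
     DigestBytes: 28),
    (Name: 'SHA-256';
     InitialState: ($6a09e667, $bb67ae85, $3c6ef372, $a54ff53a,
                    $510e527f, $9b05688c, $1f83d9ab, $5be0cd19);
     DigestBytes: 32)
  );

var
  I: Integer;
begin
  { List each variant's first state word and its output length }
  for I := Low(Variants) to High(Variants) do
    WriteLn(Variants[I].Name, ': H0 = $',
            HexStr(Int64(Variants[I].InitialState[0]), 8),
            ', digest = ', Variants[I].DigestBytes, ' bytes');
end.
```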

Would this be an acceptable addition?  If someone wants to take on the 
challenge, go for it.  I have a new work contract starting very soon so 
if I were to do it, it would take some time... that and I've got a 
backlog of merge requests for Free Pascal that I'm still waiting for 
approval or rejection on.


Kit

[fpc-devel] Some handy information regarding LEA instructions

2023-09-22 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I just discovered this while trying to optimise some of the hash 
functions.  This might already be known, but in case it isn't, here's 
something useful to know.


The LEA instruction is useful because you can essentially perform "x := 
y + z + const" with one instruction, or just "x := y + z" or "x := y + 
const" if none of the source and destination registers match.  However, 
on Sandy Bridge and later (not sure about AMD processors) the 3-operand 
version has a 3-cycle latency and only one execution port (reduces 
concurrency if there are nearby instructions that fetch addresses), but 
the 2-operand version (whether reg/reg or reg/const) has only a single 
cycle latency and can be dispatched to at least two different ports.


Long story short, if you have something like:

LEA ECX, [ECX + EAX + $f57c0faf]
ROL ECX, 7

There is a 2-cycle delay before the ROL instruction can be executed.  
However, if you expand LEA into two ADD instructions:


ADD ECX, EAX
ADD ECX, $f57c0faf
ROL ECX, 7

Though slightly larger, this triplet executes one cycle faster overall 
because there's no additional latency between the instructions.
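
For anyone who wants to convince themselves that the two sequences compute the same value, here is a small Pascal model of the arithmetic.  It is only a sketch: the function names and sample values are invented for illustration, and it says nothing about latency or port usage, which can only be measured on real hardware.  One difference the model cannot show is that ADD modifies the FLAGS register while LEA does not, so the expansion is only safe where FLAGS is not in use.

```pascal
program leamodel;
{$MODE OBJFPC}
{$Q-}{$R-}  { 32-bit wraparound is intended, as in the registers }

const
  K = LongWord($F57C0FAF);

{ Models: LEA ECX, [ECX + EAX + K]  /  ROL ECX, 7 }
function ViaLea(C, A: LongWord): LongWord;
begin
  Result := RolDWord(LongWord(C + A + K), 7);
end;

{ Models: ADD ECX, EAX  /  ADD ECX, K  /  ROL ECX, 7 }
function ViaAdds(C, A: LongWord): LongWord;
begin
  C := LongWord(C + A);
  C := LongWord(C + K);
  Result := RolDWord(C, 7);
end;

const
  { Arbitrary sample inputs, chosen to include wraparound cases. }
  Samples: array[0..3] of LongWord = (0, 1, $80000000, $FFFFFFFF);
var
  I, J: Integer;
begin
  for I := Low(Samples) to High(Samples) do
    for J := Low(Samples) to High(Samples) do
      if ViaLea(Samples[I], Samples[J]) <> ViaAdds(Samples[I], Samples[J]) then
      begin
        WriteLn('Mismatch');
        Halt(1);
      end;
  WriteLn('Both sequences agree on all samples');
end.
```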


The 3-operand LEA instruction is still useful in a few cases though:

    - If all the registers are different, since expanding it into 
arithmetic/logical instructions would require an additional MOV 
instruction, which doesn't offer any speed bonus and just increases 
code size.


    - In cases where the destination is the same as one of the source 
registers, as long as the destination isn't used for at least 3 cycles, 
then it is a saving (minimising concurrent uses of the AGU execution 
ports also helps).


    - And of course, if one of the registers has a scalar multiplier, 
then this is also faster than equivalent arithmetic/logical instructions.


With all this in mind I'll have a ponder about introducing a new 
peephole optimisation that expands potentially slow LEA instructions.


Kit


Re: [fpc-devel] x86_64 SHA1 implementation (J. Gareth Moreton)

2023-09-17 Thread J. Gareth Moreton via fpc-devel

I will admit that part of me likes to program my own implementations in 
assembly language if just for the practice, especially learning where 
latency and stalls happen.  The problem with most of the examples given 
in this chain is that they use a 'high-level' assembly language with 
macros and variables among other things, or use the SHA-specific 
opcodes.  The mORMot 2 example looks good for the SHA-specific opcodes 
though and I'll give that one a try.  It's under an MPL/GPL/LGPL licence 
so I think that's compatible for us.


Granted, at the very least, the "basic" x86_64 port of the i386 code 
works fine, so we always have this one to fall back on.


Kit

On 17/09/2023 08:37, Florian Klämpfl via fpc-devel wrote:



On 17.09.2023 at 08:45, Arnaud Bouchez via fpc-devel wrote:


There is a working SHA-1 and SHA-256 implementation using x86_64 asm and also 
SHA-NI in mORMot 2.

Numbers are very high, e.g. 2GB/s on my Core i5 13500.

Since there is no SHA opcode in FPC asm yet (nor in Delphi), I am using hardcoded 
"db" arrays for SHA-NI instructions.

Trunk should have them.


See 
https://github.com/synopse/mORMot2/blob/4c59b3c212c5efd2f440c1d7f61504ca832a5931/src/crypt/mormot.crypt.core.asmx64.inc#L1154
 for the SHA-NI asm.

Arnaud


Re: [fpc-devel] x86_64 SHA1 implementation

2023-09-16 Thread J. Gareth Moreton via fpc-devel

True, that's fair.  Just having this feature for the sake of compiling 
the "hash" package more efficiently is... well... favouritism?!  Or just 
a feature that has a single purpose.


Kit

On 16/09/2023 17:45, Florian Klämpfl via fpc-devel wrote:



On 16.09.2023 at 17:45, J. Gareth Moreton via fpc-devel wrote:

I missed this post - thanks Florian!

Indeed, SHA-1 is deprecated at least as far as being a cryptographic algorithm 
is concerned, but it still has some uses in data verification in a similar vein 
to MD5.  I know git uses it internally so server branches can't be corrupted.

I have probably spent too much time on SHA-1 already - its awkward size of 160 
bits has always irked me... not a clean power of two!

Speaking of the Intel SHA instructions, can I introduce a merge request that adds 
"CPUX86_HAS_SHA" as a feature flag?

You can, but in this case it is imo more useful to check at run time by using 
the cpu unit as the instructions are part of a procedure not generated by the 
compiler.


I know to add it for "cpu_zen" and later, but I'm not sure what the equivalent Intel 
processor is... is "cpu_core_avx2" okay or does there need to be a new one?

Kit

On 15/09/2023 22:48, Florian Klämpfl via fpc-devel wrote:

On 16.09.23 at 15:13, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

So this past week I've been building on Rika's work by adding an assembly 
version of SHA-1 for x86_64 to complement Rika's i386 version. So far I've 
successfully made a version that runs twice as fast as the Pascal code.  I 
hoped to go even faster by making use of the SSE2 instruction set, but 
currently the end result is slower even though computing the common parts of 4 
rounds simultaneously should be much faster.  This occurs even when I forgo 
writing to the stack and keep pretty much all of the state within registers.  
Preliminary investigation suggests that the slowdown comes from using MOVD/Q to 
transfer data between the XMM registers and general-purpose registers, since 
they are different parts of the CPU.  I'm still amazed it causes this much 
latency though.

I'll keep investigating and seeing if I can squeeze out more performance, but 
otherwise I may just have to fall back on a non-SIMD-optimised implementation.

As SHA-1 is basically deprecated and not recommended to be used anymore, I 
wouldn't spend too much time on this. Besides this, for SHA-1 and SHA-256, it 
might be even more useful to use the SHA CPU extensions if available. While 
they were only introduced with Ice Lake and Zen, they will become more and more 
available in the future.

Re: [fpc-devel] x86_64 SHA1 implementation

2023-09-16 Thread J. Gareth Moreton via fpc-devel


I missed this post - thanks Florian!

Indeed, SHA-1 is deprecated at least as far as being a cryptographic 
algorithm is concerned, but it still has some uses in data verification 
in a similar vein to MD5.  I know git uses it internally so server 
branches can't be corrupted.


I have probably spent too much time on SHA-1 already - its awkward size 
of 160 bits has always irked me... not a clean power of two!


Speaking of the Intel SHA instructions, can I introduce a merge request 
that adds "CPUX86_HAS_SHA" as a feature flag?  I know to add it for 
"cpu_zen" and later, but I'm not sure what the equivalent Intel 
processor is... is "cpu_core_avx2" okay or does there need to be a new one?


Kit

On 15/09/2023 22:48, Florian Klämpfl via fpc-devel wrote:

On 16.09.23 at 15:13, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

So this past week I've been building on Rika's work by adding an 
assembly version of SHA-1 for x86_64 to complement Rika's i386 
version. So far I've successfully made a version that runs twice as 
fast as the Pascal code.  I hoped to go even faster by making use of 
the SSE2 instruction set, but currently the end result is slower even 
though computing the common parts of 4 rounds simultaneously should 
be much faster.  This occurs even when I forgo writing to the stack 
and keep pretty much all of the state within registers.  Preliminary 
investigation suggests that the slowdown comes from using MOVD/Q to 
transfer data between the XMM registers and general-purpose 
registers, since they are different parts of the CPU.  I'm still 
amazed it causes this much latency though.


I'll keep investigating and seeing if I can squeeze out more 
performance, but otherwise I may just have to fall back on a 
non-SIMD-optimised implementation.


As SHA-1 is basically deprecated and not recommended to be used 
anymore, I wouldn't spend too much time on this. Besides this, for SHA-1 
and SHA-256, it might be even more useful to use the SHA CPU 
extensions if available. While they were only introduced with Ice Lake 
and Zen, they will become more and more available in the future.


Re: [fpc-devel] x86_64 SHA1 implementation

2023-09-16 Thread J. Gareth Moreton via fpc-devel

Thanks for the resources - these will prove very useful!  Intel and AMD 
processors also have specialised SHA instructions later on.  I know the 
AMD Zen supports them - not sure the earliest Intel models though.


Currently I'm sticking with pure SSE2 since this is the latest 
instruction set that is guaranteed to be available on all x86_64 
processors.  I can write versions for SSSE3 and AVX later, but currently 
I'm trying to identify the mysterious performance drops.


Kit

On 16/09/2023 16:18, Wayne Sherman wrote:

J. Gareth Moreton via fpc-devel  wrote:

So this past week I've been building on Rika's work by adding an
assembly version of SHA-1 for x86_64 to complement Rika's i386 version.
So far I've successfully made a version that runs twice as fast as the
Pascal code.  I hoped to go even faster by making use of the SSE2
instruction set...

In 2010 Intel published SSSE3 code to improve SHA1 performance.  Later
that year it was incorporated into OpenSSL ASM code.  The OpenSSL code
also includes AVX and SHA acceleration extensions.

Intel Article:
https://www.intel.com/content/www/us/en/developer/articles/technical/improving-the-performance-of-the-secure-hash-algorithm-1.html

Brief on Intel SHA extensions (also works for AMD Zen and later CPUs)
https://en.wikipedia.org/wiki/Intel_SHA_extensions

OpenSSL x86 64-bit assembly code and performance chart
https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha1-x86_64.pl

##
# Current performance is summarized in following table. Numbers are
# CPU clock cycles spent to process single byte (less is better).
#
#               x86_64      SSSE3       AVX[2]
# P4            9.05        -
# Opteron       6.26        -
# Core2         6.55        6.05/+8%    -
# Westmere      6.73        5.30/+27%   -
# Sandy Bridge  7.70        6.10/+26%   4.99/+54%
# Ivy Bridge    6.06        4.67/+30%   4.60/+32%
# Haswell       5.45        4.15/+31%   3.57/+53%
# Skylake       5.18        4.06/+28%   3.54/+46%
# Bulldozer     9.11        5.95/+53%
# Ryzen         4.75        3.80/+24%   1.93/+150%(**)
# VIA Nano      9.32        7.15/+30%
# Atom          10.3        9.17/+12%
# Silvermont    13.1(*)     9.37/+40%
# Knights L     13.2(*)     9.68/+36%   8.30/+59%
# Goldmont      8.13        6.42/+27%   1.70/+380%(**)
#
# (*) obviously suboptimal result, nothing was done about it,
# because SSSE3 code is compiled unconditionally;
# (**) SHAEXT result



[fpc-devel] x86_64 SHA1 implementation

2023-09-16 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

So this past week I've been building on Rika's work by adding an 
assembly version of SHA-1 for x86_64 to complement Rika's i386 version.  
So far I've successfully made a version that runs twice as fast as the 
Pascal code.  I hoped to go even faster by making use of the SSE2 
instruction set, but currently the end result is slower even though 
computing the common parts of 4 rounds simultaneously should be much 
faster.  This occurs even when I forgo writing to the stack and keep 
pretty much all of the state within registers.  Preliminary 
investigation suggests that the slowdown comes from using MOVD/Q to 
transfer data between the XMM registers and general-purpose registers, 
since they are different parts of the CPU.  I'm still amazed it causes 
this much latency though.


I'll keep investigating and seeing if I can squeeze out more 
performance, but otherwise I may just have to fall back on a 
non-SIMD-optimised implementation.


Kit


Re: [fpc-devel] Progress on pure functions

2023-08-17 Thread J. Gareth Moreton via fpc-devel

Getting there!  I'm having one sticking point, and that's with pure 
functions of the following design:


procedure Factorial(N: Cardinal; out R: Cardinal); pure;
  begin
    if N < 2 then
      R := 1
    else
      begin
        Factorial(N - 1, R);
        R *= N;
      end;
  end;

Because of the recursive nature with the "out" parameter, it gets 
flagged as "address taken" and, as a result, constant propagation will 
not function.  The deepest level of recursion works because it 
essentially collapses into just "R := 1" and this can be inserted in 
place of "Factorial(N - 1, R);" when N = 2, but constant propagation 
refuses to go any further.


I can't modify the symtable because pure functions may have to be called 
as regular functions (in the above example, if Factorial is called where 
the actual parameter for N is a variable, it won't be analysed) so I'm 
trying to work out if I can create a temporary procdef and symtable.  So 
far I haven't had much luck, but I'll keep up the work.
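
To make the two kinds of call site concrete, here is a minimal sketch that compiles with a regular Free Pascal release.  The "pure" directive is omitted because it is assumed to be available only with the in-development patch, and the use of ParamCount is just an illustrative way to get a value the compiler cannot know in advance.

```pascal
program pureout;
{$MODE OBJFPC}
{$COPERATORS ON}

{ Same shape as the procedure above, minus the "pure" directive. }
procedure Factorial(N: Cardinal; out R: Cardinal);
begin
  if N < 2 then
    R := 1
  else
  begin
    Factorial(N - 1, R);  { R is passed by reference at every level }
    R *= N;
  end;
end;

var
  V, W, M: Cardinal;
begin
  { Constant actual parameter: the kind of call that purity analysis
    would try to collapse into a simple assignment to V. }
  Factorial(5, V);
  if V <> 120 then
    Halt(1);

  { Variable actual parameter: not known at compile time, so this
    has to remain a regular call. }
  M := ParamCount + 5;
  Factorial(M, W);

  WriteLn(V, ' ', W);
end.
```

Both calls go through the same procedure definition, which is why the one symtable has to serve the analysed and the non-analysed case.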


Kit

On 16/08/2023 05:05, J. Gareth Moreton via fpc-devel wrote:
Fixed my problem with the recursive function (enabling range check and 
overflow errors blocked dead-store elimination, so I worked around 
that) and the warning no longer cascades.  Progress is being made!


Kit

On 16/08/2023 04:02, J. Gareth Moreton via fpc-devel wrote:
So managed to stop the cascade in a fairly clean way (it detects the 
difference in ErrorCount, marks the call node as erroneous and flags 
"codegenerror"), and it seems to work.


pure1a.pp(15,24) Error: Range check error while evaluating constants 
(6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Error: At least 1 error(s) occurred during the 
purity analysis of subroutine "Factorial".

pure1a.pp(17) Fatal: There were 2 errors compiling module, stopping
Fatal: Compilation aborted

The message says "at least X error(s)" because other errors may get 
masked.


My recursive version of the factorial function doesn't quite simplify 
properly yet (it compiles successfully, but the call is a regular 
call and a warning is generated):


pure1b.pp(15,23) Warning: Subroutine "Factorial" is not eligible to 
be a pure function due to the following reason: analysis did not 
produce simple assignment.


Being a warning, this cascades and is generated multiple times (the 
number of times depends on how deep the recursion goes). I'll work on 
suppressing the cascade too since when this warning is triggered, the 
"pure" flag is removed from the subroutine.


Currently the "analysis did not produce simple assignment" part is a 
hard-coded string and not a part of errore.msg, for example, so there 
may need to be a way to adapt this to be multi-lingual.


Kit

On 12/08/2023 18:14, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

So I'm still working on pure functions and have pushed some merge 
requests that are indirectly related to it, mostly simplifying the 
node tree so it can more easily be collapsed into simple assignments 
(what pure functions should simplify to).  Negative testing is still 
limited, but I have stumbled across one potential problem.  Using 
the following code example:


program pure1a;
{$MODE OBJFPC}
{$COPERATORS ON}

function Factorial(N: Cardinal): Cardinal; pure;
  var
    X: Integer;
  begin
    Result := 1;
    for X := N downto 2 do
      Result *= X;
  end;

begin
  WriteLn(Factorial(32));
end.

For those not familiar with the factorial function, the result 
increases in value very quickly to the point that 13! > 2^32 
(LongWord), 21! > 2^64 (QWord) and 70! > 10^100 (a googol), so 32! 
is guaranteed to overflow.


Having not really handled this eventuality just yet, I decided to 
see what happens when the compiler runs it as is:


pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (27048749056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (19184179200 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (32068960256 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (34071216128 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-288522240 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (4006445056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-5193400320 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-898433024 must be be

Re: [fpc-devel] Progress on pure functions

2023-08-15 Thread J. Gareth Moreton via fpc-devel

Fixed my problem with the recursive function (enabling range check and 
overflow errors blocked dead-store elimination, so I worked around that) 
and the warning no longer cascades.  Progress is being made!


Kit

On 16/08/2023 04:02, J. Gareth Moreton via fpc-devel wrote:
So managed to stop the cascade in a fairly clean way (it detects the 
difference in ErrorCount, marks the call node as erroneous and flags 
"codegenerror"), and it seems to work.


pure1a.pp(15,24) Error: Range check error while evaluating constants 
(6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Error: At least 1 error(s) occurred during the purity 
analysis of subroutine "Factorial".

pure1a.pp(17) Fatal: There were 2 errors compiling module, stopping
Fatal: Compilation aborted

The message says "at least X error(s)" because other errors may get 
masked.


My recursive version of the factorial function doesn't quite simplify 
properly yet (it compiles successfully, but the call is a regular call 
and a warning is generated):


pure1b.pp(15,23) Warning: Subroutine "Factorial" is not eligible to be 
a pure function due to the following reason: analysis did not produce 
simple assignment.


Being a warning, this cascades and is generated multiple times (the 
number of times depends on how deep the recursion goes). I'll work on 
suppressing the cascade too since when this warning is triggered, the 
"pure" flag is removed from the subroutine.


Currently the "analysis did not produce simple assignment" part is a 
hard-coded string and not a part of errore.msg, for example, so there 
may need to be a way to adapt this to be multi-lingual.


Kit

On 12/08/2023 18:14, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

So I'm still working on pure functions and have pushed some merge 
requests that are indirectly related to it, mostly simplifying the 
node tree so it can more easily be collapsed into simple assignments 
(what pure functions should simplify to).  Negative testing is still 
limited, but I have stumbled across one potential problem.  Using the 
following code example:


program pure1a;
{$MODE OBJFPC}
{$COPERATORS ON}

function Factorial(N: Cardinal): Cardinal; pure;
  var
    X: Integer;
  begin
    Result := 1;
    for X := N downto 2 do
      Result *= X;
  end;

begin
  WriteLn(Factorial(32));
end.

For those not familiar with the factorial function, the result 
increases in value very quickly to the point that 13! > 2^32 
(LongWord), 21! > 2^64 (QWord) and 70! > 10^100 (a googol), so 32! is 
guaranteed to overflow.


Having not really handled this eventuality just yet, I decided to see 
what happens when the compiler runs it as is:


pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (27048749056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (19184179200 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (32068960256 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (34071216128 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-288522240 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (4006445056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-5193400320 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-898433024 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (3396534272 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-17070227456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-2102132736 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (2192834560 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-44144787456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-1195114496 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (3099852800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-26292518912 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating 
constants (-522715136 mu

Re: [fpc-devel] Progress on pure functions

2023-08-15 Thread J. Gareth Moreton via fpc-devel

So managed to stop the cascade in a fairly clean way (it detects the 
difference in ErrorCount, marks the call node as erroneous and flags 
"codegenerror"), and it seems to work.


pure1a.pp(15,24) Error: Range check error while evaluating constants 
(6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Error: At least 1 error(s) occurred during the purity 
analysis of subroutine "Factorial".

pure1a.pp(17) Fatal: There were 2 errors compiling module, stopping
Fatal: Compilation aborted

The message says "at least X error(s)" because other errors may get masked.

My recursive version of the factorial function doesn't quite simplify 
properly yet (it compiles successfully, but the call is a regular call 
and a warning is generated):


pure1b.pp(15,23) Warning: Subroutine "Factorial" is not eligible to be a 
pure function due to the following reason: analysis did not produce 
simple assignment.


Being a warning, this cascades and is generated multiple times (the 
number of times depends on how deep the recursion goes). I'll work on 
suppressing the cascade too since when this warning is triggered, the 
"pure" flag is removed from the subroutine.


Currently the "analysis did not produce simple assignment" part is a 
hard-coded string and not a part of errore.msg, for example, so there 
may need to be a way to adapt this to be multi-lingual.


Kit

On 12/08/2023 18:14, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

So I'm still working on pure functions and have pushed some merge 
requests that are indirectly related to it, mostly simplifying the 
node tree so it can more easily be collapsed into simple assignments 
(what pure functions should simplify to).  Negative testing is still 
limited, but I have stumbled across one potential problem.  Using the 
following code example:


program pure1a;
{$MODE OBJFPC}
{$COPERATORS ON}

function Factorial(N: Cardinal): Cardinal; pure;
  var
    X: Integer;
  begin
    Result := 1;
    for X := N downto 2 do
      Result *= X;
  end;

begin
  WriteLn(Factorial(32));
end.

For those not familiar with the factorial function, the result 
increases in value very quickly to the point that 13! > 2^32 
(LongWord), 21! > 2^64 (QWord) and 70! > 10^100 (a googol), so 32! is 
guaranteed to overflow.


Having not really handled this eventuality just yet, I decided to see 
what happens when the compiler runs it as is:


pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(27048749056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(19184179200 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(32068960256 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(34071216128 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-288522240 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(4006445056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-5193400320 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-898433024 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3396534272 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-17070227456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-2102132736 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(2192834560 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-44144787456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-1195114496 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3099852800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-26292518912 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-522715136 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3772252160 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-12022448128 must be between -2147483

[fpc-devel] Progress on pure functions

2023-08-12 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

So I'm still working on pure functions and have pushed some merge 
requests that are indirectly related to it, mostly simplifying the node 
tree so it can more easily be collapsed into simple assignments (what 
pure functions should simplify to).  Negative testing is still limited, 
but I have stumbled across one potential problem.  Using the following 
code example:


program pure1a;
{$MODE OBJFPC}
{$COPERATORS ON}

function Factorial(N: Cardinal): Cardinal; pure;
  var
    X: Integer;
  begin
    Result := 1;
    for X := N downto 2 do
      Result *= X;
  end;

begin
  WriteLn(Factorial(32));
end.

For those not familiar with the factorial function, the result increases 
in value very quickly to the point that 13! > 2^32 (LongWord), 21! > 
2^64 (QWord) and 70! > 10^100 (a googol), so 32! is guaranteed to overflow.
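
The 32-bit and 64-bit thresholds can be double-checked with a few lines of Pascal.  This is just a sketch; FirstOverflow is a hypothetical helper name, and it avoids overflowing itself by comparing against Limit div N before multiplying.

```pascal
program factlimits;
{$MODE OBJFPC}

{ Returns the smallest N for which N! exceeds Limit.
  Invariant at the loop test: F = (N - 1)!.  The test F <= Limit div N
  is equivalent to F * N <= Limit, but cannot overflow. }
function FirstOverflow(Limit: QWord): QWord;
var
  F, N: QWord;
begin
  F := 1;
  N := 2;
  while F <= Limit div N do
  begin
    F := F * N;  { F is now N! }
    Inc(N);
  end;
  Result := N;
end;

begin
  WriteLn(FirstOverflow(High(LongWord)));  { prints 13 }
  WriteLn(FirstOverflow(High(QWord)));     { prints 21 }
end.
```

Since 13! = 6227020800 already exceeds the LongWord range, every later factorial up to 32! must overflow a Cardinal result as well.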


Having not really handled this eventuality just yet, I decided to see 
what happens when the compiler runs it as is:


pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(6227020800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(27048749056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(19184179200 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(32068960256 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(34071216128 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-288522240 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(4006445056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-5193400320 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-898433024 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3396534272 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-17070227456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-2102132736 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(2192834560 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-44144787456 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-1195114496 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3099852800 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-26292518912 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-522715136 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3772252160 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-12022448128 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(20698890240 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-775946240 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(3519021056 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-19398656000 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(53980692480 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-1853882368 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(2441084928 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-50054823936 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(41573941248 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-1375731712 must be between 0 and 4294967295)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(2919235584 must be between -2147483648 and 2147483647)
pure1a.pp(15,24) Warning: Range check error while evaluating constants 
(-39896219648 must be between -2147483648 and

[fpc-devel] "Ordinal expression expected" awkwardness

2023-07-19 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

So I've come across a bit of awkwardness with the compiler.  I'm not 
sure if it's a well-defined rule that I've overlooked, but in a 
for-loop, you can't use a 64-bit control variable when compiling for 
i386-win32 (and presumably other 32-bit platforms), but you can under 
x86_64-win64.  In my case, the upper bound is a 64-bit variable (of type 
TConstExprInt... a record type in the compiler source), so downsizing is 
not ideal (although I will likely put in code that will error out if the 
upper bound is too large due to the risk of malicious inputs causing 
said bound to equal 2^63 - 1 or 2^64 - 1).


My example aside, should it be that there's a situation where pure Free 
Pascal code can build on a 64-bit compiler but not a 32-bit compiler?  
If permitting 64-bit control variables is too difficult for 32-bit 
systems, should they be forbidden entirely or at least throw a warning?


Kit

P.S. As the title implies, trying to use a QWord or Int64 as a for-loop 
control variable under i386-win32 causes an "Ordinal expression 
expected" error, but compiles without incident on x86_64-win64.
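
One way to keep such a loop portable across 32-bit and 64-bit targets is to avoid the for statement altogether. The following is a minimal sketch, not taken from the compiler source: SumTo and its parameter are hypothetical names, and a plain Int64 stands in for TConstExprInt.

```pascal
program Loop64Sketch;
{$mode objfpc}

{ Hypothetical example: sums the values 1..Upper using a 64-bit counter.
  A while loop is used instead of "for", so the source does not depend on
  whether the target accepts a 64-bit for-loop control variable. }
function SumTo(Upper: Int64): Int64;
var
  I: Int64;
begin
  Result := 0;
  I := 1;
  while I <= Upper do
  begin
    Result := Result + I;
    Inc(I);
  end;
end;

begin
  WriteLn(SumTo(100000));
end.
```

Note that with overflow checking off, a bound of High(Int64) would make Inc(I) wrap and the loop would never terminate, so an oversized upper bound of the kind mentioned above still has to be rejected beforehand.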


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Division nodes

2023-05-23 Thread J. Gareth Moreton via fpc-devel

I just repeated the test on i386-win32 and it's the same thing... with 
one notable exception:


procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := LongInt(N) div LongInt(D);
  R := LongInt(N) mod LongInt(D);
end;

This reduces the operation to 32 bits, but does NOT generate a 
conditional check for a divisor of -1, so LongInt($80000000) div 
LongInt(-1) will raise an exception on i386-win32, but won't raise one on 
x86_64-win64, thus behaviour differs between platforms.


Can others confirm this?

Kit

On 21/05/2023 00:00, J. Gareth Moreton via fpc-devel wrote:

Hi Florian,

Can I have a specific code example where this is absolutely necessary, 
and, if applicable, a target where it's known to cause a problem 
otherwise?  I've tried to create the example listed in the e-mail with 
the following:


procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := N div D;
  R := N mod D;
end;

I'm testing on x86_64-win64.  In this case, the "expanding to 64-bit" 
is necessary and so "doremoveinttypeconvs" is never called, thus the 
condition is not inserted.  The "idiv" operation is 64-bit and the 
result downsized to 32-bit - this also occurs if I do an explicit 
typecast on the result:


procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := LongInt(N div D);
  R := LongInt(N mod D);
end;

However, if I do an explicit typecast on the operands...

procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := LongInt(N) div LongInt(D);
  R := LongInt(N) mod LongInt(D);
end;

... then "doremoveinttypeconvs" is called and the condition inserted.  
I would argue here though that typecasting a value to a LongInt, where 
an output of $80000000 is a real possibility, should raise an 
exception if you try to divide it by -1 since the programmer is asking 
to downsize values that could potentially be out of range.


Kit

On 19/05/2023 21:55, Florian Klämpfl via fpc-devel wrote:

Am 19.05.23 um 21:14 schrieb J. Gareth Moreton via fpc-devel:
So I need to ask... should the check for a divisor of -1 still be 
performed? 


Yes. This is the result of "down sizing" a division. In case of

longint(int64 div int64) can be converted only into longint(int64) 
div longint(int64) if this check is carried out. longint($8000 
div $) must silently result in $800 in this case.


The case of doing "min_int div -1", even with unsigned-to-signed 
typecasting, seems very contrived and the programmer should expect 
problems if "min_int" and "-1" appear as the operands.  Is there a 
specific example where this implicit check is absolutely necessary?  
As others have pointed out, silently returning "min_int" as the 
answer seems more unexpected (granted this is just the behaviour of 
an optimisation that converts the nodes equating to "x div -1" to 
"-x", and Intel's NEG instruction doesn't return an error if min_int 
is its input operand, but I can't be sure if the same applies to 
non-Intel processors and their equivalent instructions).


Kit

On 17/05/2023 09:51, J. Gareth Moreton via fpc-devel wrote:
Logically yes, but using 16-bit as an example, min_int is -32,768, 
and signed 16-bit integers range from -32,768 to 32,767. So -32,768 
÷ -1 = 32,768, which is out of range.  This is where the problem lies.


Internally, negation involves inverting all of the bits and then 
adding 1 (essentially how you subtract a number using two's 
complement), so min_int, which is 1000000000000000, becomes 
0111111111111111 and then, after incrementing, 1000000000000000, 
which is min_int again.


Kit

On 16/05/2023 13:13, Jean SUZINEAU via fpc-devel wrote:

Le 16/05/2023 à 01:47, Stefan Glienke via fpc-devel a écrit :

min_int div -1


"min_int div -1"  should give  "- min_int" ?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-20 Thread J. Gareth Moreton via fpc-devel


Hi Florian,

Can I have a specific code example where this is absolutely necessary, 
and, if applicable, a target where it's known to cause a problem 
otherwise?  I've tried to create the example listed in the e-mail with 
the following:


procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := N div D;
  R := N mod D;
end;

I'm testing on x86_64-win64.  In this case, the "expanding to 64-bit" is 
necessary and so "doremoveinttypeconvs" is never called, thus the 
condition is not inserted.  The "idiv" operation is 64-bit and the 
result downsized to 32-bit - this also occurs if I do an explicit 
typecast on the result:


procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := LongInt(N div D);
  R := LongInt(N mod D);
end;

However, if I do an explicit typecast on the operands...

procedure DoDivMod(N, D: Int64; out Q, R: LongInt); noinline;
begin
  Q := LongInt(N) div LongInt(D);
  R := LongInt(N) mod LongInt(D);
end;

... then "doremoveinttypeconvs" is called and the condition inserted.  I 
would argue here though that typecasting a value to a LongInt, where an 
output of $80000000 is a real possibility, should raise an exception if 
you try to divide it by -1 since the programmer is asking to downsize 
values that could potentially be out of range.


Kit

On 19/05/2023 21:55, Florian Klämpfl via fpc-devel wrote:

Am 19.05.23 um 21:14 schrieb J. Gareth Moreton via fpc-devel:
So I need to ask... should the check for a divisor of -1 still be 
performed? 


Yes. This is the result of "down sizing" a division. In case of

longint(int64 div int64) can be converted only into longint(int64) div 
longint(int64) if this check is carried out. longint($8000 div 
$) must silently result in $800 in this case.


The case of doing "min_int div -1", even with unsigned-to-signed 
typecasting, seems very contrived and the programmer should expect 
problems if "min_int" and "-1" appear as the operands.  Is there a 
specific example where this implicit check is absolutely necessary?  
As others have pointed out, silently returning "min_int" as the 
answer seems more unexpected (granted this is just the behaviour of 
an optimisation that converts the nodes equating to "x div -1" to 
"-x", and Intel's NEG instruction doesn't return an error if min_int 
is its input operand, but I can't be sure if the same applies to 
non-Intel processors and their equivalent instructions).


Kit

On 17/05/2023 09:51, J. Gareth Moreton via fpc-devel wrote:
Logically yes, but using 16-bit as an example, min_int is -32,768, 
and signed 16-bit integers range from -32,768 to 32,767. So -32,768 
÷ -1 = 32,768, which is out of range.  This is where the problem lies.


Internally, negation involves inverting all of the bits and then 
adding 1 (essentially how you subtract a number using two's 
complement), so min_int, which is 1000000000000000, becomes 
0111111111111111 and then, after incrementing, 1000000000000000, 
which is min_int again.


Kit

On 16/05/2023 13:13, Jean SUZINEAU via fpc-devel wrote:

Le 16/05/2023 à 01:47, Stefan Glienke via fpc-devel a écrit :

min_int div -1


"min_int div -1"  should give  "- min_int" ?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-19 Thread J. Gareth Moreton via fpc-devel

That's useful to know - thanks Florian.  So it's possible to forego the 
-1 check if no downsizing occurs.


I suppose it makes sense... if we go by the signed 64-bit equivalents, 
$FFFFFFFF80000000 div $FFFFFFFFFFFFFFFF = $0000000080000000, which when 
typecast to a LongInt does result in a signed overflow that isn't 
checked without -Co, so the answer is $80000000.  On the flip side, 
$0000000080000000 div $FFFFFFFFFFFFFFFF = $FFFFFFFF80000000, which when 
typecast to a LongInt is also equal to $80000000, this time without an 
overflow.  Even though the original operand of $0000000080000000 is out 
of range for a LongInt, it's perfectly fine as an Int64.


If my logic is correct, the -1 check is not needed in the following 
conditions:


* The division is being upsized.
* There is no change in size and neither source is unsigned (e.g. 
Cardinal --> LongInt would require the check).
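
The two conditions above can be sketched as a small predicate. This is only an illustration of the reasoning, assuming it holds; NeedsMinusOneCheck and its parameters are hypothetical and do not exist in ncnv.pas.

```pascal
program MinusOneCheckSketch;
{$mode objfpc}

{ Hypothetical helper, not compiler code.  SrcBits is the width of the
  original operands, DstBits the width the division is performed at after
  the type conversions are removed, and AnySrcUnsigned tells whether
  either original operand was unsigned. }
function NeedsMinusOneCheck(SrcBits, DstBits: Integer;
  AnySrcUnsigned: Boolean): Boolean;
begin
  if DstBits > SrcBits then
    Result := False            { upsized: check not needed }
  else if DstBits = SrcBits then
    Result := AnySrcUnsigned   { same size: needed only for an unsigned source }
  else
    Result := True;            { downsized: keep the check }
end;

begin
  WriteLn(NeedsMinusOneCheck(32, 64, False)); { upsizing }
  WriteLn(NeedsMinusOneCheck(32, 32, False)); { LongInt div LongInt }
  WriteLn(NeedsMinusOneCheck(32, 32, True));  { Cardinal --> LongInt }
  WriteLn(NeedsMinusOneCheck(64, 32, False)); { Int64 --> LongInt }
end.
```

Only the last two calls report that the check is needed, which matches the case of LongInt(LongInt div LongInt) being exempt.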


Currently, as given in my original code example, LongInt(LongInt div 
LongInt) is given the -1 check (assuming Integer = LongInt), and this 
just seems like a waste.  Would you approve this change Florian? (I'll 
also add a comment to explain about the downsizing situation)


Kit



On 19/05/2023 21:55, Florian Klämpfl via fpc-devel wrote:

Am 19.05.23 um 21:14 schrieb J. Gareth Moreton via fpc-devel:
So I need to ask... should the check for a divisor of -1 still be 
performed? 


Yes. This is the result of "down sizing" a division. In case of

longint(int64 div int64) can be converted only into longint(int64) div 
longint(int64) if this check is carried out. longint($8000 div 
$) must silently result in $800 in this case.


The case of doing "min_int div -1", even with unsigned-to-signed 
typecasting, seems very contrived and the programmer should expect 
problems if "min_int" and "-1" appear as the operands.  Is there a 
specific example where this implicit check is absolutely necessary?  
As others have pointed out, silently returning "min_int" as the 
answer seems more unexpected (granted this is just the behaviour of 
an optimisation that converts the nodes equating to "x div -1" to 
"-x", and Intel's NEG instruction doesn't return an error if min_int 
is its input operand, but I can't be sure if the same applies to 
non-Intel processors and their equivalent instructions).


Kit

On 17/05/2023 09:51, J. Gareth Moreton via fpc-devel wrote:
Logically yes, but using 16-bit as an example, min_int is -32,768, 
and signed 16-bit integers range from -32,768 to 32,767. So -32,768 
÷ -1 = 32,768, which is out of range.  This is where the problem lies.


Internally, negation involves inverting all of the bits and then 
adding 1 (essentially how you subtract a number using two's 
complement), so min_int, which is 1000000000000000, becomes 
0111111111111111 and then, after incrementing, 1000000000000000, 
which is min_int again.


Kit

On 16/05/2023 13:13, Jean SUZINEAU via fpc-devel wrote:

Le 16/05/2023 à 01:47, Stefan Glienke via fpc-devel a écrit :

min_int div -1


"min_int div -1"  should give  "- min_int" ?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-19 Thread J. Gareth Moreton via fpc-devel

So I need to ask... should the check for a divisor of -1 still be 
performed? The case of doing "min_int div -1", even with 
unsigned-to-signed typecasting, seems very contrived and the programmer 
should expect problems if "min_int" and "-1" appear as the operands.  Is 
there a specific example where this implicit check is absolutely 
necessary?  As others have pointed out, silently returning "min_int" as 
the answer seems more unexpected (granted this is just the behaviour of 
an optimisation that converts the nodes equating to "x div -1" to "-x", 
and Intel's NEG instruction doesn't return an error if min_int is its 
input operand, but I can't be sure if the same applies to non-Intel 
processors and their equivalent instructions).


Kit

On 17/05/2023 09:51, J. Gareth Moreton via fpc-devel wrote:
Logically yes, but using 16-bit as an example, min_int is -32,768, and 
signed 16-bit integers range from -32,768 to 32,767. So -32,768 ÷ -1 = 
32,768, which is out of range.  This is where the problem lies.


Internally, negation involves inverting all of the bits and then 
adding 1 (essentially how you subtract a number using two's 
complement), so min_int, which is 1000000000000000, becomes 
0111111111111111 and then, after incrementing, 1000000000000000, which 
is min_int again.


Kit

On 16/05/2023 13:13, Jean SUZINEAU via fpc-devel wrote:

Le 16/05/2023 à 01:47, Stefan Glienke via fpc-devel a écrit :

min_int div -1


"min_int div -1"  should give  "- min_int" ?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-17 Thread J. Gareth Moreton via fpc-devel

Logically yes, but using 16-bit as an example, min_int is -32,768, and 
signed 16-bit integers range from -32,768 to 32,767. So -32,768 ÷ -1 = 
32,768, which is out of range.  This is where the problem lies.


Internally, negation involves inverting all of the bits and then adding 
1 (essentially how you subtract a number using two's complement), so 
min_int, which is 1000000000000000, becomes 0111111111111111 and then, 
after incrementing, 1000000000000000, which is min_int again.
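
The invert-and-add-one step can be reproduced on a 16-bit value. This is a minimal sketch, assuming range and overflow checks are switched off; the explicit masks with $FFFF keep the intermediate results at 16 bits regardless of how the compiler widens the expression.

```pascal
program NegateSketch;
{$mode objfpc}
{$R-}{$Q-}  { range and overflow checks off so the wrap-around is visible }

uses
  SysUtils;

var
  W: Word;
begin
  W := $8000;               { bit pattern of the 16-bit min_int, -32768 }
  W := (not W) and $FFFF;   { invert all bits: $7FFF }
  W := (W + 1) and $FFFF;   { add 1: back to $8000 }
  WriteLn(IntToHex(W, 4));  { the bit pattern after negation }
  WriteLn(SmallInt(W));     { the same bits read as a signed 16-bit value }
end.
```

The program should print 8000 followed by -32768, i.e. negating min_int leaves it unchanged.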


Kit

On 16/05/2023 13:13, Jean SUZINEAU via fpc-devel wrote:

Le 16/05/2023 à 01:47, Stefan Glienke via fpc-devel a écrit :

min_int div -1


"min_int div -1"  should give  "- min_int" ?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-15 Thread J. Gareth Moreton via fpc-devel


Thank you!  I knew it must have been something simple like that!

I just did a quick test with x86 assembly language - this is the code 
that used to run for dividing min_int by -1:


MOV EAX, $80000000
NEG EAX

EAX contains $80000000 as a result.  When using IDIV:

MOV EAX, $80000000
MOV EDX, $FFFFFFFF // Sign extension
MOV ECX, -1
IDIV ECX

Exception SIGFPE raised.

I understand now.  Thanks for pointing out min_int div -1.  I'll 
experiment a bit more though to see if generating the setup is necessary 
in all circumstances (e.g. if an unsigned operand is being expanded into 
a larger signed value, it isn't necessary). At the very least, I'll add 
a new test for this and also put a comment in the code so it's clear why 
-1 needs special treatment (the fact I got no regressions means this 
very specific set-up is not tested properly).
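
For reference, the effect of the special case can be modelled in plain Pascal. This is a hypothetical sketch of the semantics only (GuardedDiv and GuardedMod are made-up names), not the node tree the compiler actually builds, and it assumes overflow and range checks are off.

```pascal
program GuardedDivSketch;
{$mode objfpc}
{$R-}{$Q-}  { checks off: negating Low(LongInt) should wrap rather than raise }

{ Hypothetical model of the guarded division: a divisor of -1 is handled
  by negation, everything else goes through the ordinary division. }
function GuardedDiv(N, D: LongInt): LongInt;
begin
  if D = -1 then
    Result := -N
  else
    Result := N div D;
end;

{ Same idea for the remainder: x mod -1 is always 0. }
function GuardedMod(N, D: LongInt): LongInt;
begin
  if D = -1 then
    Result := 0
  else
    Result := N mod D;
end;

var
  MinInt, MinusOne: LongInt;
begin
  MinInt := Low(LongInt);
  MinusOne := -1;
  WriteLn(GuardedDiv(MinInt, MinusOne));
  WriteLn(GuardedMod(MinInt, MinusOne));
  WriteLn(GuardedDiv(7, MinusOne));
end.
```

With the guard in place, Low(LongInt) div -1 yields Low(LongInt) again, matching the NEG result above, instead of reaching IDIV and raising SIGFPE.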


Kit

On 15/05/2023 17:43, Kirinn via fpc-devel wrote:

I didn't see a mention of it in this discussion thread, but dividing the
largest negative integer by -1 does cause an overflow error of some sort,
because of two's complement.

Just to make sure, do (min_int div -1) and (min_int mod -1) behave the same
way before and after this proposed change? And, what min_int is of course
depends on whether targeting a 32-bit or 64-bit system, so best check both
cases.

~Kirinn

On Mon, 15 May 2023 17:21:30 +0100
"J. Gareth Moreton via fpc-devel"  wrote:


I made a merge request that removes the comparison against -1.
x86_64-win64 and i386-win32 pass without any regressions, so the reason
behind the change still eludes me.  Maybe my fix for #39646 removes the
need for it?

https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/418

Kit

On 11/05/2023 23:04, J. Gareth Moreton via fpc-devel wrote:

Fair enough, but I would like an explanation as to why it's necessary,
because the reason for testing -1 in particular is very unclear, and I
wonder if there's a known misbehaviour with a particular division
function with -1.

Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-15 Thread J. Gareth Moreton via fpc-devel

I made a merge request that removes the comparison against -1. 
x86_64-win64 and i386-win32 pass without any regressions, so the reason 
behind the change still eludes me.  Maybe my fix for #39646 removes the 
need for it?


https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/418

Kit

On 11/05/2023 23:04, J. Gareth Moreton via fpc-devel wrote:
Fair enough, but I would like an explanation as to why it's necessary, 
because the reason for testing -1 in particular is very unclear, and I 
wonder if there's a known misbehaviour with a particular division 
function with -1.


Kit

On 11/05/2023 21:27, Wayne Sherman wrote:

On Thu, May 11, 2023 at 11:42 AM J. Gareth Moreton wrote:

This is the code block in question (ncnv.pas, starting at line 3397)

The git "blame" function shows who last made changes:
https://gitlab.com/freepascal.org/fpc/source/-/blame/main/compiler/ncnv.pas?page=4#L3396 



Most of that code was added 2 years ago in this commit:
https://gitlab.com/freepascal.org/fpc/source/-/commit/ea11517d27fa00f40b626e47213f0caa8832d155 




___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Division nodes

2023-05-11 Thread J. Gareth Moreton via fpc-devel

Fair enough, but I would like an explanation as to why it's necessary, 
because the reason for testing -1 in particular is very unclear, and I 
wonder if there's a known misbehaviour with a particular division 
function with -1.


Kit

On 11/05/2023 21:27, Wayne Sherman wrote:

On Thu, May 11, 2023 at 11:42 AM J. Gareth Moreton wrote:

This is the code block in question (ncnv.pas, starting at line 3397)

The git "blame" function shows who last made changes:
https://gitlab.com/freepascal.org/fpc/source/-/blame/main/compiler/ncnv.pas?page=4#L3396

Most of that code was added 2 years ago in this commit:
https://gitlab.com/freepascal.org/fpc/source/-/commit/ea11517d27fa00f40b626e47213f0caa8832d155


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Division nodes

2023-05-11 Thread J. Gareth Moreton via fpc-devel

This is the code block in question (ncnv.pas, starting at line 3397) - 
if anyone can explain why it has to be set up this way, or add comments 
to the code, I will be most grateful (it's run for the following node 
types: subn, addn, muln, divn, modn, xorn, andn, orn, shln, shrn):


  exclude(n.flags,nf_internal);
  if not forceunsigned and
     is_signed(n.resultdef) then
    begin
      originaldivtree:=nil;
      if n.nodetype in [divn,modn] then
        originaldivtree:=n.getcopy;
      doremoveinttypeconvs(level+1,tbinarynode(n).left,signedtype,false,signedtype,unsignedtype);
      doremoveinttypeconvs(level+1,tbinarynode(n).right,signedtype,false,signedtype,unsignedtype);
      n.resultdef:=signedtype;
      if n.nodetype in [divn,modn] then
        begin
          newblock:=internalstatements(newstatements);
          tempnode:=ctempcreatenode.create(n.resultdef,n.resultdef.size,tt_persistent,true);
          addstatement(newstatements,tempnode);
          addstatement(newstatements,cifnode.create_internal(
            caddnode.create_internal(equaln,tbinarynode(n).right.getcopy,cordconstnode.create(-1,n.resultdef,false)),
              cassignmentnode.create_internal(
                ctemprefnode.create(tempnode),
                cmoddivnode.create(n.nodetype,tbinarynode(originaldivtree).left.getcopy,cordconstnode.create(-1,tbinarynode(originaldivtree).right.resultdef,false))
              ),
              cassignmentnode.create_internal(
                ctemprefnode.create(tempnode),n
              )
            )
          );
          addstatement(newstatements,ctempdeletenode.create_normal_temp(tempnode));
          addstatement(newstatements,ctemprefnode.create(tempnode));
          n:=newblock;
          do_typecheckpass(n);
          originaldivtree.free;
        end;
    end

(the new division/modulus by -1 is then converted elsewhere)

Kit

On 11/05/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
P.S. I found the code that adds the conditional checks; it's 
"doremoveinttypeconvs" in the ncnv unit.  However, it's very unclear 
as to WHY it's doing it as there are no comments around the code block.


Kit

On 11/05/2023 15:39, J. Gareth Moreton via fpc-devel wrote:
It does seem odd.  In a practical sense, the only time I can see -1 
being a common input among other random numbers is if it's an error 
value, in which case you would most likely do special handling rather 
than pass it through a division operation.  With the slowdown that 
comes from additional branch prediction, it just seems like 
unnecessary fluff, but I need to double-check to see if there's a 
very good reason behind their generation (if it's a platform-specific 
problem, it should be moved to that platform's specific first pass).  
Now I just need to find out where those nodes are generated - they're 
proving elusive!


Note that using constant divisors uses a different optimisation, so 
this only applies to variable divisors.


Kit

On 11/05/2023 12:07, Stefan Glienke via fpc-devel wrote:
Looks like a rather disadvantageous way to avoid the idiv 
instruction because x div -1 = -x and x mod -1 = 0.


I ran a quick benchmark doing a lot of integer divisions where 
sometimes (randomly) the divisor was -1. When the occurrence was rare 
enough (~5%), performance was not impacted; the more often -1 
occurred, the slower it became, down to almost half as fast. Only 
when fewer than 5% of the divisors were *not* -1 was the performance 
better, up to twice as fast when all divisors were -1. Of course 
YMMV, as it depends on the CPU and the branch predictor behavior, but 
it shows that this "optimization" is hardly any good.


I cannot think of a realistic case where 95% of your divisors are -1 
and you really need to save those few extra cycles of calling idiv.


On 11/05/2023 11:04 CEST J. Gareth Moreton via fpc-devel 
 wrote:


  Hi everyone,

I need to ask a question about how division nodes are set up (I'm
looking at possible optimisation techniques).  I've written the
following procedure:

procedure DoDivMod(N, D: Integer; out Q, R: Integer);
begin
    Q := N div D;
    R := N mod D;
end;

Fairly simple and to the point.  However, even before the first node
pass, the following node tree is generated for an integer division
operation:


     
    
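[Node tree dump; XML tags lost in archiving. Surviving fields, in order:
flags="nf_internal"; operands D and -1 (rangecheck="FALSE"); a LongInt temp
(id="$7C585E10", flags="nf_write", ti_may_be_in_reg, tt_persistent) with
operand N; the same temp again with operand N, where the quoted dump ends.]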

Re: [fpc-devel] Division nodes

2023-05-11 Thread J. Gareth Moreton via fpc-devel

P.S. I found the code that adds the conditional checks; it's 
"doremoveinttypeconvs" in the ncnv unit.  However, it's very unclear as 
to WHY it's doing it as there are no comments around the code block.


Kit

On 11/05/2023 15:39, J. Gareth Moreton via fpc-devel wrote:
It does seem odd.  In a practical sense, the only time I can see -1 
being a common input among other random numbers is if it's an error 
value, in which case you would most likely do special handling rather 
than pass it through a division operation.  With the slowdown that 
comes from additional branch prediction, it just seems like 
unnecessary fluff, but I need to double-check to see if there's a very 
good reason behind their generation (if it's a platform-specific 
problem, it should be moved to that platform's specific first pass).  
Now I just need to find out where those nodes are generated - they're 
proving elusive!


Note that using constant divisors uses a different optimisation, so 
this only applies to variable divisors.


Kit

On 11/05/2023 12:07, Stefan Glienke via fpc-devel wrote:
Looks like a rather disadvantageous way to avoid the idiv instruction 
because x div -1 = -x and x mod -1 = 0.


I ran a quick benchmark doing a lot of integer divisions where 
sometimes (randomly) the divisor was -1. When the occurrence was rare 
enough (~5%), performance was not impacted; the more often -1 
occurred, the slower it became, down to almost half as fast. Only 
when fewer than 5% of the divisors were *not* -1 was the performance 
better, up to twice as fast when all divisors were -1. Of course 
YMMV, as it depends on the CPU and the branch predictor behavior, but 
it shows that this "optimization" is hardly any good.


I cannot think of a realistic case where 95% of your divisors are -1 
and you really need to save those few extra cycles of calling idiv.


On 11/05/2023 11:04 CEST J. Gareth Moreton via fpc-devel 
 wrote:


  Hi everyone,

I need to ask a question about how division nodes are set up (I'm
looking at possible optimisation techniques).  I've written the
following procedure:

procedure DoDivMod(N, D: Integer; out Q, R: Integer);
begin
    Q := N div D;
    R := N mod D;
end;

Fairly simple and to the point.  However, even before the first node
pass, the following node tree is generated for an integer division
operation:


     
    
   
      
[Node tree dump; XML tags lost in archiving. Surviving fields, in order:
operands D and -1 (rangecheck="FALSE"); a LongInt temp (id="$7C585E10",
flags="nf_write", ti_may_be_in_reg, tt_persistent) with operand N; the
same temp again with operands N and D.]
     
      
   
    
     


Something similar is made for "mod" as well.  I have to ask though... is
it really necessary to check to see if the divisor is -1 and have a
distinct assignment for it?  It's a bit of a rare edge case that usually
just slows things down since it tends to add a comparison and a
conditional jump to the final assembly language.  Is there some
anomalous behaviour to a processor's division routine if the divisor is -1?

At the very least, would it be possible to remove the conditional check
when compiling under -Os?

(I intend to see if it's possible to merge "N div D" and "N mod D" on
x86, and possibly other processors that have a combined DIV/MOD operator).


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Division nodes

2023-05-11 Thread J. Gareth Moreton via fpc-devel

It does seem odd.  In a practical sense, the only time I can see -1 
being a common input among other random numbers is if it's an error 
value, in which case you would most likely do special handling rather 
than pass it through a division operation.  With the slowdown that comes 
from additional branch prediction, it just seems like unnecessary fluff, 
but I need to double-check to see if there's a very good reason behind 
their generation (if it's a platform-specific problem, it should be 
moved to that platform's specific first pass).  Now I just need to find 
out where those nodes are generated - they're proving elusive!


Note that using constant divisors uses a different optimisation, so this 
only applies to variable divisors.


Kit

On 11/05/2023 12:07, Stefan Glienke via fpc-devel wrote:

Looks like a rather disadvantageous way to avoid the idiv instruction because x 
div -1 = -x and x mod -1 = 0.

I ran a quick benchmark doing a lot of integer divisions where sometimes (randomly) the 
divisor was -1. When the occurrence was rare enough (~5%), performance was not 
impacted; the more often -1 occurred, the slower it became, down to almost half as fast. 
Only when fewer than 5% of the divisors were *not* -1 was the performance better, up to 
twice as fast when all divisors were -1. Of course YMMV, as it depends on the CPU and the 
branch predictor behavior, but it shows that this "optimization" is hardly any 
good.

I cannot think of a realistic case where 95% of your divisors are -1 and you 
really need to save those few extra cycles of calling idiv.
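
A benchmark of this kind could be sketched as follows. This is not the benchmark that was actually run; the array size, the number of passes, the percentages and all names (FillDivisors, TimeDivisions) are assumptions chosen for illustration, and the timings will vary with the CPU and its branch predictor.

```pascal
program DivBenchSketch;
{$mode objfpc}

uses
  SysUtils;  { GetTickCount64 }

const
  Count  = 1000000;  { divisors per pass; arbitrary }
  Passes = 20;       { passes over the array; arbitrary }

var
  Divisors: array of LongInt;

{ Fill the array so that roughly PercentMinusOne % of the entries are -1. }
procedure FillDivisors(PercentMinusOne: LongInt);
var
  I: LongInt;
begin
  SetLength(Divisors, Count);
  for I := 0 to Count - 1 do
    if Random(100) < PercentMinusOne then
      Divisors[I] := -1
    else
      Divisors[I] := Random(1000) + 2;  { never 0 and never -1 }
end;

{ Time Passes * Count divisions by a variable divisor.
  The checksum keeps the divisions from being optimised away. }
function TimeDivisions(out Checksum: Int64): QWord;
var
  P, I: LongInt;
  Start: QWord;
begin
  Checksum := 0;
  Start := GetTickCount64;
  for P := 1 to Passes do
    for I := 0 to Count - 1 do
      Checksum := Checksum + (I + 12345) div Divisors[I];
  Result := GetTickCount64 - Start;
end;

const
  Percentages: array[0..4] of LongInt = (0, 5, 50, 95, 100);

var
  K: LongInt;
  Sum: Int64;
  Elapsed: QWord;
begin
  RandSeed := 1;  { fixed seed so that runs are comparable }
  for K := Low(Percentages) to High(Percentages) do
  begin
    FillDivisors(Percentages[K]);
    Elapsed := TimeDivisions(Sum);
    WriteLn(Percentages[K]:3, '% of divisors are -1: ', Elapsed,
            ' ms (checksum ', Sum, ')');
  end;
end.
```

Because the divisor is read from an array, the compiler cannot treat it as a constant, so the variable-divisor code path is the one being timed.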


On 11/05/2023 11:04 CEST J. Gareth Moreton via fpc-devel wrote:

  
Hi everyone,


I need to ask a question about how division nodes are set up (I'm
looking at possible optimisation techniques).  I've written the
following procedure:

procedure DoDivMod(N, D: Integer; out Q, R: Integer);
begin
    Q := N div D;
    R := N mod D;
end;

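Fairly simple and to the point.  However, even before the first node
pass, the following node tree is generated for an integer division
operation:

[Node tree dump: the XML tags were lost in archiving.  The surviving
values show a comparison of D against -1, followed by two assignments
to a persistent LongInt temp (ti_may_be_in_reg, tt_persistent): one
built from N alone, and one built from N and D.]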


Something similar is made for "mod" as well.  I have to ask though... is
it really necessary to check to see if the divisor is -1 and have a
distinct assignment for it?  It's a bit of a rare edge case that usually
just slows things down since it tends to add a comparison and a
conditional jump to the final assembly language.  Is there some
anomalous behaviour to a processor's division routine if the divisor is -1?

At the very least, would it be possible to remove the conditional check
when compiling under -Os?

(I intend to see if it's possible to merge "N div D" and "N mod D" on
x86, and possibly other processors that have a combined DIV/MOD operator).

Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


[fpc-devel] Division nodes

2023-05-11 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I need to ask a question about how division nodes are set up (I'm 
looking at possible optimisation techniques).  I've written the 
following procedure:


procedure DoDivMod(N, D: Integer; out Q, R: Integer);
begin
  Q := N div D;
  R := N mod D;
end;

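Fairly simple and to the point.  However, even before the first node 
pass, the following node tree is generated for an integer division 
operation:

[Node tree dump: the XML tags were lost in archiving.  The surviving 
values show a comparison of D against -1, followed by two assignments 
to a persistent LongInt temp (id $7C585E10, ti_may_be_in_reg, 
tt_persistent): one built from N alone, and one built from N and D.]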


Something similar is made for "mod" as well.  I have to ask though... is 
it really necessary to check to see if the divisor is -1 and have a 
distinct assignment for it?  It's a bit of a rare edge case that usually 
just slows things down since it tends to add a comparison and a 
conditional jump to the final assembly language.  Is there some 
anomalous behaviour to a processor's division routine if the divisor is -1?
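
Read at source level, the generated tree appears to amount to something like the following. This is only a sketch based on the description in this message: DivWithCheck and ModWithCheck are hypothetical names, and the exact shape of the "mod" variant is an assumption, since all that is known is that "something similar" is generated for it.

```pascal
program DivMinusOneSketch;
{$mode objfpc}

{ Hypothetical source-level reading of the tree built for "N div D". }
function DivWithCheck(N, D: LongInt): LongInt;
begin
  if D = -1 then
    Result := -N          { special case: negate instead of dividing }
  else
    Result := N div D;    { general case: ordinary division }
end;

{ Assumed equivalent for "N mod D": any integer modulo -1 is 0. }
function ModWithCheck(N, D: LongInt): LongInt;
begin
  if D = -1 then
    Result := 0
  else
    Result := N mod D;
end;

begin
  WriteLn(DivWithCheck(7, -1));  { -7 }
  WriteLn(DivWithCheck(7, 2));   { 3 }
  WriteLn(ModWithCheck(7, -1));  { 0 }
  WriteLn(ModWithCheck(7, 2));   { 1 }
end.
```

One known corner case on x86 is Low(LongInt) div -1, whose mathematical result does not fit in a LongInt and makes the IDIV instruction raise a divide error. Whether that is the reason for the check is not confirmed here.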


At the very least, would it be possible to remove the conditional check 
when compiling under -Os?


(I intend to see if it's possible to merge "N div D" and "N mod D" on 
x86, and possibly other processors that have a combined DIV/MOD operator).


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Curious about the effect of all the new optimizations....

2023-03-01 Thread J. Gareth Moreton via fpc-devel


On 01/03/2023 13:10, Sven Barth wrote:

It's a German proverb: "Mühsam ernährt sich das Eichhörnchen" (roughly, "the squirrel feeds itself with great effort")

Regards,
Sven


Thanks Sven!

Kit


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Curious about the effect of all the new optimizations....

2023-03-01 Thread J. Gareth Moreton via fpc-devel


On 01/03/2023 13:11, Martin Frb via fpc-devel wrote:
Hence testing back to 3.2.3 (unfortunately 3.2.2 has a bug that 
matters in this code).


Also, I didn't expect any huge diffs, just wanted to see if anything 
can be noted at all. (and if lucky, in that test I run)


I did a test on a more limited scope (testing only a handful of 
functions). That test runs 4 min 20 sec under 3.2.3, 
and 2 extra seconds with 3.3.1.  But then I only had 2 sample runs for 
each fpc version.


2 seconds out of 4:20 is not conclusive unfortunately, unless you're 
able to exactly control the machine state each time, which is next to 
impossible in the modern day.  I am curious about the slowdown though, even 
if it is very slight.


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Curious about the effect of all the new optimizations....

2023-03-01 Thread J. Gareth Moreton via fpc-devel

My peephole optimisations mostly save only a handful of cycles each time, 
which probably won't add up to much for a relatively short test.  The 
most major optimisation I can think of, although I'm not quite sure when 
it was merged, is the method of replacing divisions by a constant with 
an equivalent reciprocal multiplication.  You'll see the biggest savings 
there.  There are other difficulties too: processors being intelligent 
with caching and out-of-order execution, for example, can disguise 
some inefficiencies.  And some optimisations seek only to reduce code size 
with no loss of speed.
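
As an illustration of the reciprocal-multiplication idea, here is a minimal sketch for unsigned 32-bit division by 10. It shows the general technique, not necessarily the exact code the compiler emits; DivBy10 is a hypothetical name, and the constant $CCCCCCCD is ceil(2^35 / 10).

```pascal
program ReciprocalDivSketch;
{$mode objfpc}

{ Unsigned 32-bit division by 10 without a division instruction:
  X div 10 = (X * $CCCCCCCD) shr 35 for every 32-bit X.
  The product is computed in 64 bits and cannot overflow a QWord. }
function DivBy10(X: DWord): DWord;
begin
  Result := DWord((QWord(X) * QWord($CCCCCCCD)) shr 35);
end;

var
  V: QWord;
  Mismatches: LongInt;
begin
  Mismatches := 0;

  { Compare against the built-in "div" on a spread of sample values. }
  V := 0;
  while V <= High(DWord) do
  begin
    if DivBy10(DWord(V)) <> DWord(V) div 10 then
      Inc(Mismatches);
    Inc(V, 65521);  { arbitrary stride to keep the run short }
  end;

  { Also check the largest 32-bit value explicitly. }
  if DivBy10(High(DWord)) <> High(DWord) div 10 then
    Inc(Mismatches);

  WriteLn('mismatches: ', Mismatches);  { mismatches: 0 }
end.
```

A multiply and a shift are typically much cheaper than a hardware division, which is why this transformation tends to give larger savings than individual peephole changes.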


What are your timings like when compiling with COREAVX or COREAVX2?  A 
couple of recent peephole optimizations make use of BMI1 and BMI2.


I can't remember the proverb that Florian used, but it essentially boils 
down to very small changes, individually not amounting to much, but 
which accumulate into a noticeable difference when in large numbers.


Kit

On 01/03/2023 10:32, Martin Frb via fpc-devel wrote:
So for a while now fpc 3.3.1 receives new optimizations => which is 
great / big fan of it.


And hence I thought, let's see how much of an impact they have. And in 
my test, they had none :(

Wondering if anyone else has measured them?

My tests:
Win-10 64 bit
3.3.1  905c485ff413cd48f98891e2075c814759d0c6f1
3.2.3  2022-02-04
both compilers with each O2 and O4

Using the testcase for FpDebug (which runs a decent spread of code).
Testcase with O2 and O3

And I got no noticeable difference.
I also tried {$CodeAlign proc=32 loop=32} for O2 (test and fpc), also 
no diff.



O2 / fpc: o2 323
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.406
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.063
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.609
O2 / fpc: o2 331
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.251
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.031
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  21.531


O3 / fpc: o2 323
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.687
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.281
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.469
O3 / fpc: o2 331
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  23.203
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.250
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.140


O3 / fpc: o4 323
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  23.063
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.250
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.875
O3 / fpc: o4 331
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.577
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.094
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.235


{$CodeAlign proc=32 loop=32}
O2 / fpc: def 323
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.453
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.328
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.656
O2 / fpc: def 331
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.079
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  22.234
TestWatchesValue_fpc 264_Dwarf_32Bit_FpDebug   :  21.984
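
To put "no noticeable difference" into numbers, the mean and the run-to-run spread of one pair of groups can be compared. A small sketch using the first two groups listed above; the group labels assume that "323" and "331" stand for fpc 3.2.3 and 3.3.1, and Report is a hypothetical helper.

```pascal
program TimingSpreadSketch;
{$mode objfpc}

{ Print the mean and the max-min spread of a set of timings (seconds). }
procedure Report(const Name: string; const T: array of Double);
var
  I: LongInt;
  Sum, MinT, MaxT: Double;
begin
  Sum := 0;
  MinT := T[0];
  MaxT := T[0];
  for I := 0 to High(T) do
  begin
    Sum := Sum + T[I];
    if T[I] < MinT then MinT := T[I];
    if T[I] > MaxT then MaxT := T[I];
  end;
  WriteLn(Name, ': mean ', (Sum / Length(T)):0:3,
          ' s, spread ', (MaxT - MinT):0:3, ' s');
end;

begin
  { Timings copied from the "O2 / fpc: o2" groups above. }
  Report('O2 / o2, 3.2.3', [22.406, 22.063, 22.609]);
  Report('O2 / o2, 3.3.1', [22.251, 22.031, 21.531]);
end.
```

For these two groups the difference between the means is smaller than the spread within either group, so three runs per configuration cannot separate the compilers.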

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] Unexpected "Range check error while evaluating constants" when compiling for Win64

2023-02-12 Thread J. Gareth Moreton via fpc-devel

Yeah, of course, since LongInt($80000001), before typecasting to HKEY, 
is specifically a signed constant.  So obvious!!


Kit

On 12/02/2023 20:43, Bart via fpc-devel wrote:

On Sun, Feb 12, 2023 at 6:26 PM J. Gareth Moreton via fpc-devel wrote:

If HKey is unsigned, then yes, the definition should be
HKEY(DWORD($80000001)).  I do wonder why Win64 produces that particular
error though because the final destination is an unsigned 64-bit integer.

Because QWord(some not so big negative number) = some large positive number.
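
The effect can be reproduced with plain integer types. A minimal sketch, using ordinary variables instead of the real HKEY declarations; the variable names are made up for illustration.

```pascal
program SignExtendSketch;
{$mode objfpc}

var
  AsSigned: LongInt;
  AsUnsigned: DWord;
begin
  { The same 32-bit pattern, once as a signed and once as an unsigned value. }
  AsSigned   := LongInt($80000001);
  AsUnsigned := DWord($80000001);

  WriteLn(AsSigned);                  { -2147483647 }

  { Widening the signed value to 64 bits sign-extends it; reinterpreting
    the result as unsigned gives a huge number. }
  WriteLn(QWord(Int64(AsSigned)));    { 18446744071562067969 }

  { Widening the unsigned value zero-extends it. }
  WriteLn(QWord(AsUnsigned));         { 2147483649 }
end.
```

The middle value is the one quoted in the Win64 error message, which is why going through DWORD instead of LongInt keeps the constant inside the expected range.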


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Unexpected "Range check error while evaluating constants" when compiling for Win64

2023-02-12 Thread J. Gareth Moreton via fpc-devel

If HKey is unsigned, then yes, the definition should be 
HKEY(DWORD($80000001)).  I do wonder why Win64 produces that particular 
error though because the final destination is an unsigned 64-bit integer.


Kit

On 12/02/2023 17:17, Bart via fpc-devel wrote:

Hi,

This code compiles happily for Win32, but refuses to compile for Win64:

program test;
{$mode objfpc}
{$h+}

uses
  Registry;

type
  TA = class
  private
    FRootKey: HKey;
  public
    // Win64/X86_64 Error: Range check error while evaluating constants
    // (18446744071562067969 must be between -2147483648 and 4294967295)
    property RootKey: HKey read FRootKey write FRootKey
      default HKEY_CURRENT_USER;  // -2147483647
  end;

begin
end.
===

Is this "by design" or is it a bug?

On Windows HKEY_CURRENT_USER is defined as HKEY(longint($80000001));
HKey is defined as HANDLE = System.THandle = QWord on 64-bit, but it
is DWord on 32-bit.
So in fact the value of HKEY_CURRENT_USER would be 2147483649 (as a
DWord) on 32-bit, and 18446744071562067969 (as a QWord) on 64-bit?

Shouldn't HKEY_CURRENT_USER et al. be defined as HKEY(DWORD(somevalue)) instead?

Tested with fpc 3.2.2 and fpc main 3.3.1-2495-g6453af40d8

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

[fpc-devel] Fixing bugs

2023-02-02 Thread J. Gareth Moreton via fpc-devel


Hi everyone,

I've just made an update to 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/366, the 
request that fixes i40111, removing the band-aid in aoptx86 and 
hopefully still fixing the original bug.  Can everyone confirm that 
i386-linux no longer crashes?


In the meantime I'm just testing a fix to i40129 and will let you know 
when it's ready.


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Incorrect hint (5023) "unit not used", if unit is only used in a conditional compiler expression (like: {$IF ..})

2023-01-13 Thread J. Gareth Moreton via fpc-devel

In my opinion, yes, report this as a bug.  Sure, it's what I'd consider 
"low priority" since it's just an incorrect informative hint rather than 
something critical, but it's a bug nonetheless.


Kit

On 13/01/2023 11:54, Bart via fpc-devel wrote:

Consider the following program:
===
program test;

uses
   Version;

begin
   {$if TheVersion >= 1}
   writeln('Version 1 or higher');
   {$else}
   writeln('Version < 1');
   {$endif}
end.
===
unit version;

interface
const
   TheVersion = 1;

implementation

end.
===
Compile with -vh
You get the hint:  Unit "version" not used in test

This is obviously not true, without the unit version, the program won't compile.

Should I report this as a bug?


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

[fpc-devel] Happy New Year!

2022-12-31 Thread J. Gareth Moreton via fpc-devel


Happy New Year everybody!  Free Pascal lives on!

Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Progress on pure functions

2022-12-15 Thread J. Gareth Moreton via fpc-devel


The field sharing refers to this:

"What I mean is that if a function is marked as "pure" or "inline" (or 
both), only one copy of the unoptimised node tree is stored in the 
"inlininginfo" field, and both "pass1_pure" and "pass1_inline" duplicate 
this tree and transform it as needed. Because only the unoptimised tree 
is stored, I felt there was no need to store this twice (doing so would 
also increase the size of PPU files)."


This copy of the tree is stored for inline functions before the first 
pass is executed.  Pure functions copy and analyse the same tree.  If a 
function is both pure and inline, these initial trees will always be 
identical, so it is redundant to store two copies.


Kit


On 16/12/2022 06:44, Sven Barth wrote:

Am 16.12.2022 um 02:02 schrieb J. Gareth Moreton via fpc-devel:
The purity analysis process is very dependent on the node tree being 
as clean as possible, and so depends on a fair few merge requests 
that have not yet been approved.  I'm guessing Florian and Jonas and 
others are somewhat busy, what with being December and all.


- https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/232 - 
strips unnecessary typeconv nodes (this helps a lot with constant 
propagation).
- https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/342 - 
strips excess nothing nodes.


I didn't say that you shouldn't clean up the tree for your purity 
analysis (that sounds so wrong :P ), I simply asked what you meant 
with “share the same field” and if it is what I think it is then it's 
a bad idea and you shouldn't “share the same field” but introduce your 
own.


Regards,
Sven


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
