Re: [fpc-devel] LEA instruction speed

J. Gareth Moreton via fpc-devel Sat, 07 Oct 2023 11:26:22 -0700

I'm still slightly curious, but if full optimisations make better code, then indeed it's probably not worth the effort.

Your timings are incredibly helpful - thank you! If I understand, AMD A9 is the Excavator architecture, which implies that AMD processors don't suffer from the same latency with complex LEA instructions as Intel processors do.

Looking at Agner Fog's tables, it looks like the slow LEA instructions only came about at Sandy Bridge, which for Free Pascal I think lines up with COREAVX. Even the Pentium-era processors have a 1-cycle LEA, and your testing on an AMD 486 shows it is at least as fast as two ADDs in a dependency chain. That should be all the information I need - thanks again!
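
For anyone following along, here is a minimal C sketch of the two forms being compared. It is not the attached test program (that used hand-written assembler), and the function names are hypothetical. A C compiler is free to emit either instruction sequence for both functions, so this only illustrates the data dependency and says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value a three-operand LEA such as
   "lea eax, [ebx + ecx + 8]" produces, written as one expression. */
static uint32_t addr_single_expr(uint32_t base, uint32_t index, uint32_t disp)
{
    return base + index + disp;
}

/* Hypothetical helper: the same value built by two dependent additions,
   mirroring an ADD/ADD chain where the second ADD must wait for the
   result of the first. */
static uint32_t addr_two_adds(uint32_t base, uint32_t index, uint32_t disp)
{
    uint32_t t = base;
    t += index; /* first ADD */
    t += disp;  /* second ADD, depends on the first */
    return t;
}

/* Both forms wrap modulo 2^32, so they must always agree. */
static void check_forms_agree(uint32_t base, uint32_t index, uint32_t disp)
{
    assert(addr_single_expr(base, index, disp) ==
           addr_two_adds(base, index, disp));
}
```

Since both forms compute the same value, the only open question is which one a given processor executes faster, and that is what the ns/call figures quoted below measure.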


Kit

On 07/10/2023 19:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:
That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call
OK. My results for the AMD A9 CPU mentioned previously and 32-bit trunk compiler (Linux) are:
   Pascal control case: 2.3 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 1.5 ns/call
The same machine, the same operating environment, but a 64-bit trunk compiler:
   Pascal control case: 3.6 ns/call
 Using LEA instruction: 0.9 ns/call
Using ADD instructions: 1.3 ns/call
I tried compiling and running the test with all of FPC 2.0.4, 2.2.4, 2.4.4, 2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and realized that all results (for both the assembler and Pascal versions) compiled with anything older than 3.2.2 are an order of magnitude faster than with 3.2.2 (i.e. less than 1 ns/call for the older versions compared to 8 ns/call with Pascal / 4 ns/call with assembler versions). This means that the comparison is obviously spoiled by something unrelated. Moreover, I noticed that when compiling with the highest level of optimizations, the Pascal version compiled for i386 is as fast as or even a little bit faster than the assembler version. I didn't do that previously, thus the longer time for the older compiler version probably isn't relevant. From this point of view, it probably doesn't make sense to spend time on comparing the generated code.
Tomas
On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,
Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.
OK, I tried it now. First of all, after turning on the old machine, I realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad memory. :-( I compiled and ran the test under OS/2 there (I was too lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any substantial difference. The ADD and LEA results were basically the same there, both around 100 ns / call. The Pascal result was around twice as long. Interestingly, the Pascal result for FPC 3.2.2 was around 10% longer than the same source compiled with FPC 2.0.3 (the assembler versions were obviously the same for both FPC versions; I tried compiling it also with FPC 1.0.10 and the assembler versions were more than three times slower due to missing support for the nostackframe directive).
I tested it under the AMD Athlon 1 GHz machine as well and again, the results for LEA and ADD are basically equal (both 3.1 ns/call) and the result for Pascal slightly more than twice (7.3 ns/call). However, rather surprisingly for me, the overall test run was _much_ longer there?! Finally, I tried compiling the test on a 64-bit machine (AMD A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the assembler version runs forever - well, certainly much longer than my patience lasts. I haven't tried to analyze the reasons, but that's what I get.
Tomas
On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" <[email protected]> wrote:
Hi Kit,
This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow? Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Ice Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.
I cannot answer your question, but if you prepare a test program, I can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines if it helps you in any way (at least I hope the 486 DX2 machine should be still able to start ;-) ).
Tomas

_______________________________________________
fpc-devel maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel