Re: [fpc-devel] LEA instruction speed
I should have figured. Thank you!

Kit

On 27/10/2023 01:51, Nikolay Nikolov via fpc-devel wrote:

On 10/11/23 11:21, Tomas Hajny via fpc-devel wrote:

On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you. Would you be willing to share your modified test's source? I was worried that if CPUID wasn't present it would cause a SIGILL.

Sure, attached, but I didn't do anything special - I modified it in a way allowing easy disabling of this detection for x86 by disabling definition of a conditional symbol added to the source and I was prepared to recompile with the functionality disabled on the old AMD DX4 if needed. However, I didn't need to do so - the AMD DX4 machine simply ignored it and chose the branch used in case of missing support for the particular CPUID function. I have no idea if this might be due to some protection in OS/2 Warp 4 (used for compiling and running the test on that machine) potentially masking that exception, or what was the reason.

Apparently, it should be possible to detect CPUID availability (albeit not 100% reliably), see https://wiki.osdev.org/CPUID, but I didn't use that.

There's CPUID support detection code in the Free Pascal RTL for i8086 and i386. It's in unit cpu:

function cpuid_support: boolean;

Nikolay

Tomas

On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog's instruction tables are correct. Also, results for older processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated. Thanks again everyone for the results you're giving.

Alright, fine (I modified your test to include the CPU name as well if possible and added an IFDEFed distinction of 32-bits versus 64-bits):

32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call

32-bits:
CPU = AMD Athlon(tm) Processor
--------------------------
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call

32-bits: (AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
On 10/11/23 11:21, Tomas Hajny via fpc-devel wrote:

On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you. Would you be willing to share your modified test's source? I was worried that if CPUID wasn't present it would cause a SIGILL.

Sure, attached, but I didn't do anything special - I modified it in a way allowing easy disabling of this detection for x86 by disabling definition of a conditional symbol added to the source and I was prepared to recompile with the functionality disabled on the old AMD DX4 if needed. However, I didn't need to do so - the AMD DX4 machine simply ignored it and chose the branch used in case of missing support for the particular CPUID function. I have no idea if this might be due to some protection in OS/2 Warp 4 (used for compiling and running the test on that machine) potentially masking that exception, or what was the reason.

Apparently, it should be possible to detect CPUID availability (albeit not 100% reliably), see https://wiki.osdev.org/CPUID, but I didn't use that.

There's CPUID support detection code in the Free Pascal RTL for i8086 and i386. It's in unit cpu:

function cpuid_support: boolean;

Nikolay

Tomas

On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog's instruction tables are correct. Also, results for older processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated. Thanks again everyone for the results you're giving.

Alright, fine (I modified your test to include the CPU name as well if possible and added an IFDEFed distinction of 32-bits versus 64-bits):

32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call

32-bits:
CPU = AMD Athlon(tm) Processor
--------------------------
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call

32-bits: (AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas
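Nikolay's pointer to the RTL can be sketched as a guard around the brand-string query. This is only a minimal sketch: it assumes the `cpu` unit exposes `cpuid_support` on i386 exactly as quoted above, and it assumes CPUID can be treated as always present on x86_64. `CanUseCPUID` is a hypothetical helper name, not something from the thread or the RTL.

```pascal
program cpuidcheck;

{$MODE OBJFPC}

uses
{$ifdef CPUI386}
  cpu,        { RTL unit that, per the message above, provides cpuid_support }
{$endif CPUI386}
  SysUtils;

{ Hypothetical helper: True when executing CPUID is expected to be safe. }
function CanUseCPUID: Boolean;
begin
{$if defined(CPUI386)}
  Result := cpuid_support;   { RTL check for CPUID availability }
{$elseif defined(CPUX86_64)}
  Result := True;            { assumption: CPUID is part of the x86-64 baseline }
{$else}
  Result := False;           { not an x86 target }
{$endif}
end;

begin
  if CanUseCPUID then
    WriteLn('CPUID available; the brand-string query can be attempted')
  else
    WriteLn('CPUID unavailable; skipping CPU name detection');
end.
```

A test such as "blea" could call a guard like this before `FillBrandName`, instead of relying on the operating system to tolerate an unsupported instruction.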
Re: [fpc-devel] LEA instruction speed
It was a thought that crossed my mind when Stefan pointed out the translated Google Benchmark, but given that it hasn't yet been adapted to work outside of i386 and x86_64, you are right that it probably shouldn't be used for the time being. The framework uses CPU timings to decide how many iterations to run, so obtaining such metrics is essential for its operation. While I can probably get it to work for aarch64-linux, I don't know the first thing about polling CPUs on platforms I don't have access to! It was a nice experiment in the meantime.

In regards to "blea" being in the test suite, I haven't yet put it into the normal test suite (using an include wrapper like I did with "tests/bench/bcase.pp") since it's primarily to evaluate LEA timings rather than testing compiler efficiency. It's more of a 'utility' test. The feedback from others has proven useful in determining the correctness of the new optimisation hint, which I intend to use to make the i386/x86_64 peephole optimizer smarter in regards to using LEA statements.

Kit

On 13/10/2023 16:36, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 17:08, J. Gareth Moreton via fpc-devel wrote:

Interesting! That's a bug report to send to the maintainers of the framework. I'll need to have them fix it before I'd be willing to try again with its use in FPC.

Removed the reference. Apologies - I'm rushing a bit.

BTW, it's IMHO questionable whether a benchmark framework restricted to just a subset of targets supported for the given architecture should be really used within FPC source codes (if that's your potential intention anyway). If it was intended for the testsuite, it would immediately fail at compile time when checking the testsuite under some other target, because the test doesn't specify that its use should be restricted to certain targets (it's only restricted for i386 and x86_64 regardless of the operating system - that's fine for the original version, but not for the benchmark framework).

Tomas
Re: [fpc-devel] LEA instruction speed
On 2023-10-13 17:08, J. Gareth Moreton via fpc-devel wrote:

Interesting! That's a bug report to send to the maintainers of the framework. I'll need to have them fix it before I'd be willing to try again with its use in FPC.

Removed the reference. Apologies - I'm rushing a bit.

BTW, it's IMHO questionable whether a benchmark framework restricted to just a subset of targets supported for the given architecture should be really used within FPC source codes (if that's your potential intention anyway). If it was intended for the testsuite, it would immediately fail at compile time when checking the testsuite under some other target, because the test doesn't specify that its use should be restricted to certain targets (it's only restricted for i386 and x86_64 regardless of the operating system - that's fine for the original version, but not for the benchmark framework).

Tomas
Re: [fpc-devel] LEA instruction speed
This one's for you Stefan!

https://github.com/spring4d/benchmark/issues/4

Kit

On 13/10/2023 16:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd. Nevertheless, because of such teething problems with the framework, I'm going to remove it from "blea" for now.

As it stands, please find attached the test that appears in the merge request: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502

The attached version still contained reference to the framework.

The problem with 32-bit compilation of the framework was due to a missing stdcall calling convention in the GetLogicalProcessorInformation declaration.

Tomas
Re: [fpc-devel] LEA instruction speed
Interesting! That's a bug report to send to the maintainers of the framework. I'll need to have them fix it before I'd be willing to try again with its use in FPC.

Removed the reference. Apologies - I'm rushing a bit.

Kit

On 13/10/2023 16:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd. Nevertheless, because of such teething problems with the framework, I'm going to remove it from "blea" for now.

As it stands, please find attached the test that appears in the merge request: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502

The attached version still contained reference to the framework.

The problem with 32-bit compilation of the framework was due to a missing stdcall calling convention in the GetLogicalProcessorInformation declaration.

Tomas

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}
{$DEFINE DETECTCPU}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
{ Copies the 48-byte CPUID brand string into CPUName; returns False if
  the extended brand-string leaves are not reported as available }
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV EAX, $80000000
  CPUID
  CMP EAX, $80000004
  JB @Unavailable
{$ifdef CPUX86_64}
  LEA R8, [RIP + CPUName]
{$endif CPUX86_64}
  MOV EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV [R8], EAX
  MOV [R8 + 4], EBX
  MOV [R8 + 8], ECX
  MOV [R8 + 12], EDX
{$else CPUX86_64}
  MOV [CPUName], EAX
  MOV [CPUName + 4], EBX
  MOV [CPUName + 8], ECX
  MOV [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 16], EAX
  MOV [R8 + 20], EBX
  MOV [R8 + 24], ECX
  MOV [R8 + 28], EDX
{$else CPUX86_64}
  MOV [CPUName + 16], EAX
  MOV [CPUName + 20], EBX
  MOV [CPUName + 24], ECX
  MOV [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 32], EAX
  MOV [R8 + 36], EBX
  MOV [R8 + 40], ECX
  MOV [R8 + 44], EDX
  MOV BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV [CPUName + 32], EAX
  MOV [CPUName + 36], EBX
  MOV [CPUName + 40], ECX
  MOV [CPUName + 44], EDX
  MOV BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV AL, 1
  JMP @ExitBrand
@Unavailable:
  XOR AL, AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP RBX
{$else CPUX86_64}
  POP EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10);
  { elapsed days -> seconds -> nanoseconds, per loop pass, per call }
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$ifdef CPUX86_64}
  Write('64-bit');
{$else CPUX86_64}
  Write('32-bit');
{$endif CPUX86_64}
  if FillBrandName then
  begin
    WriteLn(' CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);
  FailureCode := 0;
  if (Results[0] <> Results[1]) then
  begin
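The displacement in Checksum_LEA is written as a negative decimal because an x86 displacement is a signed 32-bit value, so $87654321 has to be expressed as -2023406815. The following pure-Pascal sketch only illustrates that equivalence; `StepAdd` and `StepLea` are hypothetical names that mirror one loop step of the two assembly routines, without the XOR.

```pascal
program leaconst;

{$MODE OBJFPC}
{$Q-}{$R-}   { wrap-around 32-bit arithmetic is intended here }

const
  UnsignedForm: LongWord = $87654321;   { constant as used by ADD }
  SignedForm: LongInt = -2023406815;    { same bit pattern as a signed displacement }

{ One ADD-style step: Input + X + constant, modulo 2^32 }
function StepAdd(Input, X: LongWord): LongWord;
begin
  Result := Input + X + UnsignedForm;
end;

{ One LEA-style step: Input + X + signed displacement, modulo 2^32 }
function StepLea(Input, X: LongWord): LongWord;
begin
  Result := LongWord(LongInt(Input + X) + SignedForm);
end;

begin
  WriteLn(LongInt(UnsignedForm));        { prints -2023406815 }
  if StepAdd(500, 1000) = StepLea(500, 1000) then
    WriteLn('match')
  else
    WriteLn('mismatch');
end.
```

Because both forms agree modulo 2^32, the three checksum routines can be compared result for result, which is what the comparison of `Results[0]` with `Results[1]` at the end of the attachment relies on.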
Re: [fpc-devel] LEA instruction speed
On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd. Nevertheless, because of such teething problems with the framework, I'm going to remove it from "blea" for now.

As it stands, please find attached the test that appears in the merge request: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502

The attached version still contained reference to the framework.

The problem with 32-bit compilation of the framework was due to a missing stdcall calling convention in the GetLogicalProcessorInformation declaration.

Tomas
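To illustrate the diagnosis above: on 32-bit Windows the Win32 API expects stdcall, whereas Free Pascal's default convention on i386 passes arguments in registers, so an import declared without stdcall hands the API garbage. The sketch below is an assumption-laden illustration, not the framework's actual declaration: the function is redeclared locally with an untyped `Pointer` buffer, and the usual two-call pattern (probe for the size, then fetch) is shown.

```pascal
program glpi;

{$MODE OBJFPC}

{$ifdef WINDOWS}
uses
  Windows;

{ Local import for illustration; the explicit stdcall is the point of the sketch }
function GetLogicalProcessorInformation(Buffer: Pointer;
  ReturnedLength: PDWORD): BOOL; stdcall;
  external 'kernel32' name 'GetLogicalProcessorInformation';

var
  Needed, Err: DWORD;
  Buf: array of Byte;
begin
  Needed := 0;
  { First call without a buffer: expected to fail and report the size needed }
  if not GetLogicalProcessorInformation(nil, @Needed) then
  begin
    Err := GetLastError;   { capture before doing any other I/O }
    WriteLn('Probe returned error ', Err, ', bytes needed: ', Needed);
  end;
  SetLength(Buf, Needed);
  if (Needed > 0) and GetLogicalProcessorInformation(@Buf[0], @Needed) then
    WriteLn('Retrieved ', Needed, ' bytes of processor information')
  else
  begin
    Err := GetLastError;
    WriteLn('Call failed, error ', Err);
  end;
end.
{$else}
begin
  WriteLn('GetLogicalProcessorInformation is a Windows-only API');
end.
{$endif}
```

On 64-bit Windows there is a single calling convention, which would explain why the missing stdcall only showed up when compiling for 32 bits.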
Re: [fpc-devel] LEA instruction speed
GetLogicalProcessorInformation returns a Boolean - if false, an error occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd. Nevertheless, because of such teething problems with the framework, I'm going to remove it from "blea" for now.

As it stands, please find attached the test that appears in the merge request: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502

Kit

On 13/10/2023 08:34, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update. . .

The latest version of blea.pp doesn't compile with a 32-bit compiler - line 76 contains an unconditional reference to the R8 register, which obviously doesn't exist in 32-bit mode.

BTW, the line shouldn't be necessary at all, because global variables should be initialized to 0 on program start anyway as far as I know.

When fixing the problem above, compiling to 32-bit mode and running it, the test fails with an error in GetLogicalProcessorInformation (it states "8" in place of the error information; I wonder if it isn't misinterpreted, because 8 is number of logical CPUs on the machine used for running the test).

Tomas

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}
{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV EAX, $80000000
  CPUID
  CMP EAX, $80000004
  JB @Unavailable
{$ifdef CPUX86_64}
  LEA R8, [RIP + CPUName]
{$endif CPUX86_64}
  MOV EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV [R8], EAX
  MOV [R8 + 4], EBX
  MOV [R8 + 8], ECX
  MOV [R8 + 12], EDX
{$else CPUX86_64}
  MOV [CPUName], EAX
  MOV [CPUName + 4], EBX
  MOV [CPUName + 8], ECX
  MOV [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 16], EAX
  MOV [R8 + 20], EBX
  MOV [R8 + 24], ECX
  MOV [R8 + 28], EDX
{$else CPUX86_64}
  MOV [CPUName + 16], EAX
  MOV [CPUName + 20], EBX
  MOV [CPUName + 24], ECX
  MOV [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 32], EAX
  MOV [R8 + 36], EBX
  MOV [R8 + 40], ECX
  MOV [R8 + 44], EDX
  MOV BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV [CPUName + 32], EAX
  MOV [CPUName + 36], EBX
  MOV [CPUName + 40], ECX
  MOV [CPUName + 44], EDX
  MOV BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV AL, 1
  JMP @ExitBrand
@Unavailable:
  XOR AL, AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP RBX
{$else CPUX86_64}
  POP EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10);
  { elapsed days -> seconds -> nanoseconds, per loop pass, per call }
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA
Re: [fpc-devel] LEA instruction speed
Oops - that was a silly mistake of mine with R8. As for the other error, that sounds like it's in the third party benchmark suite. I'll do some investigating on my virtual machine.

In the meantime, here's the fixed test with the stray R8 call properly filtered out on i386 (it's replaced with "CPUName" on 32-bit). I wasn't sure if global variables were initialised or not, hence me playing safe.

Kit

On 13/10/2023 08:34, Tomas Hajny via fpc-devel wrote:

On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update. . .

The latest version of blea.pp doesn't compile with a 32-bit compiler - line 76 contains an unconditional reference to the R8 register, which obviously doesn't exist in 32-bit mode.

BTW, the line shouldn't be necessary at all, because global variables should be initialized to 0 on program start anyway as far as I know.

When fixing the problem above, compiling to 32-bit mode and running it, the test fails with an error in GetLogicalProcessorInformation (it states "8" in place of the error information; I wonder if it isn't misinterpreted, because 8 is number of logical CPUs on the machine used for running the test).

Tomas

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}
{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV EAX, $80000000
  CPUID
  CMP EAX, $80000004
  JB @Unavailable
{$ifdef CPUX86_64}
  LEA R8, [RIP + CPUName]
{$endif CPUX86_64}
  MOV EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV [R8], EAX
  MOV [R8 + 4], EBX
  MOV [R8 + 8], ECX
  MOV [R8 + 12], EDX
{$else CPUX86_64}
  MOV [CPUName], EAX
  MOV [CPUName + 4], EBX
  MOV [CPUName + 8], ECX
  MOV [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 16], EAX
  MOV [R8 + 20], EBX
  MOV [R8 + 24], ECX
  MOV [R8 + 28], EDX
{$else CPUX86_64}
  MOV [CPUName + 16], EAX
  MOV [CPUName + 20], EBX
  MOV [CPUName + 24], ECX
  MOV [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV [R8 + 32], EAX
  MOV [R8 + 36], EBX
  MOV [R8 + 40], ECX
  MOV [R8 + 44], EDX
  MOV BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV [CPUName + 32], EAX
  MOV [CPUName + 36], EBX
  MOV [CPUName + 40], ECX
  MOV [CPUName + 44], EDX
  MOV BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV AL, 1
  JMP @ExitBrand
@Unavailable:
  XOR AL, AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP RBX
{$else CPUX86_64}
  POP EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] { -2023406815 = $87654321 }
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

const
  internal_reps = 1000;

procedure BM_Checksum_PAS(const State: TState);
var
  S: TState.TValue;
  Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_PAS(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_LEA(const State: TState);
var
  S: TState.TValue;
  Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_LEA(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_ADD(const State: TState);
var
  S: TState.TValue;
  Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_ADD(Z, X, internal_reps);
  end;
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$IFDEF CPUX86}
  WriteLn ('32 bits:');
{$ENDIF CPUX86}
{$IFDEF CPUX86_64}
  WriteLn ('64 bits:');
{$ENDIF CPUX86_64}
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
Re: [fpc-devel] LEA instruction speed
On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update. . .

The latest version of blea.pp doesn't compile with a 32-bit compiler - line 76 contains an unconditional reference to the R8 register, which obviously doesn't exist in 32-bit mode.

BTW, the line shouldn't be necessary at all, because global variables should be initialized to 0 on program start anyway as far as I know.

When fixing the problem above, compiling to 32-bit mode and running it, the test fails with an error in GetLogicalProcessorInformation (it states "8" in place of the error information; I wonder if it isn't misinterpreted, because 8 is number of logical CPUs on the machine used for running the test).

Tomas
Re: [fpc-devel] LEA instruction speed
On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update. . .

The latest version of blea.pp doesn't compile with a 32-bit compiler - line 76 contains an unconditional reference to the R8 register, which obviously doesn't exist in 32-bit mode.

Tomas
Re: [fpc-devel] LEA instruction speed
So an update. I've added Spring.Benchmark to "tests/bench/spring" on my local branch, along with its readme and licence file. It seems to work quite well even if it feels a bit like overkill for this small a benchmark. Still, I've attached the version with Stefan's translated Google Benchmark unit to see what people think.

A couple of things to note:

* Time metrics are now in thousands of nanoseconds because the 1,000 repetitions of the internal loop (used to drown out the overhead of the function call) are no longer divided out.
* Requires the fcl-base, rtl-objpas and regexpr packages.

I also made a mistake with the compiler flags. I had added CPUX86_HINT_FAST_3COMP_ADDR_16 to indicate that a LEA instruction with 16-bit operands is fast, since the timing is often different to the 32/64-bit versions. However, under i386 and x86_64, the assembler doesn't accept 16-bit operands! I have therefore removed it for i386 and x86_64, although I left it in for i8086 (even though it probably won't be used) because the Pentium 4 has a slow 16-bit LEA instruction.

However, the proposed COREX CPU option now has the exact same flags as ZEN3. Should it be removed, or kept for clarity and future expansion?

Kit

P.S. Sorry for the size of the ZIP.
Re: [fpc-devel] LEA instruction speed
On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you. Would you be willing to share your modified test's source? I was worried that if CPUID wasn't present it would cause a SIGILL.

Sure, attached, but I didn't do anything special - I modified it in a way allowing easy disabling of this detection for x86 by disabling definition of a conditional symbol added to the source and I was prepared to recompile with the functionality disabled on the old AMD DX4 if needed. However, I didn't need to do so - the AMD DX4 machine simply ignored it and chose the branch used in case of missing support for the particular CPUID function. I have no idea if this might be due to some protection in OS/2 Warp 4 (used for compiling and running the test on that machine) potentially masking that exception, or what was the reason.

Apparently, it should be possible to detect CPUID availability (albeit not 100% reliably), see https://wiki.osdev.org/CPUID, but I didn't use that.

Tomas

On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog's instruction tables are correct. Also, results for older processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated. Thanks again everyone for the results you're giving.

Alright, fine (I modified your test to include the CPU name as well if possible and added an IFDEFed distinction of 32-bits versus 64-bits):

32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
----------------------------------------------
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call

32-bits:
CPU = AMD Athlon(tm) Processor
--------------------------
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call

32-bits: (AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}
{$DEFINE DETECTCPU}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX
  MOV EAX, $80000000
  CPUID
  CMP EAX, $80000004
  JB @Unavailable
  LEA R8, [RIP + CPUName]
  MOV EAX, $80000002
  CPUID
  MOV [R8], EAX
  MOV [R8 + 4], EBX
  MOV [R8 + 8], ECX
  MOV [R8 + 12], EDX
  MOV EAX, $80000003
  CPUID
  MOV [R8 + 16], EAX
  MOV [R8 + 20], EBX
  MOV [R8 + 24], ECX
  MOV [R8 + 28], EDX
  MOV EAX, $80000004
  CPUID
  MOV [R8 + 32], EAX
  MOV [R8 + 36], EBX
  MOV [R8 + 40], ECX
  MOV [R8 + 44], EDX
  MOV BYTE PTR [R8 + 48], 0
  MOV AL, 1
  JMP @ExitBrand
@Unavailable:
  XOR AL, AL
@ExitBrand:
  POP RBX
end;
{$else CPUX86_64}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$IFDEF DETECTCPU}
  push ebx
  mov eax, $80000000
  cpuid
  cmp eax, $80000004
  jb @not_supported
  lea esi, CPUName
  mov eax, 80000002h
  cpuid
  mov [esi], eax
  mov [esi+4], ebx
  mov [esi+8], ecx
  mov [esi+12], edx
  mov eax, 80000003h
  cpuid
  mov [esi+16], eax
  mov [esi+20], ebx
  mov [esi+24], ecx
  mov [esi+28], edx
  mov eax, 80000004h
  cpuid
  mov [esi+32], eax
  mov [esi+36], ebx
  mov [esi+40], ecx
  mov [esi+44], edx
  mov eax, 1
  jmp @exit
@not_supported:
  xor eax, eax
@exit:
  pop ebx
{$ELSE DETECTCPU}
  xor eax, eax
{$ENDIF DETECTCPU}
end;
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV
Re: [fpc-devel] LEA instruction speed
The LEA and ADD times are close enough that I can consider them identical. And Braswell (the architecture behind that brand of Celeron) doesn't support AVX, I don't think, so that lines up with COREI having a fast LEA instruction but not COREAVX.

Given the many different x86-compatible CPUs, I wonder if we need to document the best compiler parameters for end users in some way (e.g. so that a device driver installer can pick the most optimised binary for a given CPU architecture).

Kit

On 11/10/2023 05:56, Christo Crause wrote:

> On Tue, Oct 10, 2023 at 11:13 AM J. Gareth Moreton via fpc-devel wrote:
>> Thanks Tomas,
>>
>> Nothing is broken, but the timing measurement isn't precise enough.
>>
>> Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.
>>
>> Kit
>
> Results on a modest CPU:
>
> CPU = Intel(R) Celeron(R) CPU N3050 @ 1.60GHz
> Pascal control case: 6.71 ns/call
> Using LEA instruction: 2.09 ns/call
> Using ADD instructions: 2.05 ns/call
>
> 32 bits:
> Pascal control case: 6.78 ns/call
> Using LEA instruction: 2.16 ns/call
> Using ADD instructions: 2.09 ns/call
>
> Results show a bit of variance, above numbers are more or less typical.
>
> Christo
Re: [fpc-devel] LEA instruction speed
On Tue, Oct 10, 2023 at 11:13 AM J. Gareth Moreton via fpc-devel wrote:

> Thanks Tomas,
>
> Nothing is broken, but the timing measurement isn't precise enough.
>
> Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.
>
> Kit

Results on a modest CPU:

CPU = Intel(R) Celeron(R) CPU N3050 @ 1.60GHz
Pascal control case: 6.71 ns/call
Using LEA instruction: 2.09 ns/call
Using ADD instructions: 2.05 ns/call

32 bits:
Pascal control case: 6.78 ns/call
Using LEA instruction: 2.16 ns/call
Using ADD instructions: 2.09 ns/call

Results show a bit of variance, above numbers are more or less typical.

Christo
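The run-time estimate quoted in this message can be checked with a little arithmetic. The sketch below is only an illustration: the function name EstimatedSeconds is hypothetical, and the figures of 5 and 10 cycles per loop pass are assumptions back-solved from the "five to ten minutes" estimate, not measurements of a real 386.

```pascal
program passcount;
{$MODE OBJFPC}

const
  ClockHz = 16.0e6;  { 16 MHz 386, the example machine named in the message }

{ Estimated run time in seconds for one benchmarked routine.
  CyclesPerPass is an assumed figure, not a measured one. }
function EstimatedSeconds(OuterIterations, InternalReps: QWord;
  CyclesPerPass: Double): Double;
begin
  Result := (OuterIterations * InternalReps) * CyclesPerPass / ClockHz;
end;

begin
  { 1,000,000 outer iterations times 1,000 internal repetitions }
  WriteLn('Total passes: ', QWord(1000000) * QWord(1000));
  WriteLn('At  5 cycles/pass: ', EstimatedSeconds(1000000, 1000, 5.0):0:1, ' s');
  WriteLn('At 10 cycles/pass: ', EstimatedSeconds(1000000, 1000, 10.0):0:1, ' s');
  { The 100,000 outer iterations proposed as a compromise }
  WriteLn('100,000 iterations at 10 cycles/pass: ',
    EstimatedSeconds(100000, 1000, 10.0):0:1, ' s');
end.
```

Under these assumptions the full count works out to roughly 5 to 10 minutes per routine, while 100,000 outer iterations would stay around a minute on the same hypothetical machine.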
Re: [fpc-devel] LEA instruction speed
Sweet, thank you. Would you be willing to share your modified test's source? I was worried that if CPUID wasn't present it would cause a SIGILL. Kit On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote: On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote: I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog''s instruction tables are correct. Also, results for older processors can be hard to come by sometimes. Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3). In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated. Thanks again everyone for the results you're giving. Alright, fine (I modified your test to include the CPU name as well if possible and added an IFDEFed distinction of 32-bits versus 64-bits): 32-bits: CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G - Pascal control case: 0.85 ns/call Using LEA instruction: 0.56 ns/call Using ADD instructions: 0.84 ns/call 64-bits: CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G - Pascal control case: 0.85 ns/call Using LEA instruction: 0.56 ns/call Using ADD instructions: 0.85 ns/call 32-bits: CPU = AMD Athlon(tm) Processor -- Pascal control case: 6.10 ns/call Using LEA instruction: 3.40 ns/call Using ADD instructions: 3.40 ns/call 32-bits: (AMD DX4 100 MHz - no CPUID name) Pascal control case: 123 ns/call Using LEA instruction: 72 ns/call Using ADD instructions: 73 ns/call Tomas On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote: On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote: Op 10-10-2023 om 11:13 schreef J. 
Gareth Moreton via fpc-devel: Thanks Tomas, Nothing is broken, but the timing measurement isn't precise enough. Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you. I had the same problem, and now it is stable Ryzen 5700X (ZEN3) Pascal control case: 0.7 ns/call Using LEA instruction: 0.4 ns/call Using ADD instructions: 0.7 ns/call Indeed, it's much more consistent now, attached a new log for both 32-bit and 64-bit versions from the Intel machine with Windows. Apparently, ADD is still somewhat faster on such "newer" Intel machines (at least if not considering the potential parallelism of LEA discussed previously). I can try this version on my AMD machines later tonight if considered useful - please, let me know which results would be relevant for you in that case (out of the ancient AMD DX4, only slightly less ancient AMD Athlon 1 GHz and the still rather reasonable AMD A9). Tomas ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
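On the SIGILL worry: one common way to avoid executing CPUID on a processor that lacks it is to test whether the ID flag (bit 21 of EFLAGS) can be toggled. The following is a minimal sketch of that technique for 32-bit x86, not the code used in the test; the name HasCPUID is hypothetical, and on x86_64 the check is assumed to be unnecessary because CPUID is always present there. As noted elsewhere in the thread, this kind of detection is not 100% reliable on every old CPU.

```pascal
program cpuidcheck;
{$MODE OBJFPC}
{$ASMMODE Intel}

{$ifdef CPUI386}
{ Returns True if bit 21 (ID) of EFLAGS can be toggled, which indicates
  that the CPUID instruction is available. Only EAX and ECX are used. }
function HasCPUID: Boolean; assembler; nostackframe;
asm
  pushfd                  { read the current EFLAGS into EAX }
  pop   eax
  mov   ecx, eax          { keep the original value in ECX }
  xor   eax, $00200000    { flip the ID bit }
  push  eax
  popfd                   { try to write the modified flags back }
  pushfd
  pop   eax               { read EFLAGS again }
  push  ecx
  popfd                   { restore the original EFLAGS }
  xor   eax, ecx          { bit 21 is set only if the flip was accepted }
  shr   eax, 21
  and   eax, 1            { AL = 1 if CPUID is available, otherwise 0 }
end;
{$else}
{ Assumption: every x86_64 processor implements CPUID. }
function HasCPUID: Boolean;
begin
  Result := True;
end;
{$endif}

begin
  if HasCPUID then
    WriteLn('CPUID is available')
  else
    WriteLn('CPUID is not available; skip the brand-name query');
end.
```

A guard like this could be called before FillBrandName so that the brand-name query is skipped on processors without CPUID.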
Re: [fpc-devel] LEA instruction speed
On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

> I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog's instruction tables are correct. Also, results for older processors can be hard to come by sometimes.
>
> Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).
>
> In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated.
>
> Thanks again everyone for the results you're giving.

Alright, fine (I modified your test to include the CPU name as well if possible and added an IFDEFed distinction of 32-bits versus 64-bits):

32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
Pascal control case: 0.85 ns/call
Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
Pascal control case: 0.85 ns/call
Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call

32-bits:
CPU = AMD Athlon(tm) Processor
Pascal control case: 6.10 ns/call
Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call

32-bits: (AMD DX4 100 MHz - no CPUID name)
Pascal control case: 123 ns/call
Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas

> On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:
>> On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:
>>> On 10-10-2023 at 11:13, J. Gareth Moreton via fpc-devel wrote:
>>>> Thanks Tomas,
>>>>
>>>> Nothing is broken, but the timing measurement isn't precise enough.
>>>>
>>>> Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.
>>>
>>> I had the same problem, and now it is stable
>>>
>>> Ryzen 5700X (ZEN3)
>>> Pascal control case: 0.7 ns/call
>>> Using LEA instruction: 0.4 ns/call
>>> Using ADD instructions: 0.7 ns/call
>>
>> Indeed, it's much more consistent now, attached a new log for both 32-bit and 64-bit versions from the Intel machine with Windows. Apparently, ADD is still somewhat faster on such "newer" Intel machines (at least if not considering the potential parallelism of LEA discussed previously). I can try this version on my AMD machines later tonight if considered useful - please, let me know which results would be relevant for you in that case (out of the ancient AMD DX4, only slightly less ancient AMD Athlon 1 GHz and the still rather reasonable AMD A9).
>>
>> Tomas
Re: [fpc-devel] LEA instruction speed
I'm all for receiving results for all kinds of processor, as it helps me to make more informed choices on flags as well as confirming that Agner Fog's instruction tables are correct. Also, results for older processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default "Athlon" option lines up with this. Of the Intel architectures, the speed slows down on COREAVX onwards (COREI is fine), so I added a new COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark the point where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.

Kit

P.S. With regard to parallelisation, i.e. having LEA instructions run alongside other arithmetic/logical operations, that will be an interesting field of research. At the very least, the post-peephole stage can change ADD or SUB into a LEA if using an AGU over an ALU appears to give a micro-optimisation. It also benefits hyperthreading, as the ALUs tend to be very heavily used, while AGUs tend to be used one at a time.

On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:

> On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:
>> On 10-10-2023 at 11:13, J. Gareth Moreton via fpc-devel wrote:
>>> Thanks Tomas,
>>>
>>> Nothing is broken, but the timing measurement isn't precise enough.
>>>
>>> Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.
>>
>> I had the same problem, and now it is stable
>>
>> Ryzen 5700X (ZEN3)
>> Pascal control case: 0.7 ns/call
>> Using LEA instruction: 0.4 ns/call
>> Using ADD instructions: 0.7 ns/call
>
> Indeed, it's much more consistent now, attached a new log for both 32-bit and 64-bit versions from the Intel machine with Windows. Apparently, ADD is still somewhat faster on such "newer" Intel machines (at least if not considering the potential parallelism of LEA discussed previously). I can try this version on my AMD machines later tonight if considered useful - please, let me know which results would be relevant for you in that case (out of the ancient AMD DX4, only slightly less ancient AMD Athlon 1 GHz and the still rather reasonable AMD A9).
>
> Tomas
Re: [fpc-devel] LEA instruction speed
On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote: Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel: Thanks Tomas, Nothing is broken, but the timing measurement isn't precise enough. Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you. I had the same problem, and now it is stable Ryzen 5700X (ZEN3) Pascal control case: 0.7 ns/call Using LEA instruction: 0.4 ns/call Using ADD instructions: 0.7 ns/call Indeed, it's much more consistent now, attached a new log for both 32-bit and 64-bit versions from the Intel machine with Windows. Apparently, ADD is still somewhat faster on such "newer" Intel machines (at least if not considering the potential parallelism of LEA discussed previously). I can try this version on my AMD machines later tonight if considered useful - please, let me know which results would be relevant for you in that case (out of the ancient AMD DX4, only slightly less ancient AMD Athlon 1 GHz and the still rather reasonable AMD A9). 
Tomas

32-bit version, 10 runs in a row using a command-shell "for" loop (all times in ns/call):

Run   Pascal control   LEA    ADD
  1   0.85             1.11   0.74
  2   0.95             0.95   0.81
  3   0.91             0.98   0.83
  4   0.90             1.12   0.78
  5   0.87             1.03   0.71
  6   0.87             1.03   0.79
  7   0.81             1.20   0.92
  8   0.97             1.01   0.74
  9   0.92             0.99   0.81
 10   0.90             1.00   0.77

64-bit version, 10 runs in a row using a command-shell "for" loop (all times in ns/call):

CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz

Run   Pascal control   LEA    ADD
  1   1.04             1.09   0.82
  2   1.07             1.07   0.71
  3   0.98             1.07   0.80
  4   1.11             1.09   0.75
  5   0.98             1.02   0.78
  6   1.09             1.13   0.69
  7   0.98             1.11   0.81
  8   0.95             1.07   0.71
  9   1.04             1.01   0.70
 10   1.05             0.99   0.71
Re: [fpc-devel] LEA instruction speed
On 10-10-2023 at 11:13, J. Gareth Moreton via fpc-devel wrote:

> Thanks Tomas,
>
> Nothing is broken, but the timing measurement isn't precise enough.
>
> Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.

I had the same problem, and now it is stable

Ryzen 5700X (ZEN3)
Pascal control case: 0.7 ns/call
Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call
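The time-budget idea attributed to Rika in the quoted message (run until about 5 seconds have elapsed) could be sketched as follows. This is an illustration only, not the code in the merge request: TimeBudgeted and BatchSize are hypothetical names, the clock is read once per batch to keep the timing overhead small, and it is an assumption that "ns/call" in the test means time per inner loop pass. Checksum_PAS is copied from the attached test.

```pascal
program timebudget;
{$MODE OBJFPC}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

{ Pascal control case, as in the blea test }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

{ Calls Proc in batches until at least BudgetMs milliseconds have passed,
  then returns the average time per inner loop pass in nanoseconds. }
function TimeBudgeted(Proc: TBenchmarkProc; BudgetMs: QWord): Double;
const
  InternalReps = 1000;  { passes per call, as in the original test }
  BatchSize = 100;      { calls between two clock reads (assumed value) }
var
  StartTick, Elapsed, Calls: QWord;
  Sink, I: LongWord;
begin
  Calls := 0;
  Sink := 0;
  StartTick := GetTickCount64;
  repeat
    { Feed each result into the next call so the calls form a chain }
    for I := 1 to BatchSize do
      Sink := Proc(Sink, I, InternalReps);
    Inc(Calls, BatchSize);
    Elapsed := GetTickCount64 - StartTick;
  until Elapsed >= BudgetMs;
  { milliseconds -> nanoseconds, divided by the total number of passes }
  Result := (Elapsed * 1.0e6) / (Calls * InternalReps);
end;

begin
  WriteLn('Pascal control case: ',
    TimeBudgeted(@Checksum_PAS, 5000):0:2, ' ns/call');
end.
```

With this shape the run time is bounded by the budget plus at most one batch, so a slow machine simply performs fewer calls. The millisecond resolution of GetTickCount64 is coarse, which is why the budget is kept at several seconds.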
Re: [fpc-devel] LEA instruction speed
Ooo, that might be just what we need. Thank you Stefan. Kit On 10/10/2023 10:57, Stefan Glienke via fpc-devel wrote: Be my guest making https://github.com/spring4d/benchmark compatible for all platforms you need it for. On 10/10/2023 11:13 CEST J. Gareth Moreton via fpc-devel wrote: Thanks Tomas, Nothing is broken, but the timing measurement isn't precise enough. Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete for a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds elapses, would help but the timing measurements would cause a lot of latency and will be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you. Kit On 10/10/2023 09:57, Tomas Hajny wrote: On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote: Hi Kit, I updated the "blea" test in the merge request so it now displays the processor brand name on x86_64; however, it is not fetched under i386 because CPUID was not introduced until later 486 processors. I've attached it to this e-mail if anyone wants to take a look to ensure I haven't broken something. I don't know what's broken, but the results vary so much on a fast machine that they are unusable for any measurement from my point of view (standard 3.2.2 compiler, compiled with -O4 and running under MS Windows this time). Sometimes the ADD version shows 0.0 ns/call, sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call (64-bits). See the attached results (the CPU is only displayed for the 64-bit compilation, but it's obviously the same CPU). Tomas On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote: Thank you very much! 
That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint). I honestly didn't expect this much testing feedback, so thank you all! Gareth aka. Kit P.S. I'm tempted to extend the test slightly to actually name the CPU automatically. On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote: My results: jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name" model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64 Copyright (c) 1993-2021 by Florian Klaempfl and others Target OS: Linux for x86-64 Compiling blea.pp Linking blea 95 lines compiled, 0.2 sec jean@First-Boss:~/temp$ ./blea Pascal control case: 5.1 ns/call Using LEA instruction: 0.5 ns/call Using ADD instructions: 0.8 ns/call jean@First-Boss:~/temp$ ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
Be my guest making https://github.com/spring4d/benchmark compatible for all platforms you need it for. > On 10/10/2023 11:13 CEST J. Gareth Moreton via fpc-devel > wrote: > > > Thanks Tomas, > > Nothing is broken, but the timing measurement isn't precise enough. > > Normally I have a much higher iteration count (e.g. 1,000,000), but I > had reduced it to 10,000 because, coupled with the 1,000 iterations in > the subroutines themselves, would have led to 1,000,000,000 passes and > hence would take in the region of five to ten minutes to complete for a > 16 MHz 386, for example. Rika's suggestion of running as many > iterations as needed until, say, 5 seconds elapses, would help but the > timing measurements would cause a lot of latency and will be imprecise > on very slow routines. Still, let's see if 100,000 gives better results > for you. > > Kit > > On 10/10/2023 09:57, Tomas Hajny wrote: > > On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote: > > > > > > Hi Kit, > > > >> I updated the "blea" test in the merge request so it now displays the > >> processor brand name on x86_64; however, it is not fetched under i386 > >> because CPUID was not introduced until later 486 processors. I've > >> attached it to this e-mail if anyone wants to take a look to ensure I > >> haven't broken something. > > > > I don't know what's broken, but the results vary so much on a fast > > machine that they are unusable for any measurement from my point of > > view (standard 3.2.2 compiler, compiled with -O4 and running under MS > > Windows this time). Sometimes the ADD version shows 0.0 ns/call, > > sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call > > (64-bits). See the attached results (the CPU is only displayed for the > > 64-bit compilation, but it's obviously the same CPU). > > > > Tomas > > > > > >> > >> On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote: > >>> Thank you very much! 
That processor is built on the Excavator > >>> architecture and lines up with the flag I put in the merge request > >>> (i.e. it has the "fast LEA" hint). > >>> > >>> I honestly didn't expect this much testing feedback, so thank you all! > >>> > >>> Gareth aka. Kit > >>> > >>> P.S. I'm tempted to extend the test slightly to actually name the > >>> CPU automatically. > >>> > >>> On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote: > My results: > jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name" > model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G > jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp > Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64 > Copyright (c) 1993-2021 by Florian Klaempfl and others > Target OS: Linux for x86-64 > Compiling blea.pp > Linking blea > 95 lines compiled, 0.2 sec > jean@First-Boss:~/temp$ ./blea > Pascal control case: 5.1 ns/call > Using LEA instruction: 0.5 ns/call > Using ADD instructions: 0.8 ns/call > jean@First-Boss:~/temp$ > > ___ > fpc-devel maillist - fpc-devel@lists.freepascal.org > https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel > > >>> ___ > >>> fpc-devel maillist - fpc-devel@lists.freepascal.org > >>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel > >>> > >> ___ > >> fpc-devel maillist - fpc-devel@lists.freepascal.org > >> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel___ > fpc-devel maillist - fpc-devel@lists.freepascal.org > https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
Looking at the text log, the results are a bit strange and I can't easily explain it. Normally a system interrupt would increase the time taken. Let me know if increasing the iteration count fixes it or not. Kit On 10/10/2023 09:57, Tomas Hajny wrote: On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote: Hi Kit, I updated the "blea" test in the merge request so it now displays the processor brand name on x86_64; however, it is not fetched under i386 because CPUID was not introduced until later 486 processors. I've attached it to this e-mail if anyone wants to take a look to ensure I haven't broken something. I don't know what's broken, but the results vary so much on a fast machine that they are unusable for any measurement from my point of view (standard 3.2.2 compiler, compiled with -O4 and running under MS Windows this time). Sometimes the ADD version shows 0.0 ns/call, sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call (64-bits). See the attached results (the CPU is only displayed for the 64-bit compilation, but it's obviously the same CPU). Tomas On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote: Thank you very much! That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint). I honestly didn't expect this much testing feedback, so thank you all! Gareth aka. Kit P.S. I'm tempted to extend the test slightly to actually name the CPU automatically. 
On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote: My results: jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name" model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64 Copyright (c) 1993-2021 by Florian Klaempfl and others Target OS: Linux for x86-64 Compiling blea.pp Linking blea 95 lines compiled, 0.2 sec jean@First-Boss:~/temp$ ./blea Pascal control case: 5.1 ns/call Using LEA instruction: 0.5 ns/call Using ADD instructions: 0.8 ns/call jean@First-Boss:~/temp$ ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I had reduced it to 10,000 because the higher count, coupled with the 1,000 iterations in the subroutines themselves, would have led to 1,000,000,000 passes and hence would take in the region of five to ten minutes to complete on a 16 MHz 386, for example. Rika's suggestion of running as many iterations as needed until, say, 5 seconds have elapsed would help, but the timing measurements would cause a lot of latency and would be imprecise on very slow routines. Still, let's see if 100,000 gives better results for you.

Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

> On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:
>
> Hi Kit,
>
>> I updated the "blea" test in the merge request so it now displays the processor brand name on x86_64; however, it is not fetched under i386 because CPUID was not introduced until later 486 processors. I've attached it to this e-mail if anyone wants to take a look to ensure I haven't broken something.
>
> I don't know what's broken, but the results vary so much on a fast machine that they are unusable for any measurement from my point of view (standard 3.2.2 compiler, compiled with -O4 and running under MS Windows this time). Sometimes the ADD version shows 0.0 ns/call, sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call (64-bits). See the attached results (the CPU is only displayed for the 64-bit compilation, but it's obviously the same CPU).
>
> Tomas
>
>> On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
>>> Thank you very much! That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint).
>>>
>>> I honestly didn't expect this much testing feedback, so thank you all!
>>>
>>> Gareth aka. Kit
>>>
>>> P.S. I'm tempted to extend the test slightly to actually name the CPU automatically.
>>>
>>> On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:
>>>> My results:
>>>> jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
>>>> model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
>>>> jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
>>>> Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
>>>> Copyright (c) 1993-2021 by Florian Klaempfl and others
>>>> Target OS: Linux for x86-64
>>>> Compiling blea.pp
>>>> Linking blea
>>>> 95 lines compiled, 0.2 sec
>>>> jean@First-Boss:~/temp$ ./blea
>>>> Pascal control case: 5.1 ns/call
>>>> Using LEA instruction: 0.5 ns/call
>>>> Using ADD instructions: 0.8 ns/call
>>>> jean@First-Boss:~/temp$

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX
  MOV  EAX, $80000000          { query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004          { brand string uses leaves $80000002..$80000004 }
  JB   @Unavailable
  LEA  R8, [RIP + CPUName]
  MOV  EAX, $80000002
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0   { terminate the 48-character brand string }
  MOV  AL, 1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL, AL
@ExitBrand:
  POP  RBX
end;
{$else CPUX86_64}
{ CPUID is not queried on i386 in this version }
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
Re: [fpc-devel] LEA instruction speed
On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:

Hi Kit,

> I updated the "blea" test in the merge request so it now displays the
> processor brand name on x86_64; however, it is not fetched under i386
> because CPUID was not introduced until later 486 processors. I've
> attached it to this e-mail if anyone wants to take a look to ensure I
> haven't broken something.

I don't know what's broken, but the results vary so much on a fast machine that they are unusable for any measurement from my point of view (standard 3.2.2 compiler, compiled with -O4 and running under MS Windows this time). Sometimes the ADD version shows 0.0 ns/call, sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call (64-bits). See the attached results (the CPU is only displayed for the 64-bit compilation, but it's obviously the same CPU).

Tomas

On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:

Thank you very much! That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint). I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU automatically.
On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:

jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
Pascal control case: 5.1 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

32-bit version, 10 runs in a row using a command shell for cycle, LEA before ADD (original version):

Pascal control case: 0.9 ns/call
Using LEA instruction: 0.0 ns/call
Using ADD instructions: 1.6 ns/call

Pascal control case: 0.4 ns/call
Using LEA instruction: 1.5 ns/call
Using ADD instructions: 0.0 ns/call

Pascal control case: 0.1 ns/call
Using LEA instruction: 1.6 ns/call
Using ADD instructions: 1.2 ns/call

Pascal control case: 0.2 ns/call
Using LEA instruction: 1.8 ns/call
Using ADD instructions: 0.0 ns/call

Pascal control case: 0.2 ns/call
Using LEA instruction: 1.0 ns/call
Using ADD instructions: 1.6 ns/call

Pascal control case: 0.2 ns/call
Using LEA instruction: 1.6 ns/call
Using ADD instructions: 0.0 ns/call

Pascal control case: 0.2 ns/call
Using LEA instruction: 1.8 ns/call
Using ADD instructions: 0.0 ns/call

Pascal control case: 0.1 ns/call
Using LEA instruction: 1.6 ns/call
Using ADD instructions: 0.8 ns/call

Pascal control case: 1.1 ns/call
Using LEA instruction: 0.1 ns/call
Using ADD instructions: 1.6 ns/call

Pascal control case: 0.2 ns/call
Using LEA instruction: 1.5 ns/call
Using ADD instructions: 0.0 ns/call

64-bit version, 10 runs in a row using a command shell for cycle, LEA before ADD (original version):

CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
Pascal control case: 0.6 ns/call
Using LEA instruction:
Re: [fpc-devel] LEA instruction speed
My results on Windows:

E:\temp>C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.exe -MObjFPC -Scghi -O1 -g -gl -l -vewnhibq -Fu. -FUlib\x86_64-win64 -FE. -oblea.exe blea.pp
Hint: (11030) Start of reading config file C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.cfg
Hint: (11031) End of reading config file C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.cfg
Free Pascal Compiler version 3.2.2 [2022/09/24] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
(1002) Target OS: Win64 for x64
(3104) Compiling blea.pp
(9015) Linking .\blea.exe
(1008) 150 lines compiled, 0.2 sec, 87840 bytes code, 5364 bytes data
(1022) 2 hint(s) issued

E:\temp>blea
CPU = Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
-
Pascal control case: 1.9 ns/call
Using LEA instruction: 1.2 ns/call
Using ADD instructions: 0.9 ns/call

E:\temp>

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
I updated the "blea" test in the merge request so it now displays the processor brand name on x86_64; however, it is not fetched under i386 because CPUID was not introduced until later 486 processors. I've attached it to this e-mail if anyone wants to take a look to ensure I haven't broken something.

Kit

On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:

Thank you very much! That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint). I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU automatically.

On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:

jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
Pascal control case: 5.1 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
{ Copies the 48-byte CPUID brand string (extended leaves $80000002 to
  $80000004) into CPUName; returns False if those leaves are unsupported. }
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX
  MOV  EAX, $80000000
  CPUID
  CMP  EAX, $80000004
  JB   @Unavailable
  LEA  R8, [RIP + CPUName]
  MOV  EAX, $80000002
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
  MOV  AL, 1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL, AL
@ExitBrand:
  POP  RBX
end;
{$else CPUX86_64}
{ The brand string is not fetched on i386. }
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif CPUX86_64}

{ Pure Pascal control case. }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

{ Two ADD instructions per iteration. }
function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

{ One three-component LEA per iteration. }
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    { Overwrite the name with dashes to print an underline of equal length. }
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;

  Results[0] := Benchmark(' Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;
  if (Results[0] <> Results[1]) then
  begin
    WriteLn('ERROR: Checksum_LEA doesn''t match control case');
    FailureCode := FailureCode or 1;
  end;
  if (Results[0] <> Results[2]) then
  begin
    WriteLn('ERROR: Checksum_ADD doesn''t match control case');
    FailureCode := FailureCode or 2
  end;
  if FailureCode <> 0 then
    Halt(FailureCode);
end.

___ fpc-devel maillist - fpc-devel@lists.freepascal.org
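A minimal sketch of how the Benchmark routine in blea above could be made less sensitive to timer resolution. With `until (reps >= 1)` only 1000 loop iterations are timed using Now, which on a fast machine may complete within a single tick of the clock; that is a plausible (not confirmed) cause of 0.0 ns/call readings. The version below keeps running batches until a minimum wall time has passed and reads the clock once per batch. TimedBenchmark, BatchSize and MinMillis are names and values assumed for illustration and are not part of the posted test; GetTickCount64 comes from SysUtils, and Checksum_PAS stands in for the routines under test.

```pascal
program timedbench;

{$MODE OBJFPC}
{$Q-}{$R-} // the checksum relies on 32-bit wraparound

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

{ Pure Pascal stand-in for the routines under test. }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while Counter > 0 do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

{ Hypothetical replacement for Benchmark: runs until MinMillis have elapsed. }
function TimedBenchmark(const Name: string; Proc: TBenchmarkProc;
  Z, X: LongWord): LongWord;
const
  InternalReps = 1000; // loop iterations per call, as in blea
  BatchSize    = 256;  // assumed: calls between clock reads
  MinMillis    = 500;  // assumed: minimum wall time to measure
var
  StartTick, Elapsed, Reps: QWord;
  I: Integer;
begin
  Result := Z;
  Reps := 0;
  StartTick := GetTickCount64;
  repeat
    for I := 1 to BatchSize do
      // Always start from Z so the checksum does not depend on the rep count.
      Result := Proc(Z, X, InternalReps);
    Inc(Reps, BatchSize);
    Elapsed := GetTickCount64 - StartTick;
  until Elapsed >= MinMillis;
  // Elapsed is in milliseconds; 1 ms = 1e6 ns.
  WriteLn(Name, ': ', (Elapsed * 1e6) / Reps / InternalReps:0:2, ' ns/call');
end;

var
  R: LongWord;
begin
  R := TimedBenchmark('Pascal control case', @Checksum_PAS, 500, 1000);
  WriteLn('Checksum: ', R);
end.
```

Reading the clock once per batch rather than once per call keeps the timing overhead out of the measured loop, and passing Z on every call keeps the returned checksum comparable between variants regardless of how many repetitions fit into the time window.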
Re: [fpc-devel] LEA instruction speed
Thank you very much! That processor is built on the Excavator architecture and lines up with the flag I put in the merge request (i.e. it has the "fast LEA" hint). I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU automatically.

On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:

jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
Pascal control case: 5.1 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
My results:

jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
Pascal control case: 5.1 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
Thank you for the report. According to Agner Fog's table, complex LEA instructions should have a 3-cycle latency on that architecture (Haswell). Optimisations with this instruction are proving interesting because there is so much variety between processor architectures. Some are fine with three components, but slow right down if a scale factor is used.

Kit

On 09/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi Gareth

model name : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

Regards

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326

On Sun, Oct 8, 2023 at 6:40 PM J. Gareth Moreton via fpc-devel wrote:

Hi Nataraj

Which processor was that run on? (Although too close to call, it implies LEA has a latency of 2 in that case.)

Kit

On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning: /usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing .note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
Pascal control case: 6.7 ns/call
Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326

On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel wrote:

That's interesting; I am interested to see the assembly output for the Pascal control cases. As for the 64-bit version, that was my fault since the assembly language is for Microsoft's ABI rather than the System V ABI, so it was checking a register with an undefined value. Find attached the fixed test.

Kit

P.S.
Results on my Intel(R) Core(TM) i7-10750H Pascal control case: 2.0 ns/call Using LEA instruction: 1.7 ns/call Using ADD instructions: 1.3 ns/call On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote: > On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote: > > > Hi Kit, > >> Do you think this should suffice? Originally it ran for 1,000,000 >> repetitions but I fear that will take way too long on a 486, so I >> reduced it to 10,000. > > OK, I tried it now. First of all, after turning on the old machine, I > realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad > memory. :-( I compiled and ran the test under OS/2 there (I was too > lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any > substantial difference. The ADD and LEA results were basically the > same there, both around 100 ns / call. The Pascal result was around > twice as long. Interestingly, the Pascal result for FPC 3.2.2 was > around 10% longer than the same source compiled with FPC 2.0.3 (the > assembler versions were obviously the same for both FPC versions; I > tried compiling it also with FPC 1.0.10 and the assembler versions > were more than three times slower due to missing support for the > nostackframe directive). > > I tested it under the AMD Athlon 1 GHz machine as well and again, the > results for LEA and ADD are basically equal (both 3.1 ns/call) and the > result for Pascal slightly more than twice (7.3 ns/call). However, > rather surprisingly for me, the overall test run was _much_ longer > there?! Finally, I tried compiling the test on a 64-bit machine (AMD > A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from > a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the > assembler version runs forever - well, certainly much longer than my > patience lasts. I haven't tried to analyze the reasons, but that's > what I get. 
> > Tomas > > > >> >> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote: >>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" >>> wrote: >>> >>> >>> Hii Kit, >>> This is mainly to Florian, but also to anyone else who can answer the
Re: [fpc-devel] LEA instruction speed
Hi Gareth model name : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz Regards Nataraj S Narayan Synergy Info Systems Software & Technology Consultants Ettumanoor, INDIA Ph:+91 9443211326 On Sun, Oct 8, 2023 at 6:40 PM J. Gareth Moreton via fpc-devel < fpc-devel@lists.freepascal.org> wrote: > Hi Nataraj > > Which processor is that run on? (although too close to call, it implies > LEA has a latency of 2 in that case) > > Kit > On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote: > > Hi > > [nataraj@dflyHP ~]$ fpc ttt.pas > Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64 > Copyright (c) 1993-2021 by Florian Klaempfl and others > Target OS: DragonFly for x86-64 > Compiling ttt.pas > Linking ttt > /usr/local/bin/ld.bfd: warning: > /usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing > .note.GNU-stack section implies executable stack > /usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be > removed in a future version of the linker > 121 lines compiled, 14.9 sec > [nataraj@dflyHP ~]$ ./ttt >Pascal control case: 6.7 ns/call > Using LEA instruction: 4.2 ns/call > Using ADD instructions: 4.0 ns/call > > > Nataraj S Narayan > Synergy Info Systems > Software & Technology Consultants > Ettumanoor, INDIA > Ph:+91 9443211326 > > > On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel < > fpc-devel@lists.freepascal.org> wrote: > >> That's interesting; I am interested to see the assembly output for the >> Pascal control cases. As for the 64-bit version, that was my fault >> since the assembly language is for Microsoft's ABI rather than the >> System V ABI, so it was checking a register with an undefined value. >> Find attached the fixed test. >> >> Kit >> >> P.S. Results on my Intel(R) Core(TM) i7-10750H >> >> Pascal control case: 2.0 ns/call >> Using LEA instruction: 1.7 ns/call >> Using ADD instructions: 1.3 ns/call >> >> On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote: >> > On 2023-10-07 03:57, J. 
Gareth Moreton via fpc-devel wrote: >> > >> > >> > Hi Kit, >> > >> >> Do you think this should suffice? Originally it ran for 1,000,000 >> >> repetitions but I fear that will take way too long on a 486, so I >> >> reduced it to 10,000. >> > >> > OK, I tried it now. First of all, after turning on the old machine, I >> > realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad >> > memory. :-( I compiled and ran the test under OS/2 there (I was too >> > lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any >> > substantial difference. The ADD and LEA results were basically the >> > same there, both around 100 ns / call. The Pascal result was around >> > twice as long. Interestingly, the Pascal result for FPC 3.2.2 was >> > around 10% longer than the same source compiled with FPC 2.0.3 (the >> > assembler versions were obviously the same for both FPC versions; I >> > tried compiling it also with FPC 1.0.10 and the assembler versions >> > were more than three times slower due to missing support for the >> > nostackframe directive). >> > >> > I tested it under the AMD Athlon 1 GHz machine as well and again, the >> > results for LEA and ADD are basically equal (both 3.1 ns/call) and the >> > result for Pascal slightly more than twice (7.3 ns/call). However, >> > rather surprisingly for me, the overall test run was _much_ longer >> > there?! Finally, I tried compiling the test on a 64-bit machine (AMD >> > A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from >> > a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the >> > assembler version runs forever - well, certainly much longer than my >> > patience lasts. I haven't tried to analyze the reasons, but that's >> > what I get. >> > >> > Tomas >> > >> > >> > >> >> >> >> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote: >> >>> On October 3, 2023 03:32:34 +0200, "J. 
Gareth Moreton via fpc-devel" >> >>> wrote: >> >>> >> >>> >> >>> Hii Kit, >> >>> >> This is mainly to Florian, but also to anyone else who can answer >> the question - at which point did a complex LEA instruction (using >> all three input operands and some other specific circumstances) get >> slow? Preliminary research suggests the 486 was when it gained >> extra latency, and then Sandy Bridge when it got particularly bad. >> Icy Lake seems to be the architecture where faster LEA instructions >> are reintroduced, but I'm not sure about AMD processors. >> >>> I cannot answer your question, but if you prepare a test program, I >> >>> can run it on an Intel 486 DX2 100 Mhz and AMD Athlon 1 GHz machines >> >>> if it helps you in any way (at least I hope the 486 DX2 machine >> >>> should be still able to start ;-) ). >> >>> >> >>> Tomas >> >>> >> >>> ___ >> >>> fpc-devel maillist - fpc-devel@lists.freepascal.org >> >>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel >> >>> >> >>
Re: [fpc-devel] LEA instruction speed
On 2023-10-08 13:45, J. Gareth Moreton via fpc-devel wrote:

> Sorry, ignore last attachment - I forgot to change a line of assembly
> (it was correct for x86_64-win64!!). Here is the corrected version.

Alright, results for this version for AMD A9 9425 under Linux (the same trunk compiler as used yesterday):

64-bit version:
Pascal control case: 5.1 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.9 ns/call

32-bit version:
Pascal control case: 0.9 ns/call
Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.9 ns/call

Tomas

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
I did some checking of the test I copied the code from, and I forgot that Rika's original code only exited once a certain time period had elapsed (e.g. 0.5 seconds). I had changed it to a fixed iteration count since I was concerned about fairness and accuracy, but I only changed the loop condition and nothing else.

Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:

1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to the execution time per call.

2. The Pascal version does not match the assembler version. I had to fix it:

    //Result := X + Counter + $87654321;
    Result:=Result + X + $87654321;
    Result:=Result xor y;

3. The assembler functions can be unified to work under win64, win32, linux 64 and linux 32:

    function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
    asm
    @Loop2:
      LEA Input, [Input + X + $87654321]
      XOR Input, y
      DEC y
      JNZ @Loop2
      MOV EAX, Input
    end;

4. My results. Ryzen 2700x:

    Pascal control case: 0.7 ns/call 0.0710
    Using LEA instruction: 0.7 ns/call 0.0700
    Using ADD instructions: 0.7 ns/call 0.0710

Even though the results are equal, I was able to add 4 independent ADD instructions around LEA while the results didn't change, but only 2 around ADD.

___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
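As a rough illustration of point 2 above (the control case has to compute the same value as the assembler variants), here is a small pure-Pascal model of the two formulations. It is only a sketch: Checksum_TwoAdds and Checksum_OneSum are hypothetical names that mirror the ADD and LEA loops in Pascal rather than assembler, so it checks the arithmetic and says nothing about instruction timing.

```pascal
program checksum_model;

{$MODE OBJFPC}
{$Q-}{$R-} // rely on 32-bit wraparound, as the assembler versions do

{ Mirrors the ADD variant: two separate additions, then XOR with the counter. }
function Checksum_TwoAdds(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
  begin
    Result := Result + $87654321;
    Result := Result + X;
    Result := Result xor Y;
    Dec(Y);
  end;
end;

{ Mirrors the LEA variant: one three-component sum, then XOR with the counter. }
function Checksum_OneSum(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
  begin
    Result := (Result + X + $87654321) xor Y;
    Dec(Y);
  end;
end;

begin
  // Same arguments as the benchmark calls in blea: Input = 500, X = 1000, Y = 1000.
  if Checksum_TwoAdds(500, 1000, 1000) <> Checksum_OneSum(500, 1000, 1000) then
  begin
    WriteLn('ERROR: variants disagree');
    Halt(1);
  end;
  WriteLn('Variants agree');
end.
```

One difference from the assembler loops is worth keeping in mind: the `DEC` / `JNZ` form always executes the body at least once, so the model only corresponds to it for Y > 0.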
Re: [fpc-devel] LEA instruction speed
In the meantime, here's the merge request for the feature based on user tests and studying of Agner Fog's instruction tables: https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502 Kit ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] LEA instruction speed
Hi Nataraj Which processor is that run on? (although too close to call, it implies LEA has a latency of 2 in that case) Kit On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote: Hi [nataraj@dflyHP ~]$ fpc ttt.pas Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64 Copyright (c) 1993-2021 by Florian Klaempfl and others Target OS: DragonFly for x86-64 Compiling ttt.pas Linking ttt /usr/local/bin/ld.bfd: warning: /usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing .note.GNU-stack section implies executable stack /usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be removed in a future version of the linker 121 lines compiled, 14.9 sec [nataraj@dflyHP ~]$ ./ttt Pascal control case: 6.7 ns/call Using LEA instruction: 4.2 ns/call Using ADD instructions: 4.0 ns/call Nataraj S Narayan Synergy Info Systems Software & Technology Consultants Ettumanoor, INDIA Ph:+91 9443211326 On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel wrote: That's interesting; I am interested to see the assembly output for the Pascal control cases. As for the 64-bit version, that was my fault since the assembly language is for Microsoft's ABI rather than the System V ABI, so it was checking a register with an undefined value. Find attached the fixed test. Kit P.S. Results on my Intel(R) Core(TM) i7-10750H Pascal control case: 2.0 ns/call Using LEA instruction: 1.7 ns/call Using ADD instructions: 1.3 ns/call On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote: > On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote: > > > Hi Kit, > >> Do you think this should suffice? Originally it ran for 1,000,000 >> repetitions but I fear that will take way too long on a 486, so I >> reduced it to 10,000. > > OK, I tried it now. First of all, after turning on the old machine, I > realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad > memory. 
:-( I compiled and ran the test under OS/2 there (I was too > lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any > substantial difference. The ADD and LEA results were basically the > same there, both around 100 ns / call. The Pascal result was around > twice as long. Interestingly, the Pascal result for FPC 3.2.2 was > around 10% longer than the same source compiled with FPC 2.0.3 (the > assembler versions were obviously the same for both FPC versions; I > tried compiling it also with FPC 1.0.10 and the assembler versions > were more than three times slower due to missing support for the > nostackframe directive). > > I tested it under the AMD Athlon 1 GHz machine as well and again, the > results for LEA and ADD are basically equal (both 3.1 ns/call) and the > result for Pascal slightly more than twice (7.3 ns/call). However, > rather surprisingly for me, the overall test run was _much_ longer > there?! Finally, I tried compiling the test on a 64-bit machine (AMD > A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from > a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the > assembler version runs forever - well, certainly much longer than my > patience lasts. I haven't tried to analyze the reasons, but that's > what I get. > > Tomas > > > >> >> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote: >>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" >>> wrote: >>> >>> >>> Hii Kit, >>> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow? Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors. 
>>> I cannot answer your question, but if you prepare a test program, I >>> can run it on an Intel 486 DX2 100 Mhz and AMD Athlon 1 GHz machines >>> if it helps you in any way (at least I hope the 486 DX2 machine >>> should be still able to start ;-) ). >>> >>> Tomas >>> >>> ___ >>> fpc-devel maillist - fpc-devel@lists.freepascal.org >>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel >>> >> ___ >> fpc-devel maillist - fpc-devel@lists.freepascal.org >> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel > ___ > fpc-devel maillist -
Re: [fpc-devel] LEA instruction speed
Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning: /usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing .note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
   Pascal control case: 6.7 ns/call
 Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Re: [fpc-devel] LEA instruction speed
Sorry, ignore last attachment - I forgot to change a line of assembly (it was correct for x86_64-win64!!). Here is the corrected version.

Kit

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;

  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
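As archived, the Benchmark routine performs a single repetition (`until (reps >= 1)`) and times it with `Now`. The following is only a sketch of one possible alternative, not anything posted in the thread: it repeats the call in batches until a minimum wall-clock time has passed, and reads the clock only between batches, so the timer stays out of the measured inner loop. `NsPerCall`, `batch` and `min_ms` are hypothetical names and values chosen for illustration; `GetTickCount64` is assumed to be available from SysUtils, as it is in FPC 3.x.

```pascal
program benchsketch;

{$MODE OBJFPC}
{$Q-}{$R-}  { the checksum relies on wrap-around arithmetic }

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

{ Pascal control case, as in the corrected listing }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

{ Hypothetical helper: returns nanoseconds per inner-loop iteration }
function NsPerCall(proc: TBenchmarkProc; Z, X: LongWord; out Checksum: LongWord): Double;
const
  internal_reps = 1000;  { inner iterations per call, as in the thread }
  batch = 1000;          { assumption: calls made between two clock reads }
  min_ms = 200;          { assumption: minimum measured run time }
var
  start, elapsed, reps: QWord;
  i: Integer;
begin
  Checksum := Z;
  reps := 0;
  start := GetTickCount64;
  repeat
    for i := 1 to batch do
      Checksum := proc(Checksum, X, internal_reps);
    Inc(reps, batch);
    elapsed := GetTickCount64 - start;   { clock is read once per batch only }
  until elapsed >= min_ms;
  { milliseconds -> nanoseconds, then divide by the total iteration count }
  Result := (elapsed * 1e6) / reps / internal_reps;
end;

var
  Sum: LongWord;
begin
  WriteLn('Pascal control case: ', NsPerCall(@Checksum_PAS, 500, 1000, Sum):0:2, ' ns/call');
  WriteLn('Checksum after timed run: ', Sum);
end.
```

Because the number of repetitions now depends on elapsed time, the checksum returned by a timed run differs between variants, so the LEA/ADD/Pascal comparison at the end of the test would need a separate run with a fixed repetition count.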
Re: [fpc-devel] LEA instruction speed
Sorry, I got careless and was in a rush: the Pascal code was wrong, and I didn't store the result of the benchmark test, so the error check at the end returned a false negative. The benchmark code was from Rika's SHA-1 test code, which I didn't properly check, although I assumed the logic was to avoid counting the time of the internal loop as much as possible. I should have gone with my gut instinct and realised that wasn't the best method.

I've attached the updated test (now called "blea" as it's a benchmark test) with your suggestions implemented, and an improved benchmarking system. I'm not used to specifying parameters in place of registers - I'm too used to needing total control!

Your results from experiments with adding additional ADD instructions are expected, as LEA uses an AGU for computation, leaving the ALUs free for other tasks (like ADD), so LEA is better even if speed is equal.

Kit

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV EAX, ECX  { correct only for x86_64-win64; the follow-up correction uses MOV Result, Input }
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;

  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
Re: [fpc-devel] LEA instruction speed
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to the execution time per call.

2. The Pascal version does not match the assembler version. I had to fix it:

    //Result := X + Counter + $87654321;
    Result:=Result + X + $87654321;
    Result:=Result xor y;

3. The assembler functions can be unified to work under win64, win32, linux 64 and linux 32:

    function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
    asm
    @Loop2:
      LEA Input, [Input + X + $87654321]
      XOR Input, y
      DEC y
      JNZ @Loop2
      MOV EAX, Input
    end;

4. My results (Ryzen 2700x):

       Pascal control case: 0.7 ns/call 0.0710
     Using LEA instruction: 0.7 ns/call 0.0700
    Using ADD instructions: 0.7 ns/call 0.0710

Even though the results are equal, I was able to add 4 independent ADD instructions around LEA without the results changing, but only 2 around ADD.
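The message does not show how the extra ADD instructions were placed, so the following is only a guess at the shape of that experiment, not the code that was actually run. It takes the unified Checksum_LEA from point 3 and adds four ADDs on registers that are not part of the checksum chain. The register choice (EAX, R9D, R10D, R11D) is an assumption: these are caller-saved and unused by the three parameters under both the Win64 and System V x86_64 conventions, which is why the sketch refuses to compile for 32-bit targets. `Checksum_LEA_Padded` is a hypothetical name.

```pascal
program leapad;

{$MODE OBJFPC}
{$ASMMODE Intel}
{$Q-}{$R-}  { the Pascal reference relies on wrap-around arithmetic }

{$IFNDEF CPUX86_64}
  {$FATAL This sketch assumes x86_64 because it uses R9D, R10D and R11D as scratch registers }
{$ENDIF}

{ Pascal reference, as corrected in point 2 of the message }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

{ Checksum_LEA plus four ADDs that never read or write Input, X or Y }
function Checksum_LEA_Padded(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop:
  LEA Input, [Input + X + $87654321]
  ADD EAX, 1    { scratch: EAX is overwritten with the result below }
  ADD R9D, 1    { scratch }
  ADD R10D, 1   { scratch }
  ADD R11D, 1   { scratch }
  XOR Input, Y
  DEC Y         { sets the flags tested by JNZ }
  JNZ @Loop
  MOV EAX, Input
end;

begin
  { The padding must not change the checksum }
  if Checksum_LEA_Padded(500, 1000, 1000) = Checksum_PAS(500, 1000, 1000) then
    WriteLn('padded LEA loop matches the Pascal control case')
  else
    WriteLn('MISMATCH');
end.
```

Timing this variant against the unpadded loop, and doing the same for the ADD version, would be one way to reproduce the comparison described above; whether the numbers match those reported depends on the processor.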
Re: [fpc-devel] LEA instruction speed
I'm still slightly curious, but if full optimisations make better code, then indeed it's probably not worth the effort.

Your timings are incredibly helpful - thank you! If I understand, AMD A9 is the Excavator architecture, which implies that AMD processors don't suffer from the same latency with complex LEA instructions as Intel processors do. Looking at Agner Fog's tables, it looks like the slow LEA instructions only came about at Sandy Bridge, which for Free Pascal I think lines up with COREAVX. Even the Pentium-era processors have a 1-cycle LEA, and your testing on an AMD 486 shows it is at least as fast as two ADDs in a dependency chain.

That should be all the information I need - thanks again!

Kit
Re: [fpc-devel] LEA instruction speed
On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:
> That's interesting; I am interested to see the assembly output for the Pascal control cases. As for the 64-bit version, that was my fault since the assembly language is for Microsoft's ABI rather than the System V ABI, so it was checking a register with an undefined value. Find attached the fixed test.
>
> Kit
>
> P.S. Results on my Intel(R) Core(TM) i7-10750H
>
>    Pascal control case: 2.0 ns/call
>  Using LEA instruction: 1.7 ns/call
> Using ADD instructions: 1.3 ns/call

OK. My results for the AMD A9 CPU mentioned previously and 32-bit trunk compiler (Linux) are:

   Pascal control case: 2.3 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 1.5 ns/call

The same machine, the same operating environment, but a 64-bit trunk compiler:

   Pascal control case: 3.6 ns/call
 Using LEA instruction: 0.9 ns/call
Using ADD instructions: 1.3 ns/call

I tried compiling and running the test with all of FPC 2.0.4, 2.2.4, 2.4.4, 2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and realized that all results (for both the assembler and Pascal versions) compiled with anything older than 3.2.2 are an order of magnitude faster than with 3.2.2 (i.e. less than 1 ns/call for the older versions compared to 8 ns/call with Pascal / 4 ns/call with assembler versions). This means that the comparison is obviously spoiled by something unrelated.

Moreover, I noticed that when compiling with the highest level of optimizations, the Pascal version compiled for i386 is as fast as, or even a little bit faster than, the assembler version. I didn't do that previously, thus the longer time for the older compiler version probably isn't relevant. From this point of view, it probably doesn't make sense to spend time on comparing the generated code.

Tomas
Re: [fpc-devel] LEA instruction speed
That's interesting; I am interested to see the assembly output for the Pascal control cases. As for the 64-bit version, that was my fault since the assembly language is for Microsoft's ABI rather than the System V ABI, so it was checking a register with an undefined value. Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call

program leatest;

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
{$ifdef MSWINDOWS}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
{$else MSWINDOWS}
  ADD EDI, $87654321
  ADD EDI, ESI
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop1
  MOV EAX, EDI
{$endif MSWINDOWS}
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
{$ifdef MSWINDOWS}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
{$else MSWINDOWS}
  LEA EDI, [EDI + ESI + $87654321]
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop2
  MOV EAX, EDI
{$endif MSWINDOWS}
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps =
Re: [fpc-devel] LEA instruction speed
On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:

Hi Kit,

> Do you think this should suffice? Originally it ran for 1,000,000 repetitions but I fear that will take way too long on a 486, so I reduced it to 10,000.

OK, I tried it now. First of all, after turning on the old machine, I realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad memory. :-( I compiled and ran the test under OS/2 there (I was too lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any substantial difference. The ADD and LEA results were basically the same there, both around 100 ns / call. The Pascal result was around twice as long. Interestingly, the Pascal result for FPC 3.2.2 was around 10% longer than the same source compiled with FPC 2.0.3 (the assembler versions were obviously the same for both FPC versions; I tried compiling it also with FPC 1.0.10 and the assembler versions were more than three times slower due to missing support for the nostackframe directive).

I tested it under the AMD Athlon 1 GHz machine as well and again, the results for LEA and ADD are basically equal (both 3.1 ns/call) and the result for Pascal slightly more than twice (7.3 ns/call). However, rather surprisingly for me, the overall test run was _much_ longer there?!

Finally, I tried compiling the test on a 64-bit machine (AMD A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the assembler version runs forever - well, certainly much longer than my patience lasts. I haven't tried to analyze the reasons, but that's what I get.

Tomas
Re: [fpc-devel] LEA instruction speed
Hi Tomas,

Do you think this should suffice? Originally it ran for 1,000,000 repetitions but I fear that will take way too long on a 486, so I reduced it to 10,000.

Kit

program leatest;

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    proc(Result, X, internal_reps);
    time := (Now - start) * SecsPerDay;
  until (reps >= 1);
  time := time / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;

  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
Re: [fpc-devel] LEA instruction speed
What should I call a new sub-CPU option? Should it be "ICELAKE" or is there a better name like "CORE10" or "COREX" (X being the Roman numeral for 10, standing in for the 10th generation of Intel Core)?

Kit
Re: [fpc-devel] LEA instruction speed
I don't think any of them currently fit. Zen 3 is later than Ice Lake, but I'm not sure whether it has a faster LEA or not. I'll do some investigation.

I'll take up Tomas' offer on the 486 test though. Personally I think the best test might actually be one of the recently-optimised cryptographic functions.

Kit

On 03/10/2023 08:02, Florian Klämpfl via fpc-devel wrote:
> On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:
>> Hi everyone,
>>
>> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow?
>
> Maybe check Agner’s list?
>
>> Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.
>>
>> Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?
>
> If it doesn’t fit in the existing ones, you can always add new ones.
>
>> Kit
Re: [fpc-devel] LEA instruction speed
> On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:
>
> Hi everyone,
>
> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow?

Maybe check Agner’s list?

> Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.
>
> Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones.

> Kit
Re: [fpc-devel] LEA instruction speed
Hmmm, could be fun to attempt to test - I'll see what I can set up.

Kit

On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:
>
> Hi Kit,
>
>> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow? Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.
>
> I cannot answer your question, but if you prepare a test program, I can run it on an Intel 486 DX2 100 MHz and an AMD Athlon 1 GHz machine if it helps you in any way (at least I hope the 486 DX2 machine should still be able to start ;-) ).
>
> Tomas
Re: [fpc-devel] LEA instruction speed
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:

Hi Kit,

> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow? Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it on an Intel 486 DX2 100 MHz and an AMD Athlon 1 GHz machine if it helps you in any way (at least I hope the 486 DX2 machine should still be able to start ;-) ).

Tomas
Re: [fpc-devel] LEA instruction speed
(And I meant "Ice Lake", not "Icy Lake".)

On 03/10/2023 02:32, J. Gareth Moreton via fpc-devel wrote:
> Hi everyone,
>
> This is mainly to Florian, but also to anyone else who can answer the question - at which point did a complex LEA instruction (using all three input operands and some other specific circumstances) get slow? Preliminary research suggests the 486 was when it gained extra latency, and then Sandy Bridge when it got particularly bad. Icy Lake seems to be the architecture where faster LEA instructions are reintroduced, but I'm not sure about AMD processors.
>
> Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?
>
> Kit
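To make the question concrete: a "complex" LEA here is one whose address has a base register, an index register and a displacement, as in the [EAX + EDX + $87654321] operand used by the test program in this thread. The sketch below is only a pure-Pascal model of what that single LEA and the equivalent pair of ADD instructions compute. It says nothing about their relative speed, and the names ViaLea and ViaAdds are hypothetical, not taken from the test.

```pascal
program LeaModel;
{$mode objfpc}
{$Q-}{$R-} // no overflow/range checks: 32-bit results wrap like the hardware

const
  Disp = LongWord($87654321); // displacement from the test's LEA operand

// Models LEA dst, [base + idx + Disp]:
// one instruction that sums all three address components.
function ViaLea(base, idx: LongWord): LongWord;
begin
  Result := base + idx + Disp;
end;

// Models the two-instruction alternative:
// ADD dst, idx followed by ADD dst, Disp.
function ViaAdds(base, idx: LongWord): LongWord;
begin
  Result := base;
  Result := Result + idx;
  Result := Result + Disp;
end;

begin
  // 500 and 1000 are the Z and X arguments the test passes to Benchmark.
  if ViaLea(500, 1000) = ViaAdds(500, 1000) then
    WriteLn('same result')
  else
    WriteLn('mismatch');
end.
```

Both forms give the same value, so the choice between them is purely a matter of latency and throughput on the target CPU, which is what the sub-CPU option is meant to capture.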