Re: [fpc-devel] LEA instruction speed

2023-10-27 Thread J. Gareth Moreton via fpc-devel

I should have figured.  Thank you!

Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] LEA instruction speed

2023-10-26 Thread Nikolay Nikolov via fpc-devel


On 10/11/23 11:21, Tomas Hajny via fpc-devel wrote:

On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you.  Would you be willing to share your modified test's
source? I was worried that if CPUID wasn't present it would cause a
SIGILL.


Sure, attached, but I didn't do anything special - I modified it in a 
way allowing easy disabling of this detection for x86 by disabling 
definition of a conditional symbol added to the source and I was 
prepared to recompile with the functionality disabled on the old AMD 
DX4 if needed. However, I didn't need to do so - the AMD DX4 machine 
simply ignored it and chose the branch used in case of missing support 
for the particular CPUID function. I have no idea if this might be due 
to some protection in OS/2 Warp 4 (used for compiling and running the 
test on that machine) potentially masking that exception, or what was 
the reason. Apparently, it should be possible to detect CPUID 
availability (albeit not 100% reliably), see 
https://wiki.osdev.org/CPUID, but I didn't use that.


There's CPUID support detection code in the Free Pascal RTL for i8086 
and i386. It's in unit cpu:


function cpuid_support: boolean;
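
A minimal sketch of how a test could guard its CPUID calls with that function, assuming an i386 build in which unit cpu exports cpuid_support as described above. The program name and the messages are made up for illustration:

```pascal
program cpuidcheck;

{$MODE OBJFPC}

{$IFDEF CPUI386}
uses
  cpu;  { RTL unit that provides cpuid_support on i386 }
{$ENDIF}

begin
{$IFDEF CPUI386}
  { Only execute CPUID-based detection when the RTL reports support }
  if cpuid_support then
    WriteLn('CPUID is available')
  else
    WriteLn('CPUID is not available; skipping brand string detection');
{$ELSE}
  WriteLn('Not an i386 build; this sketch does not apply here.');
{$ENDIF}
end.
```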

Nikolay







Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel
It was a thought that crossed my mind when Stefan pointed out the 
translated Google Benchmark, but given that it hasn't yet been adapted 
to work outside of i386 and x86_64, you are right that it probably 
shouldn't be used for the time being.  The framework uses CPU timings to 
decide how many iterations to run, so obtaining such metrics is 
essential for its operation.  While I can probably get it to work for 
aarch64-linux, I don't know the first thing about polling CPUs on 
platforms I don't have access to!  It was a nice experiment in the meantime.
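
To illustrate what "using timings to decide how many iterations to run" can look like, here is a rough sketch. It is not the algorithm used by Spring.Benchmark or Google Benchmark; the doubling strategy, the 50 ms threshold and all names (Work, CalibrateReps, Sink) are assumptions chosen only for the example:

```pascal
program adaptivereps;

{$MODE OBJFPC}

uses
  SysUtils;

type
  TWorkProc = procedure;

var
  Sink: LongWord = 0;  { global so the work is not optimised away entirely }

{ Hypothetical workload standing in for the code under test }
procedure Work;
var
  I: LongWord;
begin
  for I := 1 to 1000 do
    Sink := (Sink + I) xor I;
end;

{ Doubles the repetition count until one batch takes at least MinMs
  milliseconds, then returns that count }
function CalibrateReps(Proc: TWorkProc; MinMs: QWord): QWord;
var
  Start, I: QWord;
begin
  Result := 1;
  repeat
    Start := GetTickCount64;
    I := 0;
    while I < Result do
    begin
      Proc();
      Inc(I);
    end;
    if GetTickCount64 - Start >= MinMs then
      Exit;
    Result := Result * 2;
  until False;
end;

begin
  WriteLn('Repetitions chosen: ', CalibrateReps(@Work, 50));
end.
```

A scheme like this depends on a usable timer for the target, which is the part that has to be ported for each platform.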


With regard to "blea" being in the test suite, I haven't yet put it into 
the normal test suite (using an include wrapper like I did with 
"tests/bench/bcase.pp") since it's primarily there to evaluate LEA timings 
rather than to test compiler efficiency.  It's more of a 'utility' 
test.  The feedback from others has proven useful in determining the 
correctness of the new optimisation hint, which I intend to use to make 
the i386/x86_64 peephole optimizer smarter about using LEA 
instructions.


Kit



Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread Tomas Hajny via fpc-devel

On 2023-10-13 17:08, J. Gareth Moreton via fpc-devel wrote:

Interesting!  That's a bug report to send to the maintainers of the
framework.  I'll need to have them fix it before I'd be willing to try
again with its use in FPC.

Removed the reference.  Apologies - I'm rushing a bit.


BTW, it's IMHO questionable whether a benchmark framework restricted to 
just a subset of the targets supported for the given architecture should 
really be used within the FPC sources (if that's your potential intention 
anyway). If it were intended for the testsuite, it would immediately fail 
at compile time when checking the testsuite under some other target, 
because the test doesn't specify that its use should be restricted to 
certain targets (it's only restricted to i386 and x86_64 regardless of 
the operating system - that's fine for the original version, but not for 
the version using the benchmark framework).
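
As a sketch of what such a restriction could look like, the testsuite header can combine the existing %CPU directive with a %TARGET directive. The target list below is only a placeholder, since which operating systems the framework really supports is not established in this thread, and the program name is hypothetical:

```pascal
{ %CPU=i386,x86_64 }
{ %TARGET=win32,win64 }
program bleafw;

{ Placeholder target list above: an assumption for illustration only }

begin
  { the framework-based benchmark body would go here }
end.
```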


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

This one's for you Stefan!

https://github.com/spring4d/benchmark/issues/4

Kit



Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel
Interesting!  That's a bug report to send to the maintainers of the 
framework.  I'll need to have them fix it before I'd be willing to try 
again with its use in FPC.


Removed the reference.  Apologies - I'm rushing a bit.

Kit

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV  EAX, $80000000 { query the highest extended CPUID function }
  CPUID
  CMP  EAX, $80000004 { brand string needs $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10);
  { Elapsed days -> seconds -> nanoseconds, averaged per inner iteration and per call }
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$ifdef CPUX86_64}
  Write('64-bit');
{$else CPUX86_64}
  Write('32-bit');
{$endif CPUX86_64}
  if FillBrandName then
  begin
    WriteLn(' CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);
  
  FailureCode := 0;

  if (Results[0] <> Results[1]) then
begin
  

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread Tomas Hajny via fpc-devel

On 2023-10-13 16:25, J. Gareth Moreton via fpc-devel wrote:

GetLogicalProcessorInformation returns a Boolean - if false, an error
occurred, and is handled as follows:

DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation:
' + GetLastError.ToString);

GetLastError = 8 indicates "out of memory", which I will say is odd.

Nevertheless, because of such teething problems with the framework,
I'm going to remove it from "blea" for now.  As it stands, please find
attached the test that appears in the merge request:
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


The attached version still contained a reference to the framework.

The problem with 32-bit compilation of the framework was due to a 
missing stdcall calling convention in the GetLogicalProcessorInformation 
declaration.
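
For illustration, a declaration carrying the calling convention could look roughly like the sketch below. This is not the framework's actual source: the local name QueryProcessorInfo and the untyped Buffer parameter are assumptions made to keep the example short. On i386, FPC's default calling convention is register while the Win32 API uses stdcall, which fits the failure showing up only in the 32-bit build:

```pascal
program glpidecl;

{$MODE OBJFPC}

{$IFDEF WINDOWS}
uses
  Windows;

{ Imported under a hypothetical local name to avoid clashing with any
  declaration the Windows unit may already provide }
function QueryProcessorInfo(Buffer: Pointer; ReturnedLength: PDWORD): BOOL;
  stdcall; external 'kernel32' name 'GetLogicalProcessorInformation';

var
  Needed, Err: DWORD;
{$ENDIF}

begin
{$IFDEF WINDOWS}
  Needed := 0;
  { Sizing call: with no buffer the API is expected to fail and to
    report the required buffer size in Needed }
  if not QueryProcessorInfo(nil, @Needed) then
  begin
    Err := GetLastError;  { read the error code before any other API call }
    WriteLn('Sizing call failed with error ', Err, '; bytes needed: ', Needed);
  end;
{$ELSE}
  WriteLn('This sketch only applies to Windows targets.');
{$ENDIF}
end.
```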


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel
GetLogicalProcessorInformation returns a Boolean - if false, an error 
occurred, and is handled as follows:


DiagnoseAndExit('Failed during call to GetLogicalProcessorInformation: ' 
+ GetLastError.ToString);


GetLastError = 8 indicates "out of memory", which I will say is odd.

Nevertheless, because of such teething problems with the framework, I'm 
going to remove it from "blea" for now.  As it stands, please find 
attached the test that appears in the merge request: 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


Kit

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV  EAX, $80000000 { query the highest extended CPUID function }
  CPUID
  CMP  EAX, $80000004 { brand string needs $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X - 2023406815] {+$87654321 in decimal}
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  Write(name, ': ');
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10);
  { Elapsed days -> seconds -> nanoseconds, averaged per inner iteration and per call }
  time := ((((Now - start) * SecsPerDay) * 1e9) / internal_reps) / reps;
  WriteLn(time:0:(2 * ord(time < 10)), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA 

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel
Oops - that was a silly mistake of mine with R8.  As for the other 
error, that sounds like it's in the third-party benchmark suite.  I'll 
do some investigating on my virtual machine.


In the meantime, here's the fixed test with the stray R8 reference properly 
filtered out on i386 (it's replaced with "CPUName" on 32-bit).  I wasn't 
sure whether global variables were initialised or not, hence my playing it safe.


Kit

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils, Spring.Benchmark in 'spring/Spring.Benchmark.pp';
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef DETECTCPU}
function FillBrandName: Boolean; assembler; nostackframe;
asm
{$ifdef CPUX86_64}
  PUSH RBX
{$else CPUX86_64}
  PUSH EBX
{$endif CPUX86_64}
  MOV  EAX, $80000000 { query the highest extended CPUID function }
  CPUID
  CMP  EAX, $80000004 { brand string needs $80000002..$80000004 }
  JB   @Unavailable
{$ifdef CPUX86_64}
  LEA  R8,  [RIP + CPUName]
{$endif CPUX86_64}
  MOV  EAX, $80000002
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
{$else CPUX86_64}
  MOV  [CPUName], EAX
  MOV  [CPUName + 4], EBX
  MOV  [CPUName + 8], ECX
  MOV  [CPUName + 12], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000003
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
{$else CPUX86_64}
  MOV  [CPUName + 16], EAX
  MOV  [CPUName + 20], EBX
  MOV  [CPUName + 24], ECX
  MOV  [CPUName + 28], EDX
{$endif CPUX86_64}
  MOV  EAX, $80000004
  CPUID
{$ifdef CPUX86_64}
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0
{$else CPUX86_64}
  MOV  [CPUName + 32], EAX
  MOV  [CPUName + 36], EBX
  MOV  [CPUName + 40], ECX
  MOV  [CPUName + 44], EDX
  MOV  BYTE PTR [CPUName + 48], 0
{$endif CPUX86_64}
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
{$ifdef CPUX86_64}
  POP  RBX
{$else CPUX86_64}
  POP  EBX
{$endif CPUX86_64}
end;
{$else DETECTCPU}
function FillBrandName: Boolean; inline;
begin
  Result := False;  
end;
{$endif DETECTCPU}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X -2023406815] { -2023406815 = $87654321 }
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

const
  internal_reps = 1000;

procedure BM_Checksum_PAS(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_PAS(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_LEA(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_LEA(Z, X, internal_reps);
  end;
end;

procedure BM_Checksum_ADD(const State: TState);
var
  S: TState.TValue; Z, X: LongWord;
begin
  Z := 500;
  X := 1000;
  for S in State do
  begin
    Checksum_ADD(Z, X, internal_reps);
  end;
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
{$IFDEF CPUX86}
  WriteLn ('32 bits:');
{$ENDIF CPUX86}
{$IFDEF CPUX86_64}
  WriteLn ('64 bits:');
{$ENDIF CPUX86_64}
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
  

Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread Tomas Hajny via fpc-devel

On 2023-10-13 09:26, Tomas Hajny wrote:

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update.

 .
 .

The latest version of blea.pp doesn't compile with a 32-bit compiler -
line 76 contains an unconditional reference to the R8 register, which
obviously doesn't exist in 32-bit mode.


BTW, the line shouldn't be necessary at all, because global variables 
should be initialized to 0 on program start anyway as far as I know.


When fixing the problem above, compiling to 32-bit mode and running it, 
the test fails with an error in GetLogicalProcessorInformation (it 
states "8" in place of the error information; I wonder if it isn't 
misinterpreted, because 8 is the number of logical CPUs on the machine used 
for running the test).


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread Tomas Hajny via fpc-devel

On 2023-10-12 20:02, J. Gareth Moreton via fpc-devel wrote:

So an update.

 .
 .

The latest version of blea.pp doesn't compile with a 32-bit compiler - 
line 76 contains an unconditional reference to the R8 register, which 
obviously doesn't exist in 32-bit mode.


Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-13 Thread J. Gareth Moreton via fpc-devel

So an update.

I've added Spring.Benchmark to "tests/bench/spring" on my local branch, 
along with its readme and licence file.  It seems to work quite well 
even if it feels a bit like overkill for this small a benchmark.  Still, 
I've attached the version with Stefan's translated Google Benchmark unit 
to see what people think.  A couple of things to note:


 * Time metrics are now in thousands of nanoseconds because the 1,000
   repetitions of the internal loop (used to drown out the overhead of
   the function call) are no longer divided out.
 * Requires the fcl-base, rtl-objpas and regexpr packages.
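
To relate the framework's figures to the earlier "ns/call" output, the reported time per iteration can be divided by the inner loop count again. A small sketch, where the input value of 560 ns is made up for the example and PerCallNs is a hypothetical helper:

```pascal
program nspercall;

{$MODE OBJFPC}

const
  internal_reps = 1000;  { same inner loop count as in blea }

{ Converts a time reported for one batch of internal_reps calls
  into the time per single loop iteration }
function PerCallNs(const ReportedNs: Double): Double;
begin
  Result := ReportedNs / internal_reps;
end;

begin
  { 560 ns per batch is a made-up input value }
  WriteLn(PerCallNs(560.0):0:2, ' ns/call');
end.
```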

I also made a mistake with the compiler flags.  I had added 
CPUX86_HINT_FAST_3COMP_ADDR_16 to indicate that a LEA instruction with 
16-bit operands is fast, since the timing is often different to the 
32/64-bit versions.  However, under i386 and x86_64, the assembler 
doesn't accept 16-bit operands!  I have therefore removed it for i386 
and x86_64, although I left it in for i8086 (even though it probably 
won't be used) because the Pentium 4 has a slow 16-bit LEA instruction.


However, the proposed COREX CPU option now has the exact same flags as 
ZEN3.  Should it be removed, or kept for clarity and future expansion?


Kit

P.S. Sorry for the size of the ZIP.


Re: [fpc-devel] LEA instruction speed

2023-10-11 Thread Tomas Hajny via fpc-devel

On 2023-10-11 04:15, J. Gareth Moreton via fpc-devel wrote:

Sweet, thank you.  Would you be willing to share your modified test's
source? I was worried that if CPUID wasn't present it would cause a
SIGILL.


Sure, attached, but I didn't do anything special. I added a conditional 
symbol to the source so that this detection can easily be disabled for 
x86 by not defining it, and I was prepared to recompile with the 
functionality disabled on the old AMD DX4 if needed. However, I didn't 
need to do so: the AMD DX4 machine simply ignored it and took the branch 
used when the particular CPUID function is not supported. I have no idea 
whether this is due to some protection in OS/2 Warp 4 (used for compiling 
and running the test on that machine) potentially masking that exception, 
or whether there is some other reason. Apparently, it should be possible 
to detect CPUID availability (albeit not 100% reliably), see 
https://wiki.osdev.org/CPUID, but I didn't use that.
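
For reference, the usual detection technique is to test whether the ID bit (bit 21) of EFLAGS can be toggled. The following is only a sketch for i386, not code from the attached test; the function name CanToggleIDFlag is made up, and the check is subject to the same reliability caveat as above:

```pascal
program idflag;

{$MODE OBJFPC}
{$ASMMODE Intel}

{$IFDEF CPUI386}
{ Returns True if bit 21 (ID) of EFLAGS can be toggled, which indicates
  that the CPUID instruction is available }
function CanToggleIDFlag: Boolean; assembler; nostackframe;
asm
  PUSHFD                 { save the caller's flags }
  PUSHFD
  POP   EAX              { EAX = current EFLAGS }
  MOV   ECX, EAX
  XOR   EAX, $200000     { flip the ID bit }
  PUSH  EAX
  POPFD                  { try to write it back }
  PUSHFD
  POP   EAX              { read what the CPU actually kept }
  XOR   EAX, ECX         { non-zero only if the bit changed }
  SHR   EAX, 21
  AND   EAX, 1           { Boolean result in AL }
  POPFD                  { restore the caller's flags }
end;
{$ENDIF}

begin
{$IFDEF CPUI386}
  if CanToggleIDFlag then
    WriteLn('CPUID appears to be available')
  else
    WriteLn('CPUID does not appear to be available');
{$ELSE}
  WriteLn('Not an i386 build; this check is not needed here.');
{$ENDIF}
end.
```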


Tomas




On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps
me to make more informed choices on flags as well as confirming that
Agner Fog's instruction tables are correct. Also, results for older
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default
"Athlon" option lines up with this.  Of the Intel architectures, the
speed slows down on COREAVX onwards (COREI is fine), so I added a new
COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
the point where LEA is fast again (its 16-bit version is also fast,
unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan
provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.


Alright, fine (I modified your test to include the CPU name as well if 
possible and added an IFDEFed distinction of 32-bits versus 64-bits):


32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call


32-bits:
CPU = AMD Athlon(tm) Processor
--
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call


32-bits:
(AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas
{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{$DEFINE DETECTCPU}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
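function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX                     { CPUID overwrites RBX, which must be preserved }
  MOV  EAX, $80000000          { ask for the highest extended CPUID function }
  CPUID
  CMP  EAX, $80000004          { brand string needs $80000002..$80000004 }
  JB   @Unavailable
  LEA  R8,  [RIP + CPUName]
  MOV  EAX, $80000002          { brand string, characters 0..15 }
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003          { brand string, characters 16..31 }
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004          { brand string, characters 32..47 }
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0   { terminating null }
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
  POP  RBX
end;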
{$else CPUX86_64}
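function FillBrandName: Boolean; assembler; nostackframe;
asm
{$IFDEF DETECTCPU}
  push ebx               { CPUID overwrites EBX, which must be preserved }
  push esi               { ESI is callee-saved as well and is used below }

  mov eax, $80000000     { ask for the highest extended CPUID function }
  cpuid
  cmp eax, $80000004     { brand string needs 80000002h..80000004h }
  jb @not_supported

  lea esi, CPUName

  mov eax, 80000002h     { brand string, characters 0..15 }
  cpuid
  mov [esi], eax
  mov [esi+4], ebx
  mov [esi+8], ecx
  mov [esi+12], edx

  mov eax, 80000003h     { brand string, characters 16..31 }
  cpuid
  mov [esi+16], eax
  mov [esi+20], ebx
  mov [esi+24], ecx
  mov [esi+28], edx

  mov eax, 80000004h     { brand string, characters 32..47 }
  cpuid
  mov [esi+32], eax
  mov [esi+36], ebx
  mov [esi+40], ecx
  mov [esi+44], edx

  mov eax, 1
  jmp @exit

@not_supported:
  xor eax, eax

@exit:
  pop esi
  pop ebx
{$ELSE DETECTCPU}
  xor eax, eax
{$ENDIF DETECTCPU}
end;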
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV 

Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel
The LEA and ADD times are close enough that I can consider them 
identical.  And Braswell (the architecture behind that brand of Celeron) 
doesn't support AVX, I don't think, so that lines up with COREI having a 
fast LEA instruction but not COREAVX.


Given the many different x86-compatible CPUs, I wonder if we need to 
document the best compiler parameters for end users in some way (e.g. so 
it can be coded in a device driver installer so the most optimised 
binary can be installed for a given CPU architecture).


Kit

On 11/10/2023 05:56, Christo Crause wrote:

On Tue, Oct 10, 2023 at 11:13 AM J. Gareth Moreton via fpc-devel
 wrote:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I
had reduced it to 10,000 because, coupled with the 1,000 iterations in
the subroutines themselves, would have led to 1,000,000,000 passes and
hence would take in the region of five to ten minutes to complete for a
16 MHz 386, for example.  Rika's suggestion of running as many
iterations as needed until, say, 5 seconds elapses, would help but the
timing measurements would cause a lot of latency and will be imprecise
on very slow routines.  Still, let's see if 100,000 gives better results
for you.

Kit

Results on a modest CPU:

CPU =   Intel(R) Celeron(R) CPU  N3050  @ 1.60GHz
-
Pascal control case: 6.71 ns/call
  Using LEA instruction: 2.09 ns/call
Using ADD instructions: 2.05 ns/call

32 bits:
Pascal control case: 6.78 ns/call
  Using LEA instruction: 2.16 ns/call
Using ADD instructions: 2.09 ns/call

Results show a bit of variance, above numbers are more or less typical.

Christo




Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Christo Crause via fpc-devel
On Tue, Oct 10, 2023 at 11:13 AM J. Gareth Moreton via fpc-devel
 wrote:
>
> Thanks Tomas,
>
> Nothing is broken, but the timing measurement isn't precise enough.
>
> Normally I have a much higher iteration count (e.g. 1,000,000), but I
> had reduced it to 10,000 because, coupled with the 1,000 iterations in
> the subroutines themselves, would have led to 1,000,000,000 passes and
> hence would take in the region of five to ten minutes to complete for a
> 16 MHz 386, for example.  Rika's suggestion of running as many
> iterations as needed until, say, 5 seconds elapses, would help but the
> timing measurements would cause a lot of latency and will be imprecise
> on very slow routines.  Still, let's see if 100,000 gives better results
> for you.
>
> Kit

Results on a modest CPU:

CPU =   Intel(R) Celeron(R) CPU  N3050  @ 1.60GHz
-
   Pascal control case: 6.71 ns/call
 Using LEA instruction: 2.09 ns/call
Using ADD instructions: 2.05 ns/call

32 bits:
   Pascal control case: 6.78 ns/call
 Using LEA instruction: 2.16 ns/call
Using ADD instructions: 2.09 ns/call

Results show a bit of variance, above numbers are more or less typical.

Christo


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel
Sweet, thank you.  Would you be willing to share your modified test's 
source? I was worried that if CPUID wasn't present it would cause a SIGILL.


Kit

On 11/10/2023 01:47, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps
me to make more informed choices on flags as well as confirming that
Agner Fog's instruction tables are correct. Also, results for older
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default
"Athlon" option lines up with this.  Of the Intel architectures, the
speed slows down on COREAVX onwards (COREI is fine), so I added a new
COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
the point where LEA is fast again (its 16-bit version is also fast,
unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan
provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.


Alright, fine (I modified your test to include the CPU name as well if 
possible and added an IFDEFed distinction of 32-bits versus 64-bits):


32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call


32-bits:
CPU = AMD Athlon(tm) Processor
--
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call


32-bits:
(AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas







Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Tomas Hajny via fpc-devel

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:

I'm all for receiving results for all kinds of processor, as it helps
me to make more informed choices on flags as well as confirming that
Agner Fog's instruction tables are correct. Also, results for older
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default
"Athlon" option lines up with this.  Of the Intel architectures, the
speed slows down on COREAVX onwards (COREI is fine), so I added a new
COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
the point where LEA is fast again (its 16-bit version is also fast,
unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan
provided to see if it can and should be integrated.

Thanks again everyone for the results you're giving.


Alright, fine (I modified your test to include the CPU name as well if 
possible and added an IFDEFed distinction of 32-bits versus 64-bits):


32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-
   Pascal control case: 0.85 ns/call
 Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call


32-bits:
CPU = AMD Athlon(tm) Processor
--
   Pascal control case: 6.10 ns/call
 Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call


32-bits:
(AMD DX4 100 MHz - no CPUID name)
   Pascal control case: 123 ns/call
 Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas







Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel
I'm all for receiving results for all kinds of processor, as it helps me 
to make more informed choices on flags as well as confirming that Agner 
Fog's instruction tables are correct. Also, results for older 
processors can be hard to come by sometimes.


Currently, most architectures have a fast LEA, and the default "Athlon" 
option lines up with this.  Of the Intel architectures, the speed slows 
down on COREAVX onwards (COREI is fine), so I added a new COREX (for 
10th generation Core) option between ZEN2 and ZEN3 to mark the point 
where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).


In the meantime I'll be looking at the benchmarking code that Stefan 
provided to see if it can and should be integrated.


Thanks again everyone for the results you're giving.

Kit

P.S. Regarding the parallelism of running LEA instructions alongside 
other arithmetic/logical operations, that will be an interesting field 
of research.  At the very least, the post-peephole stage can change ADD 
or SUB into LEA if using an AGU instead of an ALU appears to give a 
micro-optimisation.  It also benefits hyperthreading, as the ALUs tend 
to be very heavily used, while AGUs tend to be used one at a time.
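
To make the ADD-versus-LEA substitution concrete, here is a small i386-only sketch. It assumes the register calling convention on i386 (first parameter in EAX, second in EDX, result in EAX); the function names and the constant 16 are made up for illustration and come from neither the compiler nor the blea test. Both routines compute A + B + 16; whether the LEA form really runs on an AGU depends on the microarchitecture.

```pascal
{ %CPU=i386 }
program leavsadd;

{$IFNDEF CPUI386}
  {$FATAL This sketch is for i386 only }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

{ Hypothetical example: two dependent ALU operations, flags are modified. }
function SumWithADD(A, B: LongWord): LongWord; register; assembler; nostackframe;
asm
  add eax, edx           { EAX = A + B }
  add eax, 16            { EAX = A + B + 16 }
end;

{ Hypothetical example: the same sum as one address calculation.
  LEA leaves the flags untouched. }
function SumWithLEA(A, B: LongWord): LongWord; register; assembler; nostackframe;
asm
  lea eax, [eax + edx + 16]
end;

begin
  WriteLn('ADD: ', SumWithADD(5, 7));
  WriteLn('LEA: ', SumWithLEA(5, 7));
end.
```

Both lines should print the same value, which is the property a peephole substitution relies on; the only observable difference is that the ADD form changes the flags, so the substitution is only safe when the flags are not used afterwards.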


On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:

On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:

Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but 
I had reduced it to 10,000 because, coupled with the 1,000 
iterations in the subroutines themselves, would have led to 
1,000,000,000 passes and hence would take in the region of five to 
ten minutes to complete for a 16 MHz 386, for example.  Rika's 
suggestion of running as many iterations as needed until, say, 5 
seconds elapses, would help but the timing measurements would cause 
a lot of latency and will be imprecise on very slow routines.  
Still, let's see if 100,000 gives better results for you.



I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):

   Pascal control case: 0.7 ns/call
 Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call


Indeed, it's much more consistent now, attached a new log for both 
32-bit and 64-bit versions from the Intel machine with Windows. 
Apparently, ADD is still somewhat faster on such "newer" Intel 
machines (at least if not considering the potential parallelism of LEA 
discussed previously). I can try this version on my AMD machines later 
tonight if considered useful - please, let me know which results would 
be relevant for you in that case (out of the ancient AMD DX4, only 
slightly less ancient AMD Athlon 1 GHz and the still rather reasonable 
AMD A9).


Tomas



Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Tomas Hajny via fpc-devel

On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:

Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I 
had reduced it to 10,000 because, coupled with the 1,000 iterations in 
the subroutines themselves, would have led to 1,000,000,000 passes and 
hence would take in the region of five to ten minutes to complete for 
a 16 MHz 386, for example.  Rika's suggestion of running as many 
iterations as needed until, say, 5 seconds elapses, would help but the 
timing measurements would cause a lot of latency and will be imprecise 
on very slow routines.  Still, let's see if 100,000 gives better 
results for you.



I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):

   Pascal control case: 0.7 ns/call
 Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call


Indeed, it's much more consistent now. I've attached a new log for both 
the 32-bit and 64-bit versions from the Intel machine with Windows. 
Apparently, ADD is still somewhat faster on such "newer" Intel machines 
(at least when not considering the potential parallelism of LEA discussed 
previously). I can try this version on my AMD machines later tonight if 
that's considered useful; please let me know which results would be 
relevant for you in that case (out of the ancient AMD DX4, the only 
slightly less ancient AMD Athlon 1 GHz and the still rather reasonable 
AMD A9).


Tomas
32-bit version, 10 runs in a row using a command shell for loop:

   Pascal control case: 0.85 ns/call
 Using LEA instruction: 1.11 ns/call
Using ADD instructions: 0.74 ns/call
   Pascal control case: 0.95 ns/call
 Using LEA instruction: 0.95 ns/call
Using ADD instructions: 0.81 ns/call
   Pascal control case: 0.91 ns/call
 Using LEA instruction: 0.98 ns/call
Using ADD instructions: 0.83 ns/call
   Pascal control case: 0.90 ns/call
 Using LEA instruction: 1.12 ns/call
Using ADD instructions: 0.78 ns/call
   Pascal control case: 0.87 ns/call
 Using LEA instruction: 1.03 ns/call
Using ADD instructions: 0.71 ns/call
   Pascal control case: 0.87 ns/call
 Using LEA instruction: 1.03 ns/call
Using ADD instructions: 0.79 ns/call
   Pascal control case: 0.81 ns/call
 Using LEA instruction: 1.20 ns/call
Using ADD instructions: 0.92 ns/call
   Pascal control case: 0.97 ns/call
 Using LEA instruction: 1.01 ns/call
Using ADD instructions: 0.74 ns/call
   Pascal control case: 0.92 ns/call
 Using LEA instruction: 0.99 ns/call
Using ADD instructions: 0.81 ns/call
   Pascal control case: 0.90 ns/call
 Using LEA instruction: 1.00 ns/call
Using ADD instructions: 0.77 ns/call


64-bit version, 10 runs in a row using a command shell for loop:

CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.04 ns/call
 Using LEA instruction: 1.09 ns/call
Using ADD instructions: 0.82 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.07 ns/call
 Using LEA instruction: 1.07 ns/call
Using ADD instructions: 0.71 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 0.98 ns/call
 Using LEA instruction: 1.07 ns/call
Using ADD instructions: 0.80 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.11 ns/call
 Using LEA instruction: 1.09 ns/call
Using ADD instructions: 0.75 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 0.98 ns/call
 Using LEA instruction: 1.02 ns/call
Using ADD instructions: 0.78 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.09 ns/call
 Using LEA instruction: 1.13 ns/call
Using ADD instructions: 0.69 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 0.98 ns/call
 Using LEA instruction: 1.11 ns/call
Using ADD instructions: 0.81 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 0.95 ns/call
 Using LEA instruction: 1.07 ns/call
Using ADD instructions: 0.71 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.04 ns/call
 Using LEA instruction: 1.01 ns/call
Using ADD instructions: 0.70 ns/call
CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 1.05 ns/call
 Using LEA instruction: 0.99 ns/call
Using ADD instructions: 0.71 ns/call

Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Marco van de Voort via fpc-devel


Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I 
had reduced it to 10,000 because, coupled with the 1,000 iterations in 
the subroutines themselves, would have led to 1,000,000,000 passes and 
hence would take in the region of five to ten minutes to complete for 
a 16 MHz 386, for example.  Rika's suggestion of running as many 
iterations as needed until, say, 5 seconds elapses, would help but the 
timing measurements would cause a lot of latency and will be imprecise 
on very slow routines.  Still, let's see if 100,000 gives better 
results for you.



I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):

   Pascal control case: 0.7 ns/call
 Using LEA instruction: 0.4 ns/call
Using ADD instructions: 0.7 ns/call



Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

Ooo, that might be just what we need.  Thank you Stefan.

Kit

On 10/10/2023 10:57, Stefan Glienke via fpc-devel wrote:

Be my guest in making https://github.com/spring4d/benchmark compatible with 
all the platforms you need it for.





Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Stefan Glienke via fpc-devel
Be my guest in making https://github.com/spring4d/benchmark compatible with 
all the platforms you need it for.

> On 10/10/2023 11:13 CEST J. Gareth Moreton via fpc-devel 
>  wrote:
> 
>  
> Thanks Tomas,
> 
> Nothing is broken, but the timing measurement isn't precise enough.
> 
> Normally I have a much higher iteration count (e.g. 1,000,000), but I 
> had reduced it to 10,000 because, coupled with the 1,000 iterations in 
> the subroutines themselves, would have led to 1,000,000,000 passes and 
> hence would take in the region of five to ten minutes to complete for a 
> 16 MHz 386, for example.  Rika's suggestion of running as many 
> iterations as needed until, say, 5 seconds elapses, would help but the 
> timing measurements would cause a lot of latency and will be imprecise 
> on very slow routines.  Still, let's see if 100,000 gives better results 
> for you.
> 
> Kit


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel
Looking at the text log, the results are a bit strange and I can't 
easily explain them.  Normally a system interrupt would increase the 
time taken.


Let me know if increasing the iteration count fixes it or not.

Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors. I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.


I don't know what's broken, but the results vary so much on a fast 
machine that they are unusable for any measurement from my point of 
view (standard 3.2.2 compiler, compiled with -O4 and running under MS 
Windows this time). Sometimes the ADD version shows 0.0 ns/call, 
sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call 
(64-bits). See the attached results (the CPU is only displayed for the 
64-bit compilation, but it's obviously the same CPU).


Tomas




On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the 
CPU automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$



Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread J. Gareth Moreton via fpc-devel

Thanks Tomas,

Nothing is broken, but the timing measurement isn't precise enough.

Normally I have a much higher iteration count (e.g. 1,000,000), but I 
had reduced it to 10,000 because 1,000,000, coupled with the 1,000 
iterations in the subroutines themselves, would have led to 
1,000,000,000 passes and hence would take in the region of five to ten 
minutes to complete on a 16 MHz 386, for example.  Rika's suggestion of 
running as many iterations as needed until, say, 5 seconds have elapsed 
would help, but the timing measurements themselves would add a lot of 
overhead and would be imprecise on very slow routines.  Still, let's see 
if 100,000 gives better results for you.
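
For illustration only, here is a minimal sketch of the time-bounded idea, 
not the code in the merge request.  With 10,000 x 1,000 = 10,000,000 
iterations at roughly 1 ns each, a whole run lasts only about 10 ms, which 
can be close to the granularity of the system clock.  The sketch runs whole 
batches of calls and reads the clock once per batch, so the clock reads stay 
out of the measured inner loop.  TimedBenchmark, MinMillis and the batch 
size are hypothetical names and values; GetTickCount64 is assumed to be 
available from SysUtils on the target.

```pascal
program timedbench;

{$MODE OBJFPC}
{$Q-}{$R-}  // the checksum relies on wrap-around arithmetic

uses
  SysUtils;

type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

{ Same control routine as in blea.pp }
function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

{ Hypothetical variant of Benchmark: runs whole batches until at least
  MinMillis milliseconds have passed, reading the clock once per batch }
function TimedBenchmark(const name: string; proc: TBenchmarkProc;
  Z, X: LongWord; MinMillis: QWord): LongWord;
const
  internal_reps = 1000; // loop iterations inside each call, as in blea.pp
  batch = 100;          // calls between clock reads (arbitrary choice)
var
  start, elapsed, reps: QWord;
  i: Integer;
  ns: Double;
begin
  Result := Z;
  reps := 0;
  start := GetTickCount64;
  repeat
    for i := 1 to batch do
      Result := proc(Result, X, internal_reps);
    Inc(reps, batch);
    elapsed := GetTickCount64 - start;
  until elapsed >= MinMillis;
  { milliseconds -> nanoseconds, then divide by the total iteration count }
  ns := (elapsed * 1.0e6) / reps / internal_reps;
  WriteLn(name, ': ', ns:0:2, ' ns/call (', reps, ' calls)');
end;

begin
  TimedBenchmark('   Pascal control case', @Checksum_PAS, 500, 1000, 2000);
end.
```

One caveat with this approach: the number of calls now differs between 
variants, so the returned checksums can no longer be compared with each 
other.  A correctness check would need a separate pass with a fixed count.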


Kit

On 10/10/2023 09:57, Tomas Hajny wrote:

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors. I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.


I don't know what's broken, but the results vary so much on a fast 
machine that they are unusable for any measurement from my point of 
view (standard 3.2.2 compiler, compiled with -O4 and running under MS 
Windows this time). Sometimes the ADD version shows 0.0 ns/call, 
sometimes the LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call 
(64-bits). See the attached results (the CPU is only displayed for the 
64-bit compilation, but it's obviously the same CPU).


Tomas




On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the 
CPU automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$


Re: [fpc-devel] LEA instruction speed

2023-10-10 Thread Tomas Hajny via fpc-devel

On 2023-10-09 20:51, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


I updated the "blea" test in the merge request so it now displays the
processor brand name on x86_64; however, it is not fetched under i386
because CPUID was not introduced until later 486 processors.  I've
attached it to this e-mail if anyone wants to take a look to ensure I
haven't broken something.


I don't know what's broken, but the results vary so much on a fast 
machine that they are unusable for any measurement from my point of view 
(standard 3.2.2 compiler, compiled with -O4 and running under MS Windows 
this time). Sometimes the ADD version shows 0.0 ns/call, sometimes the 
LEA version shows 0.0 ns/call (32-bits) or 0.1 ns/call (64-bits). See 
the attached results (the CPU is only displayed for the 64-bit 
compilation, but it's obviously the same CPU).


Tomas




On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU 
automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

32-bit version, 10 runs in a row using a command shell for cycle, LEA before 
ADD (original version):

   Pascal control case: 0.9 ns/call
 Using LEA instruction: 0.0 ns/call
Using ADD instructions: 1.6 ns/call
   Pascal control case: 0.4 ns/call
 Using LEA instruction: 1.5 ns/call
Using ADD instructions: 0.0 ns/call
   Pascal control case: 0.1 ns/call
 Using LEA instruction: 1.6 ns/call
Using ADD instructions: 1.2 ns/call
   Pascal control case: 0.2 ns/call
 Using LEA instruction: 1.8 ns/call
Using ADD instructions: 0.0 ns/call
   Pascal control case: 0.2 ns/call
 Using LEA instruction: 1.0 ns/call
Using ADD instructions: 1.6 ns/call
   Pascal control case: 0.2 ns/call
 Using LEA instruction: 1.6 ns/call
Using ADD instructions: 0.0 ns/call
   Pascal control case: 0.2 ns/call
 Using LEA instruction: 1.8 ns/call
Using ADD instructions: 0.0 ns/call
   Pascal control case: 0.1 ns/call
 Using LEA instruction: 1.6 ns/call
Using ADD instructions: 0.8 ns/call
   Pascal control case: 1.1 ns/call
 Using LEA instruction: 0.1 ns/call
Using ADD instructions: 1.6 ns/call
   Pascal control case: 0.2 ns/call
 Using LEA instruction: 1.5 ns/call
Using ADD instructions: 0.0 ns/call


64-bit version, 10 runs in a row using a command shell for cycle, LEA before 
ADD (original version):

CPU = Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
---
   Pascal control case: 0.6 ns/call
 Using LEA instruction: 

Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread Jean SUZINEAU via fpc-devel

My results on Windows :

E:\temp>C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.exe -MObjFPC -Scghi 
-O1 -g -gl -l -vewnhibq -Fu. -FUlib\x86_64-win64 -FE. -oblea.exe blea.pp
Hint: (11030) Start of reading config file 
C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.cfg
Hint: (11031) End of reading config file 
C:\lazarus\fpc\3.2.2\bin\x86_64-win64\fpc.cfg

Free Pascal Compiler version 3.2.2 [2022/09/24] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
(1002) Target OS: Win64 for x64
(3104) Compiling blea.pp
(9015) Linking .\blea.exe
(1008) 150 lines compiled, 0.2 sec, 87840 bytes code, 5364 bytes data
(1022) 2 hint(s) issued

E:\temp>blea
CPU = Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
-
   Pascal control case: 1.9 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 0.9 ns/call

E:\temp>




Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel
I updated the "blea" test in the merge request so it now displays the 
processor brand name on x86_64; however, it is not fetched under i386 
because CPUID was not introduced until later 486 processors.  I've 
attached it to this e-mail if anyone wants to take a look to ensure I 
haven't broken something.
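
As a rough sketch of how the brand name could also be fetched on i386 
without risking an invalid-opcode fault on CPUs that predate CPUID: the 
cpuid_support function from the RTL's cpu unit (mentioned elsewhere in this 
thread) can guard the instruction.  CpuIdLeaf is a hypothetical helper, and 
the code assumes the default register calling convention on i386.  It has 
not been tried on real 386/486 hardware.

```pascal
program brandname386;

{$MODE OBJFPC}
{$ASMMODE Intel}

{$ifdef CPUI386}
uses
  cpu; // provides cpuid_support on i386
{$endif CPUI386}

var
  CPUName: array[0..48] of Char;

{$ifdef CPUI386}
{ Hypothetical helper: runs CPUID for one leaf and stores EAX, EBX, ECX and
  EDX at Dest.  With the register convention, Leaf arrives in EAX and Dest
  in EDX; EBX and EDI must be preserved for the caller. }
procedure CpuIdLeaf(Leaf: LongWord; Dest: Pointer); register; assembler; nostackframe;
asm
  PUSH EBX
  PUSH EDI
  MOV  EDI, EDX          { CPUID overwrites EDX, so keep Dest in EDI }
  CPUID
  MOV  [EDI], EAX
  MOV  [EDI + 4], EBX
  MOV  [EDI + 8], ECX
  MOV  [EDI + 12], EDX
  POP  EDI
  POP  EBX
end;

function FillBrandName: Boolean;
var
  Regs: array[0..3] of LongWord;
begin
  Result := False;
  if not cpuid_support then
    Exit;                          // CPU has no CPUID instruction at all
  CpuIdLeaf($80000000, @Regs[0]);  // highest extended leaf is returned in EAX
  if Regs[0] < $80000004 then
    Exit;                          // brand string leaves not implemented
  CpuIdLeaf($80000002, @CPUName[0]);
  CpuIdLeaf($80000003, @CPUName[16]);
  CpuIdLeaf($80000004, @CPUName[32]);
  CPUName[48] := #0;
  Result := True;
end;
{$endif CPUI386}

begin
{$ifdef CPUI386}
  if FillBrandName then
    WriteLn('CPU = ', CPUName)
  else
    WriteLn('CPU brand string not available');
{$else}
  WriteLn('This sketch only targets i386');
{$endif}
end.
```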


Kit

On 09/10/2023 18:01, J. Gareth Moreton via fpc-devel wrote:
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request 
(i.e. it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU 
automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;

var
  CPUName: array[0..48] of Char;

{$ifdef CPUX86_64}
{ Copies the 48-byte processor brand string (CPUID leaves $80000002 to
  $80000004) into CPUName; returns False if those leaves are unavailable }
function FillBrandName: Boolean; assembler; nostackframe;
asm
  PUSH RBX                    { CPUID overwrites RBX, which is callee-saved }
  MOV  EAX, $80000000         { Query the highest extended CPUID leaf }
  CPUID
  CMP  EAX, $80000004
  JB   @Unavailable
  LEA  R8,  [RIP + CPUName]
  MOV  EAX, $80000002
  CPUID
  MOV  [R8], EAX
  MOV  [R8 + 4], EBX
  MOV  [R8 + 8], ECX
  MOV  [R8 + 12], EDX
  MOV  EAX, $80000003
  CPUID
  MOV  [R8 + 16], EAX
  MOV  [R8 + 20], EBX
  MOV  [R8 + 24], ECX
  MOV  [R8 + 28], EDX
  MOV  EAX, $80000004
  CPUID
  MOV  [R8 + 32], EAX
  MOV  [R8 + 36], EBX
  MOV  [R8 + 40], ECX
  MOV  [R8 + 44], EDX
  MOV  BYTE PTR [R8 + 48], 0  { Null terminator }
  MOV  AL,  1
  JMP  @ExitBrand
@Unavailable:
  XOR  AL,  AL
@ExitBrand:
  POP  RBX
end;
{$else CPUX86_64}
function FillBrandName: Boolean; inline;
begin
  Result := False;
end;
{$endif CPUX86_64}

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Counter;
    Dec(Counter);
  end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 10000);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode, X: Integer;
begin
  if FillBrandName then
  begin
    WriteLn('CPU = ', CpuName);
    X := 0;
    while CpuName[X] <> #0 do
    begin
      CpuName[X] := '-';
      Inc(X);
    end;
    WriteLn('--', CpuName);
  end;
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
  begin
    WriteLn('ERROR: Checksum_LEA doesn''t match control case');
    FailureCode := FailureCode or 1;
  end;
  if (Results[0] <> Results[2]) then
  begin
    WriteLn('ERROR: Checksum_ADD doesn''t match control case');
    FailureCode := FailureCode or 2
  end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.

Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel
Thank you very much!  That processor is built on the Excavator 
architecture and lines up with the flag I put in the merge request (i.e. 
it has the "fast LEA" hint).


I honestly didn't expect this much testing feedback, so thank you all!

Gareth aka. Kit

P.S. I'm tempted to extend the test slightly to actually name the CPU 
automatically.


On 09/10/2023 15:40, Jean SUZINEAU via fpc-devel wrote:

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$



Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread Jean SUZINEAU via fpc-devel

My results:
jean@First-Boss:~/temp$ cat /proc/cpuinfo | grep "model name"
model name    : AMD A6-7480 Radeon R5, 8 Compute Cores 2C+6G
jean@First-Boss:~/temp$ /usr/bin/fpc blea.pp
Free Pascal Compiler version 3.2.2 [2021/07/09] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling blea.pp
Linking blea
95 lines compiled, 0.2 sec
jean@First-Boss:~/temp$ ./blea
   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.8 ns/call
jean@First-Boss:~/temp$



Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread J. Gareth Moreton via fpc-devel

Thank you for the report.

According to Agner Fog's table, complex LEA instructions should have a 
3-cycle latency on that architecture (Haswell). Optimisations with this 
instruction are proving interesting because there's such a variety 
between processor architectures. There are some that are fine with 3 
components, but slow right down if a scale factor is used.
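
To make the reasoning behind these latency estimates concrete, here is a 
back-of-the-envelope sketch.  It rests on assumptions that are not stated 
in the thread: both assembly loops are limited by the dependency chain 
through Input, and ADD and XOR each take one cycle.  The ADD loop is then 
ADD + ADD + XOR = 3 cycles per iteration, the LEA loop is LEA + XOR, and the 
ratio of the two timings gives a rough LEA latency.  EstimateLeaLatency is a 
hypothetical helper; the ns/call figures are the ones reported in this 
thread.

```pascal
program lealatency;

{$MODE OBJFPC}

{ Assumption: the ADD loop costs 3 cycles per iteration and the LEA loop
  costs (L + 1) cycles, where L is the LEA latency being estimated }
function EstimateLeaLatency(LeaNs, AddNs: Double): Double;
const
  AddLoopCycles = 3.0;
begin
  { AddNs / AddLoopCycles is the duration of one cycle in nanoseconds }
  Result := LeaNs / (AddNs / AddLoopCycles) - 1.0;
end;

procedure Report(const CPU: string; LeaNs, AddNs: Double);
begin
  WriteLn(CPU, ': estimated LEA latency = ',
    EstimateLeaLatency(LeaNs, AddNs):0:1, ' cycles');
end;

begin
  { ns/call figures as reported in this thread }
  Report('Core i5-4200U', 4.2, 4.0);
  Report('Core i7-10750H', 1.7, 1.3);
  Report('Core i5-8500', 1.2, 0.9);
  Report('AMD A6-7480', 0.5, 0.8);
end.
```

Under these assumptions the i5-4200U figures come out at roughly 2 cycles, 
the two newer Intel parts at roughly 3, and the AMD A6-7480 at roughly 1.  
The inputs only have one decimal place, so the estimates are coarse, 
especially for the sub-nanosecond timings.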


Kit

On 09/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi Gareth

model name : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

Regards

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326



Re: [fpc-devel] LEA instruction speed

2023-10-09 Thread Nataraj S Narayan via fpc-devel
Hi Gareth

model name : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

Regards

Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sun, Oct 8, 2023 at 6:40 PM J. Gareth Moreton via fpc-devel <
fpc-devel@lists.freepascal.org> wrote:

> Hi Nataraj
>
> Which processor is that run on? (although too close to call, it implies
> LEA has a latency of 2 in that case)
>
> Kit
> On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:
>
> Hi
>
> [nataraj@dflyHP ~]$ fpc ttt.pas
> Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
> Copyright (c) 1993-2021 by Florian Klaempfl and others
> Target OS: DragonFly for x86-64
> Compiling ttt.pas
> Linking ttt
> /usr/local/bin/ld.bfd: warning:
> /usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing
> .note.GNU-stack section implies executable stack
> /usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be
> removed in a future version of the linker
> 121 lines compiled, 14.9 sec
> [nataraj@dflyHP ~]$ ./ttt
>    Pascal control case: 6.7 ns/call
>  Using LEA instruction: 4.2 ns/call
> Using ADD instructions: 4.0 ns/call
>
>
> Nataraj S Narayan
> Synergy Info Systems
> Software & Technology Consultants
> Ettumanoor, INDIA
> Ph:+91 9443211326
>
>
> On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel <
> fpc-devel@lists.freepascal.org> wrote:
>
>> That's interesting; I am interested to see the assembly output for the
>> Pascal control cases.  As for the 64-bit version, that was my fault
>> since the assembly language is for Microsoft's ABI rather than the
>> System V ABI, so it was checking a register with an undefined value.
>> Find attached the fixed test.
>>
>> Kit
>>
>> P.S. Results on my Intel(R) Core(TM) i7-10750H
>>
>>    Pascal control case: 2.0 ns/call
>>  Using LEA instruction: 1.7 ns/call
>> Using ADD instructions: 1.3 ns/call
>>
>> On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
>> > On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>> >
>> >
>> > Hi Kit,
>> >
>> >> Do you think this should suffice? Originally it ran for 1,000,000
>> >> repetitions but I fear that will take way too long on a 486, so I
>> >> reduced it to 10,000.
>> >
>> > OK, I tried it now. First of all, after turning on the old machine, I
>> > realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad
>> > memory. :-( I compiled and ran the test under OS/2 there (I was too
>> > lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any
>> > substantial difference. The ADD and LEA results were basically the
>> > same there, both around 100 ns / call. The Pascal result was around
>> > twice as long. Interestingly, the Pascal result for FPC 3.2.2 was
>> > around 10% longer than the same source compiled with FPC 2.0.3 (the
>> > assembler versions were obviously the same for both FPC versions; I
>> > tried compiling it also with FPC 1.0.10 and the assembler versions
>> > were more than three times slower due to missing support for the
>> > nostackframe directive).
>> >
>> > I tested it under the AMD Athlon 1 GHz machine as well and again, the
>> > results for LEA and ADD are basically equal (both 3.1 ns/call) and the
>> > result for Pascal slightly more than twice (7.3 ns/call). However,
>> > rather surprisingly for me, the overall test run was _much_ longer
>> > there?! Finally, I tried compiling the test on a 64-bit machine (AMD
>> > A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from
>> > a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the
>> > assembler version runs forever - well, certainly much longer than my
>> > patience lasts. I haven't tried to analyze the reasons, but that's
>> > what I get.
>> >
>> > Tomas
>> >
>> >
>> >
>> >>
>> >> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>> >>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel"
>> >>>  wrote:
>> >>>
>> >>>
>> >>> Hi Kit,
>> >>>
>> >>>> This is mainly to Florian, but also to anyone else who can answer
>> >>>> the question - at which point did a complex LEA instruction (using
>> >>>> all three input operands and some other specific circumstances) get
>> >>>> slow? Preliminary research suggests the 486 was when it gained
>> >>>> extra latency, and then Sandy Bridge when it got particularly bad.
>> >>>> Ice Lake seems to be the architecture where faster LEA instructions
>> >>>> are reintroduced, but I'm not sure about AMD processors.
>> >>> I cannot answer your question, but if you prepare a test program, I
>> >>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines
>> >>> if it helps you in any way (at least I hope the 486 DX2 machine
>> >>> should be still able to start ;-) ).
>> >>>
>> >>> Tomas
>> >>>
>> >>> ___
>> >>> fpc-devel maillist  -  fpc-devel@lists.freepascal.org
>> >>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>> >>>
>> >> 

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread Tomas Hajny via fpc-devel

On 2023-10-08 13:45, J. Gareth Moreton via fpc-devel wrote:



Sorry, ignore last attachment - I forgot to change a line of assembly
(it was correct for x86_64-win64!!). Here is the corrected version.


Alright, results for this version for AMD A9 9425 under Linux (the same 
trunk compiler as used yesterday):


64-bit version:

   Pascal control case: 5.1 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.9 ns/call


32-bit version:

   Pascal control case: 0.9 ns/call
 Using LEA instruction: 0.5 ns/call
Using ADD instructions: 0.9 ns/call

Tomas


Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel
Did some checking of the test I copied the code from, and I forgot that 
Rika's original code only exited once a certain time period had elapsed 
(e.g. 0.5 seconds).  I had changed it to a standard iteration count 
since I was concerned about fairness and accuracy, but I only changed 
the loop condition and nothing else.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to the 
execution time per call.

2. The Pascal version does not match the assembler version. I had to fix it:
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;

3. The assembler functions can be unified to work under Win64, Win32, 
Linux 64 and Linux 32:

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results on a Ryzen 2700X:

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around the LEA without the results changing, but only 2 
around the ADD.




Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel
In the meantime, here's the merge request for the feature based on user 
tests and studying of Agner Fog's instruction tables: 
https://gitlab.com/freepascal.org/fpc/source/-/merge_requests/502


Kit



Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel

Hi Nataraj

Which processor is that run on? (although too close to call, it implies 
LEA has a latency of 2 in that case)


Kit

On 08/10/2023 14:06, Nataraj S Narayan via fpc-devel wrote:

Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning: 
/usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing 
.note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be 
removed in a future version of the linker

121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
   Pascal control case: 6.7 ns/call
 Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call


Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel 
 wrote:


That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

    Pascal control case: 2.0 ns/call
  Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
> On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>
>
> Hi Kit,
>
>> Do you think this should suffice? Originally it ran for 1,000,000
>> repetitions but I fear that will take way too long on a 486, so I
>> reduced it to 10,000.
>
> OK, I tried it now. First of all, after turning on the old machine, I
> realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad
> memory. :-( I compiled and ran the test under OS/2 there (I was too
> lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any
> substantial difference. The ADD and LEA results were basically the
> same there, both around 100 ns / call. The Pascal result was around
> twice as long. Interestingly, the Pascal result for FPC 3.2.2 was
> around 10% longer than the same source compiled with FPC 2.0.3 (the
> assembler versions were obviously the same for both FPC versions; I
> tried compiling it also with FPC 1.0.10 and the assembler versions
> were more than three times slower due to missing support for the
> nostackframe directive).
>
> I tested it under the AMD Athlon 1 GHz machine as well and again, the
> results for LEA and ADD are basically equal (both 3.1 ns/call) and the
> result for Pascal slightly more than twice (7.3 ns/call). However,
> rather surprisingly for me, the overall test run was _much_ longer
> there?! Finally, I tried compiling the test on a 64-bit machine (AMD
> A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from
> a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the
> assembler version runs forever - well, certainly much longer than my
> patience lasts. I haven't tried to analyze the reasons, but that's
> what I get.
>
> Tomas
>
>
>
>>
>> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel"
>>>  wrote:
>>>
>>>
>>> Hi Kit,
>>>
>>>> This is mainly to Florian, but also to anyone else who can answer
>>>> the question - at which point did a complex LEA instruction (using
>>>> all three input operands and some other specific circumstances) get
>>>> slow? Preliminary research suggests the 486 was when it gained
>>>> extra latency, and then Sandy Bridge when it got particularly bad.
>>>> Ice Lake seems to be the architecture where faster LEA instructions
>>>> are reintroduced, but I'm not sure about AMD processors.
>>> I cannot answer your question, but if you prepare a test program, I
>>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines
>>> if it helps you in any way (at least I hope the 486 DX2 machine
>>> should be still able to start ;-) ).
>>>
>>> Tomas
>>>
>>> ___
>>> fpc-devel maillist  - fpc-devel@lists.freepascal.org
>>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>>
>> ___
>> fpc-devel maillist  - fpc-devel@lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> ___
> fpc-devel maillist  - 

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread Nataraj S Narayan via fpc-devel
Hi

[nataraj@dflyHP ~]$ fpc ttt.pas
Free Pascal Compiler version 3.2.2 [2023/07/04] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: DragonFly for x86-64
Compiling ttt.pas
Linking ttt
/usr/local/bin/ld.bfd: warning:
/usr/local/lib/fpc/3.2.2/units/x86_64-dragonfly/rtl/prt0.o: missing
.note.GNU-stack section implies executable stack
/usr/local/bin/ld.bfd: NOTE: This behaviour is deprecated and will be
removed in a future version of the linker
121 lines compiled, 14.9 sec
[nataraj@dflyHP ~]$ ./ttt
   Pascal control case: 6.7 ns/call
 Using LEA instruction: 4.2 ns/call
Using ADD instructions: 4.0 ns/call


Nataraj S Narayan
Synergy Info Systems
Software & Technology Consultants
Ettumanoor, INDIA
Ph:+91 9443211326


On Sat, Oct 7, 2023 at 9:39 PM J. Gareth Moreton via fpc-devel <
fpc-devel@lists.freepascal.org> wrote:

> That's interesting; I am interested to see the assembly output for the
> Pascal control cases.  As for the 64-bit version, that was my fault
> since the assembly language is for Microsoft's ABI rather than the
> System V ABI, so it was checking a register with an undefined value.
> Find attached the fixed test.
>
> Kit
>
> P.S. Results on my Intel(R) Core(TM) i7-10750H
>
>    Pascal control case: 2.0 ns/call
>  Using LEA instruction: 1.7 ns/call
> Using ADD instructions: 1.3 ns/call
>
> On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
> > On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
> >
> >
> > Hi Kit,
> >
> >> Do you think this should suffice? Originally it ran for 1,000,000
> >> repetitions but I fear that will take way too long on a 486, so I
> >> reduced it to 10,000.
> >
> > OK, I tried it now. First of all, after turning on the old machine, I
> > realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad
> > memory. :-( I compiled and ran the test under OS/2 there (I was too
> > lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any
> > substantial difference. The ADD and LEA results were basically the
> > same there, both around 100 ns / call. The Pascal result was around
> > twice as long. Interestingly, the Pascal result for FPC 3.2.2 was
> > around 10% longer than the same source compiled with FPC 2.0.3 (the
> > assembler versions were obviously the same for both FPC versions; I
> > tried compiling it also with FPC 1.0.10 and the assembler versions
> > were more than three times slower due to missing support for the
> > nostackframe directive).
> >
> > I tested it under the AMD Athlon 1 GHz machine as well and again, the
> > results for LEA and ADD are basically equal (both 3.1 ns/call) and the
> > result for Pascal slightly more than twice (7.3 ns/call). However,
> > rather surprisingly for me, the overall test run was _much_ longer
> > there?! Finally, I tried compiling the test on a 64-bit machine (AMD
> > A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from
> > a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the
> > assembler version runs forever - well, certainly much longer than my
> > patience lasts. I haven't tried to analyze the reasons, but that's
> > what I get.
> >
> > Tomas
> >
> >
> >
> >>
> >> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
> >>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel"
> >>>  wrote:
> >>>
> >>>
> >>> Hi Kit,
> >>>
> >>>> This is mainly to Florian, but also to anyone else who can answer
> >>>> the question - at which point did a complex LEA instruction (using
> >>>> all three input operands and some other specific circumstances) get
> >>>> slow? Preliminary research suggests the 486 was when it gained
> >>>> extra latency, and then Sandy Bridge when it got particularly bad.
> >>>> Ice Lake seems to be the architecture where faster LEA instructions
> >>>> are reintroduced, but I'm not sure about AMD processors.
> >>> I cannot answer your question, but if you prepare a test program, I
> >>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines
> >>> if it helps you in any way (at least I hope the 486 DX2 machine
> >>> should be still able to start ;-) ).
> >>>
> >>> Tomas
> >>>
> >>> ___
> >>> fpc-devel maillist  -  fpc-devel@lists.freepascal.org
> >>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> >>>
> >> ___
> >> fpc-devel maillist  -  fpc-devel@lists.freepascal.org
> >> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> > ___
> > fpc-devel maillist  -  fpc-devel@lists.freepascal.org
> > https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> >___
> fpc-devel maillist  -  fpc-devel@lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org

Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel
Sorry, ignore last attachment - I forgot to change a line of assembly 
(it was correct for x86_64-win64!!). Here is the corrected version.


Kit

On 08/10/2023 12:38, J. Gareth Moreton via fpc-devel wrote:
Sorry, I got careless and was in a rush: the Pascal code was wrong, and 
I didn't store the result of the benchmark test, so the error check at 
the end returned a false negative.


The benchmark code was from Rika's SHA-1 test code, which I didn't 
properly check, although I assumed the logic was to avoid counting the 
time of the internal loop as much as possible.  I should have gone 
with my gut instinct and realised that wasn't the best method.


I've attached the updated test (now called "blea" as it's a benchmark 
test) with your suggestions implemented, and an improved benchmarking 
system.  I'm not used to specifying parameters in place of registers - 
I'm too used to needing total control!


Your results from experiments with adding additional ADD instructions 
are expected, as LEA uses an AGU for computation, leaving the ALUs free 
for other tasks (like ADD), so LEA is better even if speed is equal.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to 
the execution time per call.

2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, 
linux 64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; 
assembler; nostackframe;

asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around LEA while the results didn't change, but only 2 
around ADD.


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV Result, Input
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): 
LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 
1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 
1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 
1000);
  
  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
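
As a side note on the timing approach in Benchmark above, here is a minimal sketch of the same idea (read the clock once before the loop and once after it, and keep the result of every call). It is not the attached program: it assumes SysUtils.GetTickCount64 as the clock instead of Now, uses a plain Pascal stand-in routine called Work, and the OuterReps value is a hypothetical choice rather than a figure from the thread.

```pascal
program timingsketch;
{$MODE OBJFPC}
{$Q-}{$R-}  // the checksum relies on wrap-around arithmetic

uses
  SysUtils;

const
  OuterReps = 100000;  // hypothetical repetition count, not from the thread
  InnerReps = 1000;    // same value as internal_reps in the program above

// Plain Pascal stand-in for the routine being measured (hypothetical name).
function Work(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Y;
    Dec(Y);
  end;
end;

var
  StartTick, ElapsedMs: QWord;
  I: LongInt;
  Acc: LongWord;
begin
  Acc := 500;
  StartTick := GetTickCount64;              // read the clock once, before the loop
  for I := 1 to OuterReps do
    Acc := Work(Acc, 1000, InnerReps);      // feed the result back in so it is used
  ElapsedMs := GetTickCount64 - StartTick;  // and once more, after the loop

  WriteLn('checksum: ', Acc);
  // 1 ms = 1e6 ns; divide by the total number of inner iterations
  WriteLn((ElapsedMs * 1e6 / (OuterReps * InnerReps)):0:2, ' ns/call');
end.
```

The tick counter has millisecond resolution at best, so the total run needs to last much longer than that for the per-call figure to mean anything.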
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread J. Gareth Moreton via fpc-devel
Sorry, I got careless and was in a rush: the Pascal code was wrong, and 
I didn't store the result of the benchmark test, so the error check at 
the end returned a false negative.


The benchmark code was from Rika's SHA-1 test code, which I didn't 
properly check, although I assumed the logic was to avoid counting the 
time of the internal loop as much as possible.  I should have gone with 
my gut instinct and realised that wasn't the best method.


I've attached the updated test (now called "blea" as it's a benchmark 
test) with your suggestions implemented, and an improved benchmarking 
system.  I'm not used to specifying parameters in place of registers - 
I'm too used to needing total control!


Your results from experiments with adding additional ADD instructions are 
expected, as LEA uses an AGU for computation, leaving the ALUs free for 
other tasks (like ADD), so LEA is better even if speed is equal.


Kit

On 08/10/2023 11:06, Marģers . via fpc-devel wrote:
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to 
the execution time per call.

2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, linux 
64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; 
assembler; nostackframe;

asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD 
instructions around LEA while the results didn't change, but only 2 around 
ADD.


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

{ %CPU=i386,x86_64 }
program blea;

{$IF not defined(CPUX86) and not defined(CPUX86_64)}
  {$FATAL This test program requires an Intel x86 or x64 processor }
{$ENDIF}

{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := Result + X + $87654321;
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
  ADD Input, $87654321
  ADD Input, X
  XOR Input, Y
  DEC Y
  JNZ @Loop1
  MOV Result, Input
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, Y
  DEC Y
  JNZ @Loop2
  MOV EAX, ECX
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): 
LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    Result := proc(Result, X, internal_reps);
  until (reps >= 1);
  time := ((Now - start) * SecsPerDay) / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 
1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 
1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 
1000);
  
  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-08 Thread Marģers . via fpc-devel
1. Why do you leave "time:=..." in the benchmark loop? It adds 50% to the execution time per call.
2. Pascal version does not match assembler version. Had to fix it.
  //Result := X + Counter + $87654321;
  Result:=Result + X + $87654321;
  Result:=Result xor y;
3. Assembler functions can be unified to work under win64,win32, linux 64, linux 32
function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
  LEA Input, [Input + X + $87654321]
  XOR Input, y
  DEC y
  JNZ @Loop2
  MOV EAX, Input
end;

4. My results. Ryzen 2700x

   Pascal control case: 0.7 ns/call  0.0710
 Using LEA instruction: 0.7 ns/call  0.0700
Using ADD instructions: 0.7 ns/call  0.0710

Even though the results are equal, I was able to add 4 independent ADD instructions around LEA while the results didn't change, but only 2 around ADD.
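
Point 2 above (the Pascal control case has to compute the same value as the assembler loops) can be checked without any assembler at all. The following is a minimal sketch, assuming the corrected loop body from point 2; the function names ChecksumRef and ChecksumTwoAdds are hypothetical and only mirror the single-expression (LEA-style) and two-step (ADD-style) forms.

```pascal
program checkforms;
{$MODE OBJFPC}
{$Q-}{$R-}  // wrap-around arithmetic is intended here

// Single-expression form, as in the corrected Pascal control case.
function ChecksumRef(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
  begin
    Result := Result + X + $87654321;
    Result := Result xor Y;
    Dec(Y);
  end;
end;

// The same update split into two dependent additions, like the ADD variant.
function ChecksumTwoAdds(Input, X, Y: LongWord): LongWord;
begin
  Result := Input;
  while Y > 0 do
  begin
    Result := Result + $87654321;
    Result := Result + X;
    Result := Result xor Y;
    Dec(Y);
  end;
end;

begin
  if ChecksumRef(500, 1000, 1000) = ChecksumTwoAdds(500, 1000, 1000) then
    WriteLn('two-ADD form matches the single-expression form')
  else
    WriteLn('MISMATCH between the two forms');
end.
```

Because 32-bit addition wraps around and is associative, the two forms always agree; a mismatch against the assembler routines therefore points at the routine or its calling convention, not at the arithmetic.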

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread J. Gareth Moreton via fpc-devel
I'm still slightly curious, but if full optimisations make better code, 
then indeed it's probably not worth the effort.


Your timings are incredibly helpful - thank you!  If I understand, AMD 
A9 is the Excavator architecture, which implies that AMD processors 
don't suffer from the same latency with complex LEA instructions as 
Intel processors do.


Looking at Agner Fog's tables, it looks like the slow LEA instructions 
only came about at Sandy Bridge, which for Free Pascal I think lines up 
with COREAVX.  Even the Pentium-era processors have a 1-cycle LEA, and 
your testing on an AMD 486 shows it is at least as fast as two ADDs in a 
dependency chain. That should be all the information I need - thanks again!


Kit
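
To relate the ns/call figures in this thread to cycle counts, a reported time can be multiplied by the clock frequency. This is a minimal sketch: NsToCycles is a hypothetical helper, and the only input used is the Athlon 1 GHz figure of 3.1 ns/call reported elsewhere in the thread for both the LEA and ADD versions.

```pascal
program nstocycles;
{$MODE OBJFPC}

// ns * (cycles per ns) = cycles; a clock of 1 GHz is 1 cycle per ns.
function NsToCycles(NsPerCall, ClockGHz: Double): Double;
begin
  Result := NsPerCall * ClockGHz;
end;

begin
  // 3.1 ns/call on a 1 GHz Athlon is about 3.1 cycles per loop iteration
  WriteLn('Athlon 1 GHz: ', NsToCycles(3.1, 1.0):0:1, ' cycles per iteration');
end.
```

Each iteration also contains the XOR and the DEC/JNZ pair, so the figure is the cost of the whole loop body and not the latency of LEA or ADD alone.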

On 07/10/2023 19:03, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:

That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value.
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call


OK. My results for the AMD A9 CPU mentioned previously and 32-bit 
trunk compiler (Linux) are:


   Pascal control case: 2.3 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 1.5 ns/call


The same machine, the same operating environment, but a 64-bit trunk 
compiler:


   Pascal control case: 3.6 ns/call
 Using LEA instruction: 0.9 ns/call
Using ADD instructions: 1.3 ns/call


I tried compiling and running the test with all of FPC 2.0.4, 2.2.4, 
2.4.4, 2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and realized that 
all results (for both the assembler and Pascal versions) compiled with 
anything older than 3.2.2 are an order of magnitude faster than with 
3.2.2 (i.e. less than 1 ns/call for the older versions compared to 8 
ns/call with Pascal / 4 ns/call with assembler versions). This means 
that the comparison is obviously spoiled by something unrelated. 
Moreover, I noticed that when compiling with the highest level of 
optimizations, the Pascal version compiled for i386 is as fast as or 
even a little bit faster than the assembler version. I didn't do that 
previously, thus the longer time for the older compiler version 
probably isn't relevant. From this point of view, it probably doesn't 
make sense to spend time on comparing the generated code.


Tomas




On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, 
I realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad 
memory. :-( I compiled and ran the test under OS/2 there (I was too 
lazy to boot it to DOS ;-) ), but I assume that it shouldn't make 
any substantial difference. The ADD and LEA results were basically 
the same there, both around 100 ns / call. The Pascal result was 
around twice as long. Interestingly, the Pascal result for FPC 3.2.2 
was around 10% longer than the same source compiled with FPC 2.0.3 
(the assembler versions were obviously the same for both FPC 
versions; I tried compiling it also with FPC 1.0.10 and the 
assembler versions were more than three times slower due to missing 
support for the nostackframe directive).


I tested it under the AMD Athlon 1 GHz machine as well and again, 
the results for LEA and ADD are basically equal (both 3.1 ns/call) 
and the result for Pascal slightly more than twice (7.3 ns/call). 
However, rather surprisingly for me, the overall test run was _much_ 
longer there?! Finally, I tried compiling the test on a 64-bit 
machine (AMD A9-9425) with Linux (compiled for 64-bits with FPC 
3.2.3 compiled from a fresh 3.2 branch). The Pascal version shows 
about 4 ns/call, but the assembler version runs forever - well, 
certainly much longer than my patience lasts. I haven't tried to 
analyze the reasons, but that's what I get.


Tomas





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via 
fpc-devel"  wrote:



Hi Kit,

This is mainly to Florian, but also to anyone else who can answer 
the question - at which point did a complex LEA instruction 
(using all three input operands and some other specific 
circumstances) get slow? Preliminary research suggests the 486 
was when it gained extra latency, and then Sandy Bridge when it 
got particularly bad.  Ice Lake seems to be the architecture 
where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.
I cannot answer your question, but if you prepare a test 

Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread Tomas Hajny via fpc-devel

On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:

That's interesting; I am interested to see the assembly output for the
Pascal control cases.  As for the 64-bit version, that was my fault
since the assembly language is for Microsoft's ABI rather than the
System V ABI, so it was checking a register with an undefined value. 
Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call


OK. My results for the AMD A9 CPU mentioned previously and 32-bit trunk 
compiler (Linux) are:


   Pascal control case: 2.3 ns/call
 Using LEA instruction: 1.2 ns/call
Using ADD instructions: 1.5 ns/call


The same machine, the same operating environment, but a 64-bit trunk 
compiler:


   Pascal control case: 3.6 ns/call
 Using LEA instruction: 0.9 ns/call
Using ADD instructions: 1.3 ns/call


I tried compiling and running the test with all of FPC 2.0.4, 2.2.4, 
2.4.4, 2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and realized that all 
results (for both the assembler and Pascal versions) compiled with 
anything older than 3.2.2 are an order of magnitude faster than with 
3.2.2 (i.e. less than 1 ns/call for the older versions compared to 8 
ns/call with Pascal / 4 ns/call with assembler versions). This means 
that the comparison is obviously spoiled by something unrelated. 
Moreover, I noticed that when compiling with the highest level of 
optimizations, the Pascal version compiled for i386 is as fast as or 
even a little bit faster than the assembler version. I didn't do that 
previously, thus the longer time for the older compiler version probably 
isn't relevant. From this point of view, it probably doesn't make sense 
to spend time on comparing the generated code.


Tomas




On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, I 
realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad 
memory. :-( I compiled and ran the test under OS/2 there (I was too 
lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any 
substantial difference. The ADD and LEA results were basically the 
same there, both around 100 ns / call. The Pascal result was around 
twice as long. Interestingly, the Pascal result for FPC 3.2.2 was 
around 10% longer than the same source compiled with FPC 2.0.3 (the 
assembler versions were obviously the same for both FPC versions; I 
tried compiling it also with FPC 1.0.10 and the assembler versions 
were more than three times slower due to missing support for the 
nostackframe directive).


I tested it under the AMD Athlon 1 GHz machine as well and again, the 
results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
result for Pascal slightly more than twice (7.3 ns/call). However, 
rather surprisingly for me, the overall test run was _much_ longer 
there?! Finally, I tried compiling the test on a 64-bit machine (AMD 
A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from 
a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the 
assembler version runs forever - well, certainly much longer than my 
patience lasts. I haven't tried to analyze the reasons, but that's 
what I get.


Tomas





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
 wrote:



Hi Kit,

This is mainly to Florian, but also to anyone else who can answer 
the question - at which point did a complex LEA instruction (using 
all three input operands and some other specific circumstances) get 
slow? Preliminary research suggests the 486 was when it gained 
extra latency, and then Sandy Bridge when it got particularly bad.  
Ice Lake seems to be the architecture where faster LEA instructions 
are reintroduced, but I'm not sure about AMD processors.
I cannot answer your question, but if you prepare a test program, I 
can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines 
if it helps you in any way (at least I hope the 486 DX2 machine 
should be still able to start ;-) ).


Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


___
fpc-devel maillist  -  

Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread J. Gareth Moreton via fpc-devel
That's interesting; I am interested to see the assembly output for the 
Pascal control cases.  As for the 64-bit version, that was my fault 
since the assembly language is for Microsoft's ABI rather than the 
System V ABI, so it was checking a register with an undefined value.  
Find attached the fixed test.


Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

   Pascal control case: 2.0 ns/call
 Using LEA instruction: 1.7 ns/call
Using ADD instructions: 1.3 ns/call
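
On the ABI mix-up described above: the attached test hard-codes a different register for each parameter depending on the target. The sketch below only reports which mapping applies to the current build. The three mappings are copied from the register choices in the attached leatest program; the function name AbiRegisters is hypothetical, and the conditional symbols (CPUX86_64, MSWINDOWS, CPUI386) are assumed to be the compiler's standard defines.

```pascal
program abiregs;
{$MODE OBJFPC}

// Returns the register assignment for (Input, X, Y) on the current target,
// following the register choices used in the attached test program.
function AbiRegisters: string;
begin
{$if defined(CPUX86_64) and defined(MSWINDOWS)}
  Result := 'Win64 ABI: Input=ECX, X=EDX, Y=R8D';
{$elseif defined(CPUX86_64)}
  Result := 'System V x86-64 ABI: Input=EDI, X=ESI, Y=EDX';
{$elseif defined(CPUI386)}
  Result := 'i386 register convention: Input=EAX, X=EDX, Y=ECX';
{$else}
  Result := 'not an x86 target';
{$endif}
end;

begin
  WriteLn(AbiRegisters);
end.
```

Code written for the Win64 mapping and run under System V reads its loop counter from R8D, which holds no parameter there; that is the undefined register value mentioned above.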

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, I 
realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad 
memory. :-( I compiled and ran the test under OS/2 there (I was too 
lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any 
substantial difference. The ADD and LEA results were basically the 
same there, both around 100 ns / call. The Pascal result was around 
twice as long. Interestingly, the Pascal result for FPC 3.2.2 was 
around 10% longer than the same source compiled with FPC 2.0.3 (the 
assembler versions were obviously the same for both FPC versions; I 
tried compiling it also with FPC 1.0.10 and the assembler versions 
were more than three times slower due to missing support for the 
nostackframe directive).


I tested it under the AMD Athlon 1 GHz machine as well and again, the 
results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
result for Pascal slightly more than twice (7.3 ns/call). However, 
rather surprisingly for me, the overall test run was _much_ longer 
there?! Finally, I tried compiling the test on a 64-bit machine (AMD 
A9-9425) with Linux (compiled for 64-bits with FPC 3.2.3 compiled from 
a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the 
assembler version runs forever - well, certainly much longer than my 
patience lasts. I haven't tried to analyze the reasons, but that's 
what I get.


Tomas





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
 wrote:



Hi Kit,

This is mainly to Florian, but also to anyone else who can answer 
the question - at which point did a complex LEA instruction (using 
all three input operands and some other specific circumstances) get 
slow? Preliminary research suggests the 486 was when it gained 
extra latency, and then Sandy Bridge when it got particularly bad.  
Ice Lake seems to be the architecture where faster LEA instructions 
are reintroduced, but I'm not sure about AMD processors.
I cannot answer your question, but if you prepare a test program, I 
can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines 
if it helps you in any way (at least I hope the 486 DX2 machine 
should be still able to start ;-) ).


Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
program leatest;
{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; 
nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
  {$else MSWINDOWS}
  ADD EDI, $87654321
  ADD EDI, ESI
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop1
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
  {$else MSWINDOWS}
  LEA EDI, [EDI + ESI + $87654321]
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop2
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 

Re: [fpc-devel] LEA instruction speed

2023-10-07 Thread Tomas Hajny via fpc-devel

On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:


Hi Kit,


Do you think this should suffice? Originally it ran for 1,000,000
repetitions but I fear that will take way too long on a 486, so I
reduced it to 10,000.


OK, I tried it now. First of all, after turning on the old machine, I 
realized that it wasn't Intel but AMD 486 DX4 - sorry for my bad memory. 
:-( I compiled and ran the test under OS/2 there (I was too lazy to boot 
it to DOS ;-) ), but I assume that it shouldn't make any substantial 
difference. The ADD and LEA results were basically the same there, both 
around 100 ns / call. The Pascal result was around twice as long. 
Interestingly, the Pascal result for FPC 3.2.2 was around 10% longer 
than the same source compiled with FPC 2.0.3 (the assembler versions 
were obviously the same for both FPC versions; I tried compiling it also 
with FPC 1.0.10 and the assembler versions were more than three times 
slower due to missing support for the nostackframe directive).


I tested it on the AMD Athlon 1 GHz machine as well and again, the 
results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
result for Pascal is slightly more than twice that (7.3 ns/call). However, 
rather surprisingly for me, the overall test run was _much_ longer 
there?! Finally, I tried compiling the test on a 64-bit machine (AMD 
A9-9425) with Linux (compiled for 64-bit with FPC 3.2.3 built from a 
fresh 3.2 branch). The Pascal version shows about 4 ns/call, but the 
assembler version runs forever - well, certainly much longer than my 
patience lasts. I haven't tried to analyze the reasons, but that's what 
I get.


Tomas
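
A possible explanation for the 64-bit Linux behaviour, offered as an 
assumption rather than something confirmed in this message: the x86-64 
branches of this first test read their arguments from ECX, EDX and R8D, 
which is where Win64 passes them, whereas the System V ABI used on Linux 
passes the first three integer arguments in EDI, ESI and EDX. The revised 
copy of the test in this thread uses exactly those two register sets under 
an MSWINDOWS conditional. The sketch below (program name hypothetical) 
only prints which mapping applies to the target it is compiled for:

```pascal
program regmap;
{$MODE OBJFPC}

// Hypothetical helper, not part of the posted test: reports where the
// three LongWord arguments (Input, X, Y) arrive for the current target,
// matching the registers used in the revised leatest source.
begin
{$if defined(CPUX86_64) and defined(MSWINDOWS)}
  WriteLn('Win64:            Input=ECX  X=EDX  Y=R8D');
{$elseif defined(CPUX86_64)}
  WriteLn('x86-64 System V:  Input=EDI  X=ESI  Y=EDX');
{$elseif defined(CPUI386)}
  WriteLn('i386 (register):  Input=EAX  X=EDX  Y=ECX');
{$else}
  WriteLn('Not an x86 target; the assembler routines do not apply.');
{$endif}
end.
```

If that assumption holds, the loop counter on Linux would be whatever 
happened to be in R8D at the call, which would fit a run time far beyond 
the intended 1000 iterations per call.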





On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:

Hi Kit,

This is mainly to Florian, but also to anyone else who can answer the 
question - at which point did a complex LEA instruction (using all 
three input operands and some other specific circumstances) get 
slow?  Preliminary research suggests the 486 was when it gained extra 
latency, and then Sandy Bridge when it got particularly bad.  Icy 
Lake seems to be the architecture where faster LEA instructions are 
reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I 
can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines 
if it helps you in any way (at least I hope the 486 DX2 machine should 
still be able to start ;-) ).


Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel




Re: [fpc-devel] LEA instruction speed

2023-10-06 Thread J. Gareth Moreton via fpc-devel

Hi Tomas,

Do you think this should suffice? Originally it ran for 1,000,000 
repetitions but I fear that will take way too long on a 486, so I 
reduced it to 10,000.


Kit

On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:

On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:

Hi Kit,


This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?  Preliminary research suggests 
the 486 was when it gained extra latency, and then Sandy Bridge when it got 
particularly bad.  Icy Lake seems to be the architecture where faster LEA 
instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it 
on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines if it helps you in 
any way (at least I hope the 486 DX2 machine should still be able to start ;-) ).

Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
program leatest;
{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
      Result := X + Counter + $87654321;
      Dec(Counter);
    end;
end;

function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop1:
{$ifdef CPUX86_64}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
asm
@Loop2:
{$ifdef CPUX86_64}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
  repeat
    inc(reps);
    proc(Result, X, internal_reps);
    time := (Now - start) * SecsPerDay;
  until (reps >= 1);
  time := time / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 500, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 500, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 500, 1000);

  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;

  if FailureCode <> 0 then
    Halt(FailureCode);
end.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
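
For anyone comparing the three routines in the test above: each pass of 
the two assembler loops, as posted, adds X and the constant to the running 
value and then XORs in the loop counter, while Checksum_PAS as posted does 
not feed Result back into the sum. A plain-Pascal routine with the same 
per-iteration arithmetic as the assembler loops could look like the sketch 
below. Checksum_REF is a hypothetical name, not part of the attachment, 
and unlike the assembler loops it simply returns Input when Y is 0.

```pascal
program learef;
{$MODE OBJFPC}
{$R-}{$Q-}  // 32-bit wrap-around is intended, so no range/overflow checks

// Hypothetical reference routine mirroring one iteration of the posted
// assembler loops: sum = Result + X + constant, then XOR with the counter.
function Checksum_REF(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while Counter > 0 do
    begin
      // The three-input sum that a single LEA, or the ADD pair, produces
      Result := LongWord(Result + X + $87654321);
      // XOR with the counter before it is decremented
      Result := Result xor Counter;
      Dec(Counter);
    end;
end;

begin
  // Same arguments the posted Benchmark passes on its first call
  WriteLn(Checksum_REF(500, 1000, 1000));
end.
```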


Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel
What should I call a new sub-CPU option?  Should it be "ICELAKE" or is 
there a better name like "CORE10" or "COREX" (X being the Roman numeral 
for 10, standing in for the 10th generation of Intel Core)?


Kit

On 03/10/2023 08:02, Florian Klämpfl via fpc-devel wrote:



On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?

Maybe check Agner’s list?


Preliminary research suggests the 486 was when it gained extra latency, and 
then Sandy Bridge when it got particularly bad.  Icy Lake seems to be the 
architecture where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.

Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better 
name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel
I don't think any of them currently fit, although Zen 3 is later than 
Ice Lake, but I'm not sure if it has a faster LEA or not. I'll do some 
investigation.  I'll take up Tomas' offer on the 486 test though.  
Personally I think the best test might actually be one of the 
recently-optimised cryptographic functions.


Kit

On 03/10/2023 08:02, Florian Klämpfl via fpc-devel wrote:



On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?

Maybe check Agner’s list?


Preliminary research suggests the 486 was when it gained extra latency, and 
then Sandy Bridge when it got particularly bad.  Icy Lake seems to be the 
architecture where faster LEA instructions are reintroduced, but I'm not sure 
about AMD processors.

Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a better 
name or does it fall under one of our categories already (CORE_AVX2 or ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread Florian Klämpfl via fpc-devel


> On 03.10.2023 at 03:32, J. Gareth Moreton via fpc-devel wrote:
> 
> Hi everyone,
> 
> This is mainly to Florian, but also to anyone else who can answer the 
> question - at which point did a complex LEA instruction (using all three 
> input operands and some other specific circumstances) get slow? 

Maybe check Agner’s list?

> Preliminary research suggests the 486 was when it gained extra latency, and 
> then Sandy Bridge when it got particularly bad.  Icy Lake seems to be the 
> architecture where faster LEA instructions are reintroduced, but I'm not sure 
> about AMD processors.
> 
> Should I introduce a new x86 subprocessor named "ICYLAKE" or is there a 
> better name or does it fall under one of our categories already (CORE_AVX2 or 
> ZEN3)?

If it doesn’t fit in the existing ones, you can always add new ones

> 
> Kit
> 
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-03 Thread J. Gareth Moreton via fpc-devel

Hmmm, could be fun to attempt to test - I'll see what I can set up.

Kit

On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:

On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:

Hi Kit,


This is mainly to Florian, but also to anyone else who can answer the question 
- at which point did a complex LEA instruction (using all three input operands 
and some other specific circumstances) get slow?  Preliminary research suggests 
the 486 was when it gained extra latency, and then Sandy Bridge when it got 
particularly bad.  Icy Lake seems to be the architecture where faster LEA 
instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it 
on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines if it helps you in 
any way (at least I hope the 486 DX2 machine should still be able to start ;-) ).

Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel




Re: [fpc-devel] LEA instruction speed

2023-10-02 Thread Tomas Hajny via fpc-devel
On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" wrote:

Hi Kit,

>This is mainly to Florian, but also to anyone else who can answer the question 
>- at which point did a complex LEA instruction (using all three input operands 
>and some other specific circumstances) get slow?  Preliminary research 
>suggests the 486 was when it gained extra latency, and then Sandy Bridge when 
>it got particularly bad.  Icy Lake seems to be the architecture where faster 
>LEA instructions are reintroduced, but I'm not sure about AMD processors.

I cannot answer your question, but if you prepare a test program, I can run it 
on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines if it helps you in 
any way (at least I hope the 486 DX2 machine should still be able to start ;-) ).

Tomas

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] LEA instruction speed

2023-10-02 Thread J. Gareth Moreton via fpc-devel

(And I meant "Ice Lake", not "Icy Lake")

On 03/10/2023 02:32, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

This is mainly to Florian, but also to anyone else who can answer the 
question - at which point did a complex LEA instruction (using all 
three input operands and some other specific circumstances) get slow?  
Preliminary research suggests the 486 was when it gained extra 
latency, and then Sandy Bridge when it got particularly bad.  Icy Lake 
seems to be the architecture where faster LEA instructions are 
reintroduced, but I'm not sure about AMD processors.


Should I introduce a new x86 subprocessor named "ICYLAKE" or is there 
a better name or does it fall under one of our categories already 
(CORE_AVX2 or ZEN3)?


Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

