Ideally, you should specify 'vectorcall' when interfacing with third-party libraries, when the compiler can vectorise the code, or when you are vectorising it yourself in assembly language.  For example, if I wanted to write the cmod function in x86_64 assembler (Intel notation), it would look like this:

function cmod(z: Complex): Double; vectorcall; assembler; nostackframe;
asm
  MULPD XMM0, XMM0
  HADDPD XMM0, XMM0
  SQRTSD XMM0, XMM0
end;

Without vectorcall (or with an unaligned type), each field is passed in a separate register, so the code would instead be:

function cmod(z: Complex): Double; assembler; nostackframe;
asm
  MULSD XMM0, XMM0
  MULSD XMM1, XMM1
  ADDSD XMM0, XMM1
  SQRTSD XMM0, XMM0
end;
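
For checking either assembler version, a plain Pascal reference is handy.  This is only a minimal sketch: TComplexRef and CModRef are names I made up for the example (a stand-in record with two Double fields, so that it compiles on its own), not anything from uComplex.

```pascal
program CModReference;

{$mode objfpc}

type
  { Hypothetical stand-in for the Complex type discussed above }
  TComplexRef = record
    re, im: Double;
  end;

{ Reference modulus: sqrt(re^2 + im^2), the same arithmetic
  that both assembler versions perform }
function CModRef(const z: TComplexRef): Double;
begin
  Result := Sqrt(z.re * z.re + z.im * z.im);
end;

var
  z: TComplexRef;
begin
  z.re := 3.0;
  z.im := 4.0;
  { The modulus of 3 + 4i is 5 }
  WriteLn(CModRef(z):0:4);
end.
```

Comparing the output of an assembler routine against something like this for a handful of inputs is a cheap way to catch a wrong register assumption.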

Admittedly the advantages are more obvious when using arrays of Singles.  A good example would be a 4-component dot product of a vector with itself (I know there's a dot product instruction, DPPS, in SSE4.1, but I'm ignoring it for now):

type
  TVector4 = record
    x, y, z, w: Single;
  end align 16; { hey, I can dream! }

function DotProduct(V: TVector4): Single; vectorcall; assembler; nostackframe;
asm
  MULPS XMM0, XMM0
  HADDPS XMM0, XMM0
  HADDPS XMM0, XMM0
  { Only the first component of XMM0 is considered for the result }
end;

And without vectorcall (or with an unaligned type):

function DotProduct(V: TVector4): Single; assembler; nostackframe;
asm
  MULSS XMM0, XMM0
  MULSS XMM1, XMM1
  MULSS XMM2, XMM2
  MULSS XMM3, XMM3
  ADDSS XMM0, XMM1
  ADDSS XMM0, XMM2
  ADDSS XMM0, XMM3
end;

It's hard to say which function is more efficient here, given the latency of HADDPS and the multiple execution ports available (usually you can do at least two independent vector multiplications simultaneously), but the overhead of moving each field into a separate register will definitely add up.  At the very least, if the compiler were able to produce the first dot product example from Pascal source, it would be much more efficient to inline because it only uses a single register throughout.  I'm not sure how the compiler would know to inline a function once it has reached the assembler stage though, even if the registers are still virtual.
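
One more difference between the two dot product versions above is the order in which the products are summed: the two HADDPS steps add them pairwise, while the chained ADDSS instructions add them left to right.  Floating-point addition isn't associative, so the two can differ in the last bit for some inputs.  A rough Pascal sketch of both orders follows, with made-up names (TVec4Ref, DotSelfPairwise, DotSelfSequential) and assuming a target that evaluates Single expressions in single precision:

```pascal
program DotOrderSketch;

{$mode objfpc}

type
  { Hypothetical stand-in for TVector4, without the alignment wish }
  TVec4Ref = record
    x, y, z, w: Single;
  end;

{ Pairwise order, as the two HADDPS steps add the products }
function DotSelfPairwise(const V: TVec4Ref): Single;
begin
  Result := (V.x * V.x + V.y * V.y) + (V.z * V.z + V.w * V.w);
end;

{ Left-to-right order, as the chained ADDSS instructions add them }
function DotSelfSequential(const V: TVec4Ref): Single;
begin
  Result := ((V.x * V.x + V.y * V.y) + V.z * V.z) + V.w * V.w;
end;

var
  V: TVec4Ref;
begin
  V.x := 1.0;
  V.y := 2.0;
  V.z := 3.0;
  V.w := 4.0;
  { 1 + 4 + 9 + 16 = 30 in either order, since every step is exact here }
  WriteLn(DotSelfPairwise(V):0:1);
  WriteLn(DotSelfSequential(V):0:1);
end.
```

With small integer components like these both orders agree exactly; a test that compares the two assembler routines bit for bit on arbitrary data should allow for a difference in the last place.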

To get back to the subject at hand: the advantages of vectorcall.  Microsoft Visual C++ has a compiler option (/Gv) that sets the calling convention to "vectorcall" rather than the default Microsoft calling convention (which is based off "fastcall"), since in most cases, with integers, pointers and individual floating-point parameters, vectorcall doesn't behave any differently.  FPC would only be able to take full advantage of vectorcall, and of aligned types under Linux, if the compiler got better at generating vectorised instructions.

As a side note, I would like to propose adding the "fastcall" calling convention for i386-win32 and x86_64-win64 (and maybe other i386 and x86_64 platforms).  Under Win32, fastcall uses ECX and EDX for its first two parameters and EAX for the result (it's a worse form of Pascal's default 'register' convention, but it was designed in the days when C++ functions pushed all their parameters to the stack).  Under Win64 it would be equivalent to 'ms_abi_default' and force the default Microsoft calling convention regardless of whether there was a setting to default to vectorcall.  (I consider the default calling convention to be based off fastcall because it uses RCX and RDX for its first two parameters, then adds R8 and R9 for the next two, and the XMM registers for floating-point arguments.)  More than anything, it would help when interfacing with third-party libraries.

Gareth aka. Kit

On 27/10/2019 08:02, Florian Klämpfl wrote:

On 27.10.19 at 07:32, J. Gareth Moreton wrote:
I guess you're right.  It just seems weird because the System V ABI was designed from the start to use the MM registers fully, so long as the data is aligned.  In effect, it had vectorcall wrapped into its design from the start. Granted, vectorcall has some advantages and can deal with relatively complex aggregates that the System V ABI cannot handle (for example, a record type that contains a normal vector and information relating to bump mapping).

I just hoped that making updates to uComplex, while ensuring existing Pascal code still compiles, would help take advantage of modern ABI designs.

Is there currently any example which shows that vectorcall has any advantage with FPC? Otherwise I would propose first making FPC able to take advantage of it, and only then discussing whether we really add vectorcall. Currently, I fear FPC only gets into trouble when using vectorcall, as it first tries to pack everything into one XMM register and then splits it up again in the callee.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

