Re: [fpc-devel] Question on updating FPC packages

J. Gareth Moreton Sun, 27 Oct 2019 03:53:21 -0700

Ideally, you should specify 'vectorcall' either when interfacing with third-party libraries, when the code can be vectorised by the compiler, or when doing it yourself in assembly language. For example, if I wanted to write the cmod function in x86_64 assembler (Intel notation):


function cmod(z: Complex): Double; vectorcall; assembler; nostackframe;
asm
  MULPD XMM0, XMM0
  HADDPD XMM0, XMM0
  SQRTSD XMM0, XMM0
end;

Without vectorcall (or with an unaligned type), where each field would be in a separate register, the code would instead be:


function cmod(z: Complex): Double; assembler; nostackframe;
asm
  MULSD XMM0, XMM0
  MULSD XMM1, XMM1
  ADDSD XMM0, XMM1
  SQRTSD XMM0, XMM0
end;
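
For reference, both assembler versions above should return the same value that this plain Pascal sketch computes, namely sqrt(re^2 + im^2). The record TComplexSketch and the helpers MakeComplex and CModRef are hypothetical names made up for this illustration; the real Complex type lives in uComplex and may be declared differently.

```pascal
program CModSketch;
{$mode objfpc}

type
  { Hypothetical stand-in for uComplex's Complex record (assumption) }
  TComplexSketch = record
    re, im: Double;
  end;

{ Hypothetical helper that builds a value for the example }
function MakeComplex(are, aim: Double): TComplexSketch;
begin
  Result.re := are;
  Result.im := aim;
end;

{ Plain Pascal reference: the modulus sqrt(re^2 + im^2).
  The vectorcall version squares both fields with one MULPD and
  adds them with HADDPD; the other version uses two MULSD and an ADDSD. }
function CModRef(const z: TComplexSketch): Double;
begin
  Result := Sqrt(z.re * z.re + z.im * z.im);
end;

begin
  { 3-4-5 triangle: the modulus of 3 + 4i }
  WriteLn(CModRef(MakeComplex(3.0, 4.0)):0:4);
end.
```

A reference like this is mainly useful for checking a hand-written assembler routine against known inputs.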

Admittedly the advantages are more obvious when using arrays of Singles. I guess a good example would be a 4-component dot product (I know there's a dot product instruction in SSE4, but I'm ignoring it for now):


type
  TVector4 = record
    x, y, z, w: Single;
  end align 16; { hey, I can dream! }

function DotProduct(V: TVector4): Single; vectorcall; assembler; nostackframe;
asm
  MULPS XMM0, XMM0
  HADDPS XMM0, XMM0
  HADDPS XMM0, XMM0
  { Only the first component of XMM0 is considered for the result }
end;

And without vectorcall (or an unaligned type):

function DotProduct(V: TVector4): Single; assembler; nostackframe;
asm
  MULSS XMM0, XMM0
  MULSS XMM1, XMM1
  MULSS XMM2, XMM2
  MULSS XMM3, XMM3
  ADDSS XMM0, XMM1
  ADDSS XMM0, XMM2
  ADDSS XMM0, XMM3
end;
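
As a rough cross-check, here is a plain Pascal sketch of what the two assembler routines compute. Since each takes a single vector, the result is V dotted with itself. The function names and the MakeVector4 helper are hypothetical, and the record omits the wished-for alignment directive. Note that the two versions add in a different order: the HADDPS version sums pairwise, the scalar version sums left to right, so results could differ in the last bit for some inputs.

```pascal
program DotSketch;
{$mode objfpc}

type
  { Same fields as TVector4 above, without the alignment directive }
  TVector4 = record
    x, y, z, w: Single;
  end;

{ Hypothetical helper that builds a vector for the example }
function MakeVector4(ax, ay, az, aw: Single): TVector4;
begin
  Result.x := ax;
  Result.y := ay;
  Result.z := az;
  Result.w := aw;
end;

{ Summation order of the MULPS/HADDPS/HADDPS version:
  (x*x + y*y) + (z*z + w*w) }
function SelfDotPairwise(const V: TVector4): Single;
begin
  Result := (V.x * V.x + V.y * V.y) + (V.z * V.z + V.w * V.w);
end;

{ Summation order of the MULSS/ADDSS version:
  ((x*x + y*y) + z*z) + w*w }
function SelfDotSequential(const V: TVector4): Single;
begin
  Result := ((V.x * V.x + V.y * V.y) + V.z * V.z) + V.w * V.w;
end;

var
  V: TVector4;
begin
  V := MakeVector4(1.0, 2.0, 3.0, 4.0);
  WriteLn(SelfDotPairwise(V):0:4);
  WriteLn(SelfDotSequential(V):0:4);
end.
```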

It's hard to say which function is more efficient here due to the latency of HADDPS and the multiple logic ports available (usually you can do at least two independent vector multiplications simultaneously), but the overhead of moving each field to a separate register will definitely add up. At the very least though, for the first dot product example, if the compiler was able to produce such assembler from Pascal source, it would be much more efficient to inline because it only uses a single register throughout. I'm not sure how the compiler would know to inline a function when it's reached the assembler stage though, even if the registers are still virtual.

To get back to the subject at hand... the advantages of vectorcall. Microsoft Visual C++ does have a compiler option where it automatically sets the calling convention to "vectorcall" rather than the default Microsoft calling convention (which is based off "fastcall"), since in most cases with integers, pointers and individual floating-point parameters, vectorcall doesn't behave any differently. FPC would only be able to take full advantage of vectorcall and aligned types under Linux if the compiler was made better with vectorising instructions.

As a side-note, I would like to propose adding the "fastcall" calling convention for i386-win32 and x86_64-win64 (and maybe other i386 and x86_64 platforms). Under Win32, fastcall uses ECX and EDX for its first two parameters and EAX for the result (it's a worse form of Pascal's default 'register' convention, but this was designed in the days when C++ functions pushed all their parameters to the stack), while under Win64 it would be equivalent to 'ms_abi_default' and force the default Microsoft calling convention regardless of whether there was a setting to default to vectorcall (I consider the default calling convention to be based off fastcall because it uses RCX and RDX for its first two parameters, then adds R8 and R9 for the next two, and the XMM registers for floating-point arguments). More than anything it would just help to interface with third-party libraries again.


Gareth aka. Kit

On 27/10/2019 08:02, Florian Klämpfl wrote:

On 27.10.19 at 07:32, J. Gareth Moreton wrote:
I guess you're right. It just seems weird because the System V ABI was designed from the start to use the MM registers fully, so long as the data is aligned. In effect, it had vectorcall wrapped into its design from the start. Granted, vectorcall has some advantages and can deal with relatively complex aggregates that the System V ABI cannot handle (for example, a record type that contains a normal vector and information relating to bump mapping).
I just hoped that making updates to uComplex, while ensuring existing Pascal code still compiles, would help take advantage of modern ABI designs.
Is there currently any example which shows that vectorcall has any advantage with FPC? Else I would propose first to make FPC able to take advantage of it and then talk about if we really add vectorcall. Currently I fear, FPC gets only into trouble when using vectorcall as it tries first to push everything into one xmm register and then splits this again in the callee.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

