Re: [fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential

Benito van der Zander via fpc-devel Sat, 16 Apr 2022 06:43:15 -0700

Hi,

it could always inline it.

For small sizes, emit that mov sequence, and for large sizes use rep stosb on x86. It is very fast nowadays, faster than FillChar on my Intel laptop (except for mid sizes like 128 bytes).



Bye,
Benito
On 16.04.22 01:26, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,
This is something that sprang to mind when thinking about code speed and the like, and one thing that cropped up is the initialisation of large variables such as arrays or records. A common means of doing this is, say:
FillChar(MyVar, SizeOf(MyVar), 0);
To keep things as general-purpose as possible, this usually results in a function call that decides the best course of action, and for very large blocks of data whose size may not be deterministic (e.g. a file buffer), this is the best approach - the overhead is relatively small and it quickly uses fast block-move instructions.
However, for small-to-mid-sized variables of known size, this can lead to some inefficiencies, first by not taking into account that the size of the variable is known, but also because the initialisation value is zero, more often than not, and the variable is probably aligned on the stack (so the checks to make sure a pointer is aligned are unnecessary).
I did a proof of concept on x86_64-win64 with the following record:

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;
SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries. Nothing particularly special.
I then declared a variable of this type and filled the fields with random values, and then ran two different methods to clear their memory. To get a good speed average, I ran each method 1,000,000,000 times in a for-loop. The first method was:
FillChar(TestRecord, SizeOf(TestRecord), 0);
The second method was inline assembly language (which I've called 'the intrinsic'):
asm
  PXOR   XMM0, XMM0
  MOVDQU [RIP+TestRecord], XMM0
end;
It's not perfect because the presence of inline assembly prevents the use of register variables (although TestRecord is always on the stack regardless), but the performance hit is barely noticeable in this case, and if the assembly language were inserted by the compiler, the register variable problem wouldn't arise.
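For anyone wanting to repeat the measurement, the FillChar half of the timing loop could be sketched roughly as below. This is only an assumed reconstruction, not the original test program: the program name, the use of GetTickCount64 from SysUtils as the timer, and the way the per-call figure is derived are all illustrative choices. ObjFPC mode is selected so that Integer is 32 bits wide, which matches the 16-byte record size quoted above.

```pascal
program FillCharBench;
{$mode objfpc}

uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  { Same iteration count as described in the text. }
  Iterations = 1000000000;

var
  TestRecord: TTestRecord;
  I: LongInt;
  StartTick, ElapsedMs: QWord;
begin
  { Give the fields non-zero values before the first clear. }
  TestRecord.Field1 := 1;
  TestRecord.Field2 := 2;
  TestRecord.Field3 := 3;
  TestRecord.Field4 := 4;

  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  ElapsedMs := GetTickCount64 - StartTick;

  { Milliseconds -> nanoseconds, then divide by the number of calls. }
  WriteLn('FillChar time: ', (ElapsedMs * 1000000.0) / Iterations:0:3, ' ns');
  WriteLn('Field1 = ', TestRecord.Field1);
  WriteLn('Field2 = ', TestRecord.Field2);
  WriteLn('Field3 = ', TestRecord.Field3);
  WriteLn('Field4 = ', TestRecord.Field4);
end.
```

The figure includes the loop overhead itself, so it is only meaningful when compared against a second loop of the same shape around the alternative method.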
These are my results:

 FillChar time: 2.398 ns

Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0

Intrinsic time: 1.336 ns

Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0
Sure, it's on the order of nanoseconds, but the intrinsic is almost twice as fast.
In terms of size - FillChar call = 20 bytes:

488d0d22080200           lea 0x20822(%rip),%rcx        # 0x100022010
4531c0                   xor    %r8d,%r8d
ba10000000               mov    $0x10,%edx
e8150a0000               callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
The intrinsic = 12 bytes:

660fefc0                 pxor %xmm0,%xmm0
f30f7f05bd050200         movdqu %xmm0,0x205bd(%rip)        # 0x100022010
For a 32-byte record instead, an extra 8-byte MOVDQU instruction would be required, so the two would be equal in size, but with the bonus that the intrinsic doesn't have a function call and will probably help optimisation in the rest of the procedure by freeing up the registers used to pass parameters (%rcx, %rdx and %r8 in this case; although the intrinsic will require an XMM register in this x86_64 example, they tend to not be used as often). Also, the peephole optimizer can remove redundant PXOR XMM0, XMM0 instructions, which will help as well if there are multiple FillChar calls.
I'm not proposing a total rewrite, and I would say that in the default case, it should just fall back to the in-built System functions, but the relevant compiler nodes could be overridden on specific platforms to generate smaller, more optimised code when the sizes and values are known at compile time.
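As a rough illustration of what "size and value known at compile time" buys, here is a hand-written stand-in for the kind of code the compiler could emit in place of the call. Zero16 is a hypothetical helper invented for this sketch, not an existing RTL routine, and it assumes a target such as x86-64 where 64-bit stores to a 4-byte-aligned address are permitted.

```pascal
program KnownSizeZero;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

{ Hypothetical helper: clears exactly 16 bytes with two 64-bit stores,
  with no size dispatch and no alignment checks. }
procedure Zero16(var Dest); inline;
var
  P: PQWord;
begin
  P := PQWord(@Dest);
  P[0] := 0;
  P[1] := 0;
end;

var
  R: TTestRecord;
begin
  { The helper is only valid for a 16-byte target. }
  if SizeOf(R) <> 16 then
  begin
    WriteLn('Unexpected record size: ', SizeOf(R));
    Halt(1);
  end;

  R.Field1 := 1;
  R.Field2 := 2;
  R.Field3 := 3;
  R.Field4 := 4;

  Zero16(R);
  WriteLn(R.Field1, ' ', R.Field2, ' ', R.Field3, ' ', R.Field4);
end.
```

A real implementation would live in the compiler's node handling rather than in user code, and would pick the store width per platform.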
Now, in this example, it is still faster to simply set the fields manually one-by-one (clocks in at around 1.2 ns), possibly due to the unaligned write (MOVDQU) and internal SSE state switching adding some overhead, but there's nothing to stop the compiler from inserting code in place of the FillChar call to do just that if it thinks it's the fastest method. Then again, one has to be a little bit careful because FillChar and the intrinsic will also set the filler bytes between Field1 and Field2 to 0, whereas manually assigning 0 to the fields won't (so they aren't strictly equivalent and might only be allowed if there are no filler bytes or when compiling under -O4, but the latter may still be dangerous when typecasting is concerned), and extra care would have to be taken when unions are concerned (sorry, 'union', that's a C term - what's the official Pascal term again?).
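The filler-byte difference can be observed with a small test along these lines. It is a sketch that assumes the default record alignment, so that offsets 1 to 3 are padding between Field1 and Field2; the program name and the choice of $FF as a marker value are illustrative only.

```pascal
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  R: TTestRecord;
  P: PByte;
begin
  P := PByte(@R);

  { Mark every byte, padding included, with a recognisable value. }
  FillChar(R, SizeOf(R), $FF);

  { Field-by-field clearing only touches the bytes owned by the fields. }
  R.Field1 := 0;
  R.Field2 := 0;
  R.Field3 := 0;
  R.Field4 := 0;
  WriteLn('Padding byte after field assignment: ', P[1]);

  { FillChar clears the whole block, padding included. }
  FillChar(R, SizeOf(R), 0);
  WriteLn('Padding byte after FillChar: ', P[1]);
end.
```

If the padding survives the field-by-field assignment, the first line reports a non-zero byte while the second reports zero. That is exactly the case where code comparing records bytewise, or typecasting them, would see a difference between the two methods.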
Actual Pascal calls to FillChar would not change in any way and so theoretically it won't break existing code. The only drawback is that the intrinsic and the internal System functions would have to be named the same so constructs such as "FuncPtr := @FillChar;" as well as calling FillChar from assembler routines still work, and the compiler would have to know how to differentiate between the two.
Just on the surface, what are your thoughts?

Gareth aka. Kit

