Re: [fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential

Florian Klämpfl via fpc-devel Sat, 16 Apr 2022 12:00:56 -0700

> On 16.04.2022 at 01:26, J. Gareth Moreton via fpc-devel 
> <fpc-devel@lists.freepascal.org> wrote:
> 
> Hi everyone,
> 
> This is something that sprang to mind when thinking about code speed and the 
> like, and one thing that cropped up is the initialisation of large variables 
> such as arrays or records.  A common means of doing this is, say:
> 
> FillChar(MyVar, SizeOf(MyVar), 0);
> 
> To keep things as general-purpose as possible, this usually results in a 
> function call that decides the best course of action, and for very large 
> blocks of data whose size may not be deterministic (e.g. a file buffer), this 
> is the best approach - the overhead is relatively small and it quickly uses 
> fast block-move instructions.
> 
> However, for small-to-mid-sized variables of known size, this can lead to 
> some inefficiencies, first by not taking into account that the size of the 
> variable is known, but also because the initialisation value is zero, more 
> often than not, and the variable is probably aligned on the stack (so the 
> checks to make sure a pointer is aligned are unnecessary).
> 
> I did a proof of concept on x86_64-win64 with the following record:
> 
> type
>   TTestRecord = record
>     Field1: Byte;
>     Field2, Field3, Field4: Integer;
>   end;
> 
> SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.  
> Nothing particularly special.
> 
> I then declared a variable of this type and filled the fields with random 
> values, and then ran two different methods to clear their memory.  To get a 
> good speed average, I ran each method 1,000,000,000 times in a for-loop.  The 
> first method was:
> 
> FillChar(TestRecord, SizeOf(TestRecord), 0);
> 
> The second method was inline assembly language (which I've called 'the 
> intrinsic'):
> 
> asm
>   PXOR   XMM0, XMM0
>   MOVDQU [RIP+TestRecord], XMM0
> end;
> 
> It's not perfect because the presence of inline assembly prevents the use of 
> register variables (although TestRecord is always on the stack regardless), 
> but the performance hit is barely noticeable in this case, and if the 
> assembly language were inserted by the compiler, the register variable 
> problem wouldn't arise.
> 
> These are my results:
> 
>  FillChar time: 2.398 ns
> 
> Field1 = 0
> Field2 = 0
> Field3 = 0
> Field4 = 0
> 
> Intrinsic time: 1.336 ns
> 
> Field1 = 0
> Field2 = 0
> Field3 = 0
> Field4 = 0
> 
> Sure, it's on the order of nanoseconds, but the intrinsic is almost twice as 
> fast.
> 
> In terms of size - FillChar call = 20 bytes:
> 
> 488d0d22080200           lea 0x20822(%rip),%rcx        # 0x100022010
> 4531c0                   xor    %r8d,%r8d
> ba10000000               mov    $0x10,%edx
> e8150a0000               callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
> 
> The intrinsic = 12 bytes:
> 
> 660fefc0                 pxor %xmm0,%xmm0
> f30f7f05bd050200         movdqu %xmm0,0x205bd(%rip)        # 0x100022010
> 
> For a 32-byte record instead, an extra 8-byte MOVDQU instruction would be 
> required, so the two would be equal in size, but with the bonus that the 
> intrinsic doesn't have a function call and will probably help optimisation 
> in the rest of the procedure by freeing up the registers used to pass 
> parameters (%rcx, %rdx and %r8 in this case; although the intrinsic will 
> require an XMM register in this x86_64 example, they tend to not be used as 
> often).  Also, the peephole optimizer can remove redundant PXOR XMM0, XMM0 
> instructions, which will help as well if there are multiple FillChar calls.
> 
> I'm not proposing a total rewrite, and I would say that in the default case, 
> it should just fall back to the in-built System functions, but the relevant 
> compiler nodes could be overridden on specific platforms to generate smaller, 
> more optimised code when the sizes and values are known at compile time.
> 
> Now, in this example, it is still faster to simply set the fields manually 
> one-by-one (clocks in at around 1.2 ns), possibly due to the unaligned write 
> (MOVDQU) and internal SSE state switching adding some overhead, but there's 
> nothing to stop the compiler from inserting code in place of the FillChar 
> call to do just that if it thinks it's the fastest method.  Then again, one 
> has to be a little bit careful because FillChar and the intrinsic will also 
> set the filler bytes between Field1 and Field2 to 0, whereas manually 
> assigning 0 to the fields won't (so they aren't strictly equivalent and might 
> only be allowed if there are no filler bytes or when compiling under -O4, but 
> the latter may still be dangerous where typecasting is concerned), and extra 
> care would have to be taken where unions are concerned (sorry, 'union' is 
> a C term - what's the official Pascal term again?).
> 
> Actual Pascal calls to FillChar would not change in any way and so 
> theoretically it won't break existing code.  The only drawback is that the 
> intrinsic and the internal System functions would have to be named the same 
> so constructs such as "FuncPtr := @FillChar;" as well as calling FillChar 
> from assembler routines still work, and the compiler would have to know how 
> to differentiate between the two.
> 
> Just on the surface, what are your thoughts?

Inlining FillChar is certainly useful (the same goes for Move). The FillChar in the 
system unit could stay; the compiler could simply replace a call to 
System.FillChar with compiler-generated assembler that does the fill.
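
On the filler-byte point raised in the quoted text, here is a minimal sketch 
of why a whole-block fill and field-by-field assignment are not strictly 
equivalent. It reuses TTestRecord from the quoted mail; the program name and 
the NonZeroBytes helper are made up for illustration, and it assumes default 
record alignment and a 32-bit Integer (hence the objfpc mode switch).

```pascal
program FillerBytesSketch;
{$mode objfpc}  { assumption: Integer is 32 bits wide in this mode }

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

{ Hypothetical helper: counts the non-zero bytes in a raw memory block. }
function NonZeroBytes(P: PByte; Count: SizeInt): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 0 to Count - 1 do
    if P[I] <> 0 then
      Inc(Result);
end;

var
  A, B: TTestRecord;
begin
  { Dirty both records first, filler bytes included. }
  FillChar(A, SizeOf(A), $FF);
  FillChar(B, SizeOf(B), $FF);

  { Whole-block clear: every byte of the record is written. }
  FillChar(A, SizeOf(A), 0);

  { Field-by-field clear: only the named fields are written. }
  B.Field1 := 0;
  B.Field2 := 0;
  B.Field3 := 0;
  B.Field4 := 0;

  WriteLn('SizeOf(TTestRecord)       = ', SizeOf(TTestRecord));
  WriteLn('Non-zero bytes, FillChar  = ', NonZeroBytes(@A, SizeOf(A)));
  WriteLn('Non-zero bytes, per field = ', NonZeroBytes(@B, SizeOf(B)));
end.
```

With the layout described in the quoted mail (Field1 at offset 0, the 
Integer fields on 4-byte boundaries), one would expect the filler bytes 
after Field1 to remain non-zero in the field-by-field case. That is the 
difference any compiler-generated replacement would need to preserve or 
explicitly rule out.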

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel