Hi,
It could always inline it.
For small sizes, do that mov, and for large sizes do rep stosb on x86. It
is very fast nowadays: faster than FillChar on my Intel laptop, except
for mid sizes like 128 bytes.
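A minimal sketch of that size split, written in plain Pascal rather than as compiler output. ZeroBlock and SmallLimit are hypothetical names, the cut-off of 16 bytes is an arbitrary assumption rather than a measured value, and the FillChar call merely stands in for an inlined rep stosb:

```pascal
program ZeroDispatch;
{$mode objfpc}

const
  SmallLimit = 16; // hypothetical cut-off, not a measured value

// Hypothetical helper: zero Count bytes starting at P.
procedure ZeroBlock(P: PByte; Count: SizeInt);
begin
  if Count <= SmallLimit then
  begin
    // Small block: individual stores, standing in for inlined MOVs.
    while Count > 0 do
    begin
      P^ := 0;
      Inc(P);
      Dec(Count);
    end;
  end
  else
    // Large block: FillChar stands in for an inlined "rep stosb".
    FillChar(P^, Count, 0);
end;

var
  Small: array[0..7] of Byte;
  Large: array[0..4095] of Byte;
begin
  // Pre-fill with a non-zero pattern so the clearing is observable.
  FillChar(Small, SizeOf(Small), $AA);
  FillChar(Large, SizeOf(Large), $AA);
  ZeroBlock(@Small[0], SizeOf(Small));
  ZeroBlock(@Large[0], SizeOf(Large));
  WriteLn('Small[7] = ', Small[7], ', Large[4095] = ', Large[4095]);
end.
```

In a real implementation the compiler would make this choice at compile time when the size is a constant, so no branch would remain in the generated code.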
Bye,
Benito
On 16.04.22 01:26, J. Gareth Moreton via fpc-devel wrote:
Hi everyone,
This is something that sprang to mind when thinking about code speed
and the like, and one thing that cropped up is the initialisation of
large variables such as arrays or records. A common means of doing
this is, say:
FillChar(MyVar, SizeOf(MyVar), 0);
To keep things as general-purpose as possible, this usually results in
a function call that decides the best course of action, and for very
large blocks of data whose size may not be deterministic (e.g. a file
buffer), this is the best approach - the overhead is relatively small
and it quickly uses fast block-move instructions.
However, for small-to-mid-sized variables of known size, this can lead
to some inefficiencies, first by not taking into account that the size
of the variable is known, but also because the initialisation value is
zero, more often than not, and the variable is probably aligned on the
stack (so the checks to make sure a pointer is aligned are unnecessary).
I did a proof of concept on x86_64-win64 with the following record:
type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;
SizeOf(TTestRecord) is 16 and all the fields are on 4-byte
boundaries. Nothing particularly special.
I then declared a variable of this type and filled the fields with
random values, and then ran two different methods to clear their
memory. To get a good speed average, I ran each method 1,000,000,000
times in a for-loop. The first method was:
FillChar(TestRecord, SizeOf(TestRecord), 0);
The second method was inline assembly language (which I've called 'the
intrinsic'):
asm
  PXOR   XMM0, XMM0
  MOVDQU [RIP+TestRecord], XMM0
end;
It's not perfect because the presence of inline assembly prevents the
use of register variables (although TestRecord is always on the stack
regardless), but the performance hit is barely noticeable in this
case, and if the assembly language were inserted by the compiler, the
register variable problem wouldn't arise.
These are my results:
FillChar time: 2.398 ns
Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0
Intrinsic time: 1.336 ns
Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0
Sure, it's on the order of nanoseconds, but the intrinsic is almost
twice as fast.
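For anyone who wants to try something similar, here is a rough sketch of a timing harness of this kind. It is not the program used for the numbers above: the iteration count is reduced, GetTickCount64 only has millisecond resolution, the loop overhead is included in the figures, and the x86_64-specific inline assembly is left out in favour of plain field assignments so that the sketch stays portable. The Report helper is a hypothetical name.

```pascal
program ClearBench;
{$mode objfpc}

uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  Iterations = 100000000; // assumption: a tenth of the 1,000,000,000 used above

var
  TestRecord: TTestRecord;
  I: Integer;
  Start: QWord;

// Hypothetical helper: convert elapsed milliseconds to ns per iteration.
procedure Report(const Caption: string; Milliseconds: QWord);
begin
  WriteLn(Caption, ': ', (Milliseconds * 1000000.0) / Iterations:0:3, ' ns');
end;

begin
  // Method 1: the general-purpose System routine.
  Start := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  Report('FillChar', GetTickCount64 - Start);

  // Method 2: clear each field individually (padding is left untouched).
  Start := GetTickCount64;
  for I := 1 to Iterations do
  begin
    TestRecord.Field1 := 0;
    TestRecord.Field2 := 0;
    TestRecord.Field3 := 0;
    TestRecord.Field4 := 0;
  end;
  Report('Field assignment', GetTickCount64 - Start);
end.
```

The figures will vary with the CPU, the optimisation level and whatever the optimiser does to the second loop, so they should only be read relative to each other.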
In terms of size - FillChar call = 20 bytes:
  488d0d22080200    lea    0x20822(%rip),%rcx      # 0x100022010
  4531c0            xor    %r8d,%r8d
  ba10000000        mov    $0x10,%edx
  e8150a0000        callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
The intrinsic = 12 bytes:
  660fefc0          pxor   %xmm0,%xmm0
  f30f7f05bd050200  movdqu %xmm0,0x205bd(%rip)     # 0x100022010
For a 32-byte record instead, an extra 8-byte MOVDQU instruction would
be required, so the two would be equal in size, but with the bonus that the
intrinsic doesn't have a function call and will probably help
optimisation in the rest of the procedure by freeing up the registers
used to pass parameters (%rcx, %rdx and %r8 in this case; although the
intrinsic will require an XMM register in this x86_64 example, they
tend to not be used as often). Also, the peephole optimizer can
remove redundant PXOR XMM0, XMM0 instructions, which will help as well if
there are multiple FillChar calls.
I'm not proposing a total rewrite, and I would say that in the default
case, it should just fall back to the in-built System functions, but
the relevant compiler nodes could be overridden on specific platforms
to generate smaller, more optimised code when the sizes and values are
known at compile time.
Now, in this example, it is still faster to simply set the fields
manually one-by-one (clocks in at around 1.2 ns), possibly due to the
unaligned write (MOVDQU) and internal SSE state switching adding some
overhead, but there's nothing to stop the compiler from inserting code
in place of the FillChar call to do just that if it thinks it's the
fastest method. Then again, one has to be a little bit careful
because FillChar and the intrinsic will also set the filler bytes
between Field1 and Field2 to 0, whereas manually assigning 0 to the
fields won't (so they aren't strictly equivalent and might only be
allowed if there are no filler bytes or when compiling under -O4, but
the latter may still be dangerous when typecasting is concerned), and
extra care would have to be taken where unions are concerned (sorry,
'union' is a C term - what's the official Pascal term again?).
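The filler-byte difference can be made visible with a small sketch like the one below. It assumes the record layout described earlier (SizeOf = 16, Field2 at offset 4); the Raw overlay and the Dump helper are hypothetical names introduced only for illustration.

```pascal
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  R: TTestRecord;
  // Byte-level view of the same memory as R.
  Raw: array[0..SizeOf(TTestRecord) - 1] of Byte absolute R;

// Hypothetical helper: print every byte of R in hexadecimal.
procedure Dump(const Caption: string);
var
  I: Integer;
begin
  Write(Caption, ':');
  for I := Low(Raw) to High(Raw) do
    Write(' ', HexStr(Raw[I], 2));
  WriteLn;
end;

begin
  // Poison the whole record, padding included.
  FillChar(R, SizeOf(R), $FF);

  // Clearing field by field leaves any filler bytes as they were.
  R.Field1 := 0;
  R.Field2 := 0;
  R.Field3 := 0;
  R.Field4 := 0;
  Dump('Field assignment');

  // FillChar (like the intrinsic) clears every byte of the record.
  FillChar(R, SizeOf(R), 0);
  Dump('FillChar        ');
end.
```

With the layout assumed above, the bytes between Field1 and Field2 would be expected to still hold $FF in the first dump and 00 in the second, which is exactly the case where something like CompareByte or a typecast could tell the two methods apart.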
Actual Pascal calls to FillChar would not change in any way and so
theoretically it won't break existing code. The only drawback is that
the intrinsic and the internal System functions would have to be named
the same so constructs such as "FuncPtr := @FillChar;" as well as
calling FillChar from assembler routines still work, and the compiler
would have to know how to differentiate between the two.
Just on the surface, what are your thoughts?
Gareth aka. Kit
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel