Re: [fpc-devel] Register renaming and false dependency question

J. Gareth Moreton via fpc-devel Mon, 18 Oct 2021 17:49:25 -0700

You need to specify "unsigned long long" to produce the equivalent of"(Input * $AAAAAAAAAAAAAAAB) <= $5555555555555555". And yes, my ownexperiments showed that when optimising for size, the C++ compilers usethe actual division operations and other instructions with short machinecode representations (e.g. CQDE/cqto is only 2 bytes long) rather thanthe algorithms listed in Hacker's Delight and other sources.

When optimising for size, using a logical shift or some other simplearithmetic instruction (i.e. anything other than division and maybemultiplication) to transmute one constant into another is perfectlyacceptable if the byte encoding is smaller. For a processor like theARM-v8, which only as 2 ALUs if I remember correctly, using such atransmutation might be preferable, especially if the load cannot berepresented in a single instruction (in the case of "mod 3", I believe"or x0, #0x5555555555555555" is possible with the barrel shifter, sinceit's a short, repeating bit sequence. Larger dividends may not be sofortunate).

When going for speed, obviously using MOV is safest, but I do like tosee if I can shrink code size without sacrificing speed (my first everpatch submission for FPC was changing movq const,%regq to movlconst,%regd if const was between 0 and 2^32 - 1, and using a shortermachine code encoding when writing small negative numbers to 64-bitregisters, shaving off 50kB from the compiler binary). I enjoy thechallenge!


Gareth aka. Kit

On 18/10/2021 16:56, Stefan Glienke via fpc-devel wrote:

None of the aforementioned c++ compilers does this optimization with either 
O2/3 or Os - see https://godbolt.org/z/vrPfd3xej (could be I was missing one of 
the many additional flags you can pass there though)

In the case of the 2 mov instructions, the CPU most likely uses register 
renaming for the second mov into rax to eliminate the write-after-read 
dependency. That means both mov instructions can happen at the same time. Using 
shr causes a dependency so it has to execute after the mov. I don't know though 
how that interfers with rax also being input of the imul instruction and when 
the shl can actually execute. You will have to profile this with something like 
https://github.com/travisdowns/uarch-bench or alike.

On 18/10/2021 11:14 J. Gareth Moreton via fpc-devel 
<[email protected]> wrote:

What I'm figuring is that it should be fine for -Os, since that

optimises for size over speed, and -Os uses "or $-1, %rax" for setting a
register to -1 instead of "mov $-1, %rax".  I copied this behaviour from
MSVC after finding a thread about the false dependency that OR causes
(Florian warned me of the false dependency when I proposed the
optimisation regardless of size or speed).

Outside of -Os, if the immediate previous instruction uses the same
register, then it's guaranteed to have been loaded and there shouldn't
be a pipeline stall, although out-of-order execution may cause some
problems.  This one is going to be interesting to time.

In regards to how frequently it appears in code, if it turns out that
it's very infrequent (normally I compile and analyse the RTL for this
purpose, and sometimes the compiler itself), I likely won't code it,
although in the case of the "fast mod" algorithm, it appears very
frequently, so should probably be coded directly, especially as the
multiplication can take 3 cycles, depending on the processor, and is the
limiting factor.

Gareth aka. Kit

On 17/10/2021 23:53, Stefan Glienke via fpc-devel wrote:

According to compiler explorer clang, gcc and msvc compile this to the
same code with -O3 as FPC does. So I would assume that is fine.

Am 17.10.2021 um 13:25 schrieb J. Gareth Moreton via fpc-devel:

Hi everyone,

While reading up on some algorithms, I came across a recommendation
of using a shorter arithmetic function to change the value of a
constant in a register rather than loading the new value directly.
However, the algorithm assumes a RISC-like processor, so I'm not sure
if it applies to an Intel x86-64 processor.  Consider the following:

movq $0xaaaaaaaaaaaaaaab,%rax
imulq   %rax,%rcx
movq $0x5555555555555555,%rax
cmpq    %rax,%rcx
setle  %al

This algorithm sets %al to 1 if %rcx is divisible by 3, and 0 if it's
not, and was compiled from the following Pascal code (under -O3, but
-O1 produces almost exactly the same):

function IsDivisible3(Numerator: QWord): Boolean;
begin
   Result := (Numerator * $AAAAAAAAAAAAAAAB) <= $5555555555555555;
end;

(One of my merge requests produces this code from "Result := (x mod
3) = 0")

My question is this: can "movq $0x5555555555555555,%rax" be replaced
with "shrq $0x1,%rax" without incurring an additional pipeline
stall?  The MOV instruction takes 10 bytes to store, while "SHR 1"
takes only 3.  Given that %rax is used beforehand and the CMP
instruction has to wait until the IMUL instruction has finished
executing, logic tells me that I can get away with it here, but I'm
not sure if the metric to go by is the execution speed of IMUL (i.e.
the IMUL instruction is the limiting factor before CMP can be
executed), or the simple fact that the previous value of %rax was
used and will be loaded with $AAAAAAAAAAAAAAAB by the time it comes
to load it with a new value.

Gareth aka. Kit

_______________________________________________
fpc-devel maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

_______________________________________________
fpc-devel maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

_______________________________________________
fpc-devel maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Register renaming and false dependency question

Reply via email to