Correction to last post. When applying BZHI to an input ("Result :=
Input and ((1 shl x) - 1)"), the initial "mov $-1,%eax" is unnecessary
unless the mask is being preserved, and is just:
bzhil %ecx,(ref-to-Input),%eax
Kit
On 25/10/2022 13:44, J. Gareth Moreton via fpc-devel wrote:
What I want to do is the following...
Say I have the expression "(1 shl x) - 1"... under the default AMD
Athlon optimisations, you might get something like this:
(x in %cl, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax
Under -CpCOREAVX2, you might get this (ignoring any zero-extensions
required on the index):
(x in %ecx, Result in %eax)
movl $1,%eax
shlxl %ecx,%eax,%eax
subl $1,%eax
All of these sequences take at least 3 cycles, or more accurately,
have a dependency chain of length 3. Now consider using BZHI:
(x in %ecx, Result in %eax)
movl $-1,%eax
bzhil %ecx,%eax,%eax
A dependency chain length of 2 (I'm not sure how many cycles it takes
for BZHI to complete execution).
The savings go further if this result is used as a mask then
discarded, i.e. "Result := Input and ((1 shl x) - 1)". Under AMD
Athlon, for example:
(x in %cl, Input in %edx, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax
andl %edx,%eax
Under -CpCOREAVX2:
(x in %ecx, Input in %edx, Result in %eax)
movl $1,%eax
shlxl %ecx,%eax,%eax
subl $1,%eax
andl %edx,%eax
All have a dependency chain length of 4. But with BZHI:
(x in %ecx, Input in %edx, Result in %eax)
movl $-1,%eax
bzhil %ecx,%edx,%eax
Once again, the dependency chain length is reduced to 2. Like with
the earlier two, this sequence also works if Input is a reference
rather than a register; e.g.
(x in %ecx, Result in %eax)
movl $-1,%eax
bzhil %ecx,(ref-to-Input),%eax
A problem, however, arises if the index x is out of range. In the
case of 32-bit operands, the shift instructions in x86 and ARM
(including AArch64) essentially reduce the index modulo 32. "(1 shl
32) - 1" is often expected to return $FFFFFFFF, a mask that covers the
entire bitrange, but "1 shl 32" returns 1 in this case, so the
resultant mask ends up being all zeroes. However, with BZHI, if the
index is out of range, the carry flag is set and the output (%eax in
this case) is set equal to the input, which results in "(1 shl 32) -
1" returning $FFFFFFFF. I think the same thing happens with negative
indices, since BZHI is essentially unsigned (it also only reads the
least significant byte of the index register). With this in mind, and
"1 shl 32" being considered undefined for 32-bit operands, is this an
acceptable optimisation?
Kit
P.S. There is code in the compiler that catches undefined bitmasks and
simply sets it to all ones if the index is 32 or 64 or whatever the
integer word size is. If BZHI is used, a peephole or node
optimisation can be used to eliminate this catch since it becomes
unnecessary with BZHI.
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel