Kyrylo Tkachov <[email protected]> writes:
> Hi Richard,
>
>> On 22 Jan 2025, at 13:21, Richard Sandiford <[email protected]>
>> wrote:
>>
>> GCC 15 is the first release to support FP8 intrinsics.
>> The underlying instructions depend on the value of a new register,
>> FPMR. Unlike FPCR, FPMR is a normal call-clobbered/caller-save
>> register rather than a global register. So:
>>
>> - The FP8 intrinsics take a final uint64_t argument that
>> specifies what value FPMR should have.
>>
>> - If an FP8 operation is split across multiple functions,
>> it is likely that those functions would have a similar argument.
>>
>> If the object code has the structure:
>>
>> for (...)
>> fp8_kernel (..., fpmr_value);
>>
>> then fp8_kernel would set FPMR to fpmr_value each time it is
>> called, even though FPMR will already have that value for at
>> least the second and subsequent calls (and possibly the first).
>>
>> The working assumption for the ABI has been that writes to
>> registers like FPMR can in general be more expensive than
>> reads and so it would be better to use a conditional write like:
>>
>> mrs tmp, fpmr
>> cmp tmp, <value>
>> beq 1f
>> nsr fpmr, <value>
>
> Typo “msr” here and in the comment in the code.
Oops, thanks, will fix.
> [...]
>> @@ -1883,6 +1884,44 @@ (define_split
>> }
>> )
>>
>> +;; The preferred way of writing to the FPMR is to test whether it already
>> +;; has the desired value and branch around the write if so. This reduces
>> +;; the number of redundant FPMR writes caused by ABI boundaries, such as in:
>> +;;
>> +;; for (...)
>> +;; fp8_kernel (..., fpmr_value);
>> +;;
>> +;; Without this optimization, fp8_kernel would set FPMR to fpmr_value each
>> +;; time that it is called.
>> +;;
>> +;; We do this as a split so that hardreg_pre can optimize the moves first.
>> +(define_split
>> + [(set (reg:DI FPM_REGNUM)
>> + (match_operand:DI 0 "aarch64_reg_or_zero"))]
>> + "TARGET_FP8 && !TARGET_CHEAP_FPMR_WRITE && can_create_pseudo_p ()"
>> + [(const_int 0)]
>> + {
>> + auto label = gen_label_rtx ();
>> + rtx current = copy_to_reg (gen_rtx_REG (DImode, FPM_REGNUM));
>> + rtx cond = gen_rtx_EQ (VOIDmode, current, operands[0]);
>> + emit_jump_insn (gen_cbranchdi4 (cond, current, operands[0], label));
>
> Do you think it’s worth marking this jump as likely?
> In some other expand code in the backend where we emit jumps we sometimes use
> aarch64_emit_unlikely_jump.
Ah, yeah, I should have said that I'd wondered about that. But in the
end it didn't seem appropriate. Given that hardreg_pre should remove
local instances of redundancy, we don't really have much information
about whether the branch is likely or unlikely. I think instead the
hope/expectation is that the branch has a predictable pattern.
Thanks,
Richard