> On 3 Oct 2024, at 16:41, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Soumya AR <soum...@nvidia.com> writes:
>> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001
>> From: Soumya AR <soum...@nvidia.com>
>> Date: Thu, 3 Oct 2024 11:53:07 +0530
>> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction
>> 
>> This patch uses the FSCALE instruction provided by SVE to implement the
>> standard ldexp family of functions.
>> 
>> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
>> following code:
>> 
>> float
>> test_ldexpf (float x, int i)
>> {
>>      return __builtin_ldexpf (x, i);
>> }
>> 
>> double
>> test_ldexp (double x, int i)
>> {
>>      return __builtin_ldexp (x, i);
>> }
>> 
>> GCC Output:
>> 
>> test_ldexpf:
>>      b ldexpf
>> 
>> test_ldexp:
>>      b ldexp
>> 
>> Since SVE provides an FSCALE instruction, we can use it on scalar floats by
>> moving them into vector registers and performing an FSCALE, similar to how
>> LLVM handles the ldexp builtins.
>> 
>> New Output:
>> 
>> test_ldexpf:
>>      fmov s31, w0
>>      ptrue p7.b, all
>>      fscale z0.s, p7/m, z0.s, z31.s
>>      ret
>> 
>> test_ldexp:
>>      sxtw x0, w0
>>      ptrue p7.b, all
>>      fmov d31, x0
>>      fscale z0.d, p7/m, z0.d, z31.d
>>      ret
>> 
>> The patch was bootstrapped and regtested on aarch64-linux-gnu with no regressions.
>> OK for mainline?
> 
> Could we also use the .H form for __builtin_ldexpf16?
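
(For reference, the .H testcase would presumably mirror the existing ones,
something like:

_Float16
test_ldexpf16 (_Float16 x, int i)
{
  return __builtin_ldexpf16 (x, i);
}

with a matching scan-assembler pattern for the z[0-9]+\.h operands.
Just a sketch on my part, with a test name of my own choosing.)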
> 
> I suppose:
> 
>> @@ -2286,7 +2289,8 @@
>>                       (VNx8DI "VNx2BI") (VNx8DF "VNx2BI")
>>                       (V8QI "VNx8BI") (V16QI "VNx16BI")
>>                       (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI")
>> -                      (V4SI "VNx4BI") (V2DI "VNx2BI")])
>> +                      (V4SI "VNx4BI") (V2DI "VNx2BI")
>> +                      (SF "VNx4BI") (DF "VNx2BI")])
> 
> ...this again raises the question of what we should do for predicate
> modes when the data mode isn't a natural SVE mode.  That came up
> recently in relation to V1DI in the popcount patch, and for reductions
> in the ANDV etc. patch.

Thank you for enumerating the options below.

> 
> Three obvious options are:
> 
> (1) Use the nearest SVE mode with a full ptrue (as the patch does).
> (2) Use the nearest SVE mode with a 128-bit ptrue.
> (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI.  (And possibly BI
>    for scalars.)

Just to be clear, what do you mean by “nearest SVE mode” in this context?


> 
> The problem with (1) is that, as Tamar pointed out, it doesn't work
> properly with reductions.  It also isn't safe for this patch (without
> fast-mathy options) because of FP exceptions.  Although writing to
> a scalar FP register zeros the upper bits, and so gives a "safe" value
> for this particular operation, nothing guarantees that all SF and DF
> values have this zero-extended form.  They could come from subregs of
> Advanced SIMD or SVE vectors.  The ABI also doesn't guarantee that
> incoming SF and DF values are zero-extended.
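
(A concrete sketch of the subreg case, using the standard arm_neon.h
intrinsics, would be something like:

#include <arm_neon.h>

float
first_lane_ldexp (float32x4_t v, int i)
{
  /* The SF operand can be taken directly as the low lane of v, so the
     upper bits of the register hold arbitrary vector data that a
     full-ptrue FSCALE would also operate on.  */
  return __builtin_ldexpf (vgetq_lane_f32 (v, 0), i);
}

The function name is my own, purely for illustration.)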
> 
> (2) would be safe, but would mean that we continue to have an nunits
> disagreement between the data mode and the predicate mode.  This would
> prevent operations being described in generic RTL in future.
> 
> (3) is probably the cleanest representational approach, but has the problem
> that we cannot store a fixed-length portion of an SVE predicate.
> We would have to load and store the modes via other register classes.
> (With PMOV we could use scalar FP loads and stores, but otherwise
> we'd probably need secondary memory reloads.)  That said, we could
> tell the RA to spill in a full predicate mode, so this shouldn't be
> a problem unless the modes somehow get exposed to gimple or frontends.
> 
> WDYT?

IMO option (2) sounds the most appealing at this stage. To me it feels
conceptually straightforward, as we are using an SVE operation clamped to
128 bits to “emulate” what should have been a 128-bit fixed-width mode
operation.
It also feels that, given the complexity of (3) and of introducing new
modes, we should go for (3) only if/when we decide to implement these
operations in generic RTL.
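
Concretely, for the SF case I'd expect (2) to emit the same sequence as
the patch does now, but with the predicate clamped to the low 128 bits,
something like:

	ptrue	p7.b, vl16
	fscale	z0.s, p7/m, z0.s, z31.s

(a sketch only; the vl16 pattern limits the ptrue to the first 16 byte
lanes, i.e. 128 bits).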

Thanks,
Kyrill

> 
> Richard
> 
>> ;; ...and again in lower case.
>> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi")
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> new file mode 100644
>> index 00000000000..251b4ef9188
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c
>> @@ -0,0 +1,16 @@
>> +/* { dg-do compile } */
>> +/* { dg-additional-options "-Ofast" } */
>> +
>> +float
>> +test_ldexpf (float x, int i)
>> +{
>> +  return __builtin_ldexpf (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
>> +
>> +double
>> +test_ldexp (double x, int i)
>> +{
>> +  return __builtin_ldexp (x, i);
>> +}
>> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
