> On 3 Oct 2024, at 16:41, Richard Sandiford <richard.sandif...@arm.com> wrote: > > External email: Use caution opening links or attachments > > > Soumya AR <soum...@nvidia.com> writes: >> From 7fafcb5e0174c56205ec05406c9a412196ae93d3 Mon Sep 17 00:00:00 2001 >> From: Soumya AR <soum...@nvidia.com> >> Date: Thu, 3 Oct 2024 11:53:07 +0530 >> Subject: [PATCH] aarch64: Optimise calls to ldexp with SVE FSCALE instruction >> >> This patch uses the FSCALE instruction provided by SVE to implement the >> standard ldexp family of functions. >> >> Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the >> following code: >> >> float >> test_ldexpf (float x, int i) >> { >> return __builtin_ldexpf (x, i); >> } >> >> double >> test_ldexp (double x, int i) >> { >> return __builtin_ldexp(x, i); >> } >> >> GCC Output: >> >> test_ldexpf: >> b ldexpf >> >> test_ldexp: >> b ldexp >> >> Since SVE has support for an FSCALE instruction, we can use this to process >> scalar floats by moving them to a vector register and performing an fscale >> call, >> similar to how LLVM tackles an ldexp builtin as well. >> >> New Output: >> >> test_ldexpf: >> fmov s31, w0 >> ptrue p7.b, all >> fscale z0.s, p7/m, z0.s, z31.s >> ret >> >> test_ldexp: >> sxtw x0, w0 >> ptrue p7.b, all >> fmov d31, x0 >> fscale z0.d, p7/m, z0.d, z31.d >> ret >> >> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. >> OK for mainline? > > Could we also use the .H form for __builtin_ldexpf16? > > I suppose: > >> @@ -2286,7 +2289,8 @@ >> (VNx8DI "VNx2BI") (VNx8DF "VNx2BI") >> (V8QI "VNx8BI") (V16QI "VNx16BI") >> (V4HI "VNx4BI") (V8HI "VNx8BI") (V2SI "VNx2BI") >> - (V4SI "VNx4BI") (V2DI "VNx2BI")]) >> + (V4SI "VNx4BI") (V2DI "VNx2BI") >> + (SF "VNx4BI") (DF "VNx2BI")]) > > ...this again raises the question what we should do for predicate > modes when the data mode isn't a natural SVE mode. That came up > recently in relation to V1DI in the popcount patch, and for reductions > in the ANDV etc. patch.
Thanks you for enumerating the options below. > > Three obvious options are: > > (1) Use the nearest SVE mode with a full ptrue (as the patch does). > (2) Use the nearest SVE mode with a 128-bit ptrue. > (3) Add new modes V16BI, V8BI, V4BI, V2BI, and V1BI. (And possibly BI > for scalars.) Just to be clear, what do you mean by “nearest SVE mode” in this context? > > The problem with (1) is that, as Tamar pointed out, it doesn't work > properly with reductions. It also isn't safe for this patch (without > fast-mathy options) because of FP exceptions. Although writing to > a scalar FP register zeros the upper bits, and so gives a "safe" value > for this particular operation, nothing guarantees that all SF and DF > values have this zero-extended form. They could come from subregs of > Advanced SIMD or SVE vectors. The ABI also doesn't guarantee that > incoming SF and DF values are zero-extended. > > (2) would be safe, but would mean that we continue to have an nunits > disagreement between the data mode and the predicate mode. This would > prevent operations being described in generic RTL in future. > > (3) is probably the cleanest representional approach, but has the problem > that we cannot store a fixed-length portion of an SVE predicate. > We would have to load and store the modes via other register classes. > (With PMOV we could use scalar FP loads and stores, but otherwise > we'd probably need secondary memory reloads.) That said, we could > tell the RA to spill in a full predicate mode, so this shouldn't be > a problem unless the modes somehow get exposed to gimple or frontends. > > WDYT? IMO option (2) sounds the more appealing at this stage. To me it feels conceptually straightforward as we are using a SVE operation clamped at 128 bits to “emulate” what should have been an 128-bit fixed-width mode operation. It also feels that, given the complexity of (3) and introducing new modes, we should go for (3) only if/when we do decide to implement these ops with generic RTL. Thanks, Kyrill > > Richard > >> ;; ...and again in lower case. >> (define_mode_attr vpred [(VNx16QI "vnx16bi") (VNx8QI "vnx8bi") >> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> new file mode 100644 >> index 00000000000..251b4ef9188 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/sve/fscale.c >> @@ -0,0 +1,16 @@ >> +/* { dg-do compile } */ >> +/* { dg-additional-options "-Ofast" } */ >> + >> +float >> +test_ldexpf (float x, int i) >> +{ >> + return __builtin_ldexpf (x, i); >> +} >> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.s, p[0-7]/m, >> z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ >> + >> +double >> +test_ldexp (double x, int i) >> +{ >> + return __builtin_ldexp (x, i); >> +} >> +/* { dg-final { scan-assembler-times {\tfscale\tz[0-9]+\.d, p[0-7]/m, >> z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */