Re: [PATCH 1/3] aarch64: Add support for fp8 convert and scale

Kyrylo Tkachov Thu, 07 Nov 2024 01:03:56 -0800

Hi Saurabh,

> On 6 Nov 2024, at 11:03, saurabh....@arm.com wrote:
> 
> 
> The AArch64 FEAT_FP8 extension introduces instructions for conversion
> and scaling.
> 
> This patch introduces the following intrinsics:
> 1. vcvt{1|2}_{bf16|high_bf16|low_bf16}_mf8_fpm.
> 2. vcvt{q}_mf8_f16_fpm.
> 3. vcvt_{high}_mf8_f32_fpm.
> 4. vscale{q}_{f16|f32|f64}.
> 
> We introduced three new aarch64_builtin_signatures enum variants:
> 1. binary_fpm.
> 2. ternary_fpm.
> 3. unary_fpm.
> 
> We added support for these variants for declaring types and for expanding
> to RTL.
> 
> We added new simd_types for integers (s32, s32q, and s64q) and for
> fp8 (f8, and f8q).
> 
> Also changed the faminmax intrinsic instruction pattern so that it works
> better with the new fscale pattern.
> 
> Because we added support for fp8 intrinsics here, we modified the check
> in acle/fp8.c that was checking that __ARM_FEATURE_FP8 macro is not
> defined.
> 
> gcc/ChangeLog:
> 
> * config/aarch64/aarch64-builtins.cc
> (enum class): New variants to support new signatures.
> (aarch64_fntype): Handle new signatures.
> (aarch64_expand_pragma_builtin): Handle new signatures.
> * config/aarch64/aarch64-c.cc
> (aarch64_update_cpp_builtins): New flag for FP8.
> * config/aarch64/aarch64-simd-pragma-builtins.def
> (ENTRY_BINARY_FPM): Macro to declare binary fpm intrinsics.
> (ENTRY_TERNARY_FPM): Macro to declare ternary fpm intrinsics.
> (ENTRY_UNARY_FPM): Macro to declare unary fpm intrinsics.
> (ENTRY_VHSDF_VHSDI): Macro to declare binary intrinsics.
> * config/aarch64/aarch64-simd.md
> (@aarch64_<faminmax_uns_op><mode>): Renamed.
> (@aarch64_<faminmax_uns_op><VHSDF:mode><VHSDF:mode>): Renamed.
> (@aarch64_<fpm_uns_name><V8HFBF:mode><VB:mode>): Unary fpm
> pattern.
> (@aarch64_<fpm_uns_name><V8HFBF:mode><V16QI_ONLY:mode>): Unary
> fpm pattern.
> (@aarch64_<fpm_uns_name><VB:mode><VCVTFPM:mode><VH_SF:mode>):
> Binary fpm pattern.
> (@aarch64_<fpm_uns_name><V16QI_ONLY:mode><V8QI_ONLY:mode><V4SF_ONLY:mode><V4SF_ONLY:mode>):
> Ternary fpm pattern.
> (@aarch64_<fpm_uns_op><VHSDF:mode><VHSDI:mode>): Scale fpm
> pattern.
> * config/aarch64/iterators.md: New attributes and iterators.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/acle/fp8.c: Remove check that fp8 feature
> macro doesn't exist.
> * gcc.target/aarch64/simd/scale_fpm.c: New test.
> * gcc.target/aarch64/simd/vcvt_fpm.c: New test.
> 
> ---
> 
> I could not find a way to compress declarations in
> aarch64-simd-pragma-builtins.def for convert instructions as there was
> no pattern apart from the repetition for vcvt1/vcvt2 types. Let me know
> if those declarations can be expressed more concisely.
> 
> In the scale instructions, I am not doing any casting from float to int
> modes in the second operand. Let me know if that's a problem.
> ---
> gcc/config/aarch64/aarch64-builtins.cc        | 132 ++++++++++--
> gcc/config/aarch64/aarch64-c.cc               |   2 +
> .../aarch64/aarch64-simd-pragma-builtins.def  |  56 +++++
> gcc/config/aarch64/aarch64-simd.md            |  72 ++++++-
> gcc/config/aarch64/iterators.md               |  99 +++++++++
> gcc/testsuite/gcc.target/aarch64/acle/fp8.c   |  10 -
> .../gcc.target/aarch64/simd/scale_fpm.c       |  60 ++++++
> .../gcc.target/aarch64/simd/vcvt_fpm.c        | 197 ++++++++++++++++++
> 8 files changed, 603 insertions(+), 25 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/scale_fpm.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/vcvt_fpm.c
> 
> <0001-aarch64-Add-support-for-fp8-convert-and-scale.patch>


diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index cfe95bd4c31..87bbfb0e586 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -9982,13 +9982,13 @@
 )
 
 ;; faminmax
-(define_insn "@aarch64_<faminmax_uns_op><mode>"
+(define_insn "@aarch64_<faminmax_uns_op><VHSDF:mode><VHSDF:mode>"
   [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand" "w")
                       (match_operand:VHSDF 2 "register_operand" "w")]
                      FAMINMAX_UNS))]
   "TARGET_FAMINMAX"
-  "<faminmax_uns_op>\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
+  "<faminmax_uns_op>\t%0.<Vtype>, %1.<VHSDF:Vtype>, %2.<VHSDF:Vtype>"
 )
 
 (define_insn "*aarch64_faminmax_fused"
@@ -9999,3 +9999,71 @@
   "TARGET_FAMINMAX"
   "<faminmax_op>\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
 )
+
+;; fpm unary instructions.
+(define_insn "@aarch64_<fpm_uns_name><V8HFBF:mode><VB:mode>"
+  [(set (match_operand:V8HFBF 0 "register_operand" "=w")
+       (unspec:V8HFBF
+        [(match_operand:VB 1 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_UNARY_UNS))]
+  "TARGET_FP8"
+  "<fpm_uns_op>\t%0.<V8HFBF:Vtype>, %1.<VB:Vtype>"
+)
+
+;; fpm unary instructions, where the input is lowered from V16QI to
+;; V8QI.
+(define_insn "@aarch64_<fpm_uns_name><V8HFBF:mode><V16QI_ONLY:mode>"
+  [(set (match_operand:V8HFBF 0 "register_operand" "=w")
+       (unspec:V8HFBF
+        [(match_operand:V16QI_ONLY 1 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_UNARY_LOW_UNS))]
+  "TARGET_FP8"
+  {
+    operands[1] = force_lowpart_subreg (V8QImode,
+                                       operands[1],
+                                       recog_data.operand[1]->mode);

I don’t think this is needed. This code is only executed in the final assembly
output stage, and operand 1 is already printed explicitly with a “.8b” suffix,
so changing its mode here has no effect.

+    return "<fpm_uns_op>\t%0.<V8HFBF:Vtype>, %1.8b";
+  }
+)
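
As a rough, untested sketch of the suggestion above, the pattern could drop the
C block and use a plain output template. All names are taken from the patch;
the only change is the removal of the operand rewrite:

```
;; Sketch only (untested): same pattern as in the patch, minus the
;; force_lowpart_subreg call.  The ".8b" suffix is printed explicitly,
;; so the mode of operand 1 does not affect the assembly output.
(define_insn "@aarch64_<fpm_uns_name><V8HFBF:mode><V16QI_ONLY:mode>"
  [(set (match_operand:V8HFBF 0 "register_operand" "=w")
	(unspec:V8HFBF
	 [(match_operand:V16QI_ONLY 1 "register_operand" "w")
	  (reg:DI FPM_REGNUM)]
	FPM_UNARY_LOW_UNS))]
  "TARGET_FP8"
  "<fpm_uns_op>\t%0.<V8HFBF:Vtype>, %1.8b"
)
```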

+;; fpm ternary instructions.
+(define_insn
+  "@aarch64_<fpm_uns_name><V16QI_ONLY:mode><V8QI_ONLY:mode><V4SF_ONLY:mode><V4SF_ONLY:mode>"
+  [(set (match_operand:V16QI_ONLY 0 "register_operand" "=w")
+       (unspec:V16QI_ONLY
+        [(match_operand:V8QI_ONLY 1 "register_operand" "w")
+         (match_operand:V4SF_ONLY 2 "register_operand" "w")
+         (match_operand:V4SF_ONLY 3 "register_operand" "w")
+         (reg:DI FPM_REGNUM)]
+       FPM_TERNARY_VCVT_UNS))]
+  "TARGET_FP8"
+  {
+    operands[1] = force_reg (V16QImode, operands[1]);
+    return "<fpm_uns_op>\t%1.16b, %2.<V4SF_ONLY:Vtype>, %3.<V4SF_ONLY:Vtype>";
+  }
+)

The same applies to the force_reg call here. More worryingly, destination
operand 0 is never printed in the output template. Was one of the input
operands supposed to be tied to operand 0 in this pattern?
I haven’t looked deeply into what exactly these instructions do, but please
double-check the operands here.
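
If the instruction does write its result into the register that holds operand 1
(an assumption to verify against the architecture reference), one untested way
to express that is a matching constraint that ties operand 1 to operand 0, with
operand 0 printed in the template. The names are from the patch; the "0"
constraint and the %0.16b operand are illustrative only:

```
;; Sketch only (untested).  Assumes the instruction updates its
;; destination in place, so operand 1 must live in the same register
;; as operand 0.
(define_insn
  "@aarch64_<fpm_uns_name><V16QI_ONLY:mode><V8QI_ONLY:mode><V4SF_ONLY:mode><V4SF_ONLY:mode>"
  [(set (match_operand:V16QI_ONLY 0 "register_operand" "=w")
	(unspec:V16QI_ONLY
	 [(match_operand:V8QI_ONLY 1 "register_operand" "0") ;; tied to operand 0
	  (match_operand:V4SF_ONLY 2 "register_operand" "w")
	  (match_operand:V4SF_ONLY 3 "register_operand" "w")
	  (reg:DI FPM_REGNUM)]
	FPM_TERNARY_VCVT_UNS))]
  "TARGET_FP8"
  "<fpm_uns_op>\t%0.16b, %2.<V4SF_ONLY:Vtype>, %3.<V4SF_ONLY:Vtype>"
)
```

Whether a matching constraint between a V8QI input and a V16QI output is the
right representation is also worth checking.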
Thanks,
Kyrill
