On 09/02/2023 20:13, Andrew Jenner wrote:
This patch introduces instruction patterns for complex number operations
in the GCN machine description. These patterns are cmul, cmul_conj,
vec_addsub, vec_fmaddsub, vec_fmsubadd, cadd90, cadd270, cmla and cmls
(cmla_conj and cmls_conj were not found to be favorable to implement).
As a side effect of adding cmls, I also added fms patterns corresponding
to the existing fma patterns. Tested on CDNA2 GFX90a.
OK to commit?
gcc/ChangeLog:
* config/gcn/gcn-protos.h (gcn_expand_dpp_swap_pairs_insn)
(gcn_expand_dpp_distribute_even_insn)
(gcn_expand_dpp_distribute_odd_insn): Declare.
* config/gcn/gcn-valu.md (@dpp_swap_pairs<mode>)
(@dpp_distribute_even<mode>, @dpp_distribute_odd<mode>)
(cmul<conj_op><mode>3, cml<addsub_as><mode>4, vec_addsub<mode>3)
(cadd<rot><mode>3, vec_fmaddsub<mode>4, vec_fmsubadd<mode>4)
(fms<mode>4<exec>, fms<mode>4_negop2<exec>, fms<mode>4)
(fms<mode>4_negop2): New patterns.
* config/gcn/gcn.cc (gcn_expand_dpp_swap_pairs_insn)
(gcn_expand_dpp_distribute_even_insn)
(gcn_expand_dpp_distribute_odd_insn): New functions.
* config/gcn/gcn.md: Add entries to unspec enum.
gcc/testsuite/ChangeLog:
* gcc.target/gcn/complex.c: New test.
+;; It would be possible to represent these without the UNSPEC as
+;;
+;; (vec_merge
+;; (fma op1 op2 op3)
+;; (fma op1 op2 (neg op3))
+;; (merge-const))
+;;
+;; But this doesn't seem useful in practice.
+
+(define_expand "vec_fmaddsub<mode>4"
+ [(set (match_operand:V_noHI 0 "register_operand" "=&v")
+ (unspec:V_noHI
+ [(match_operand:V_noHI 1 "register_operand" "v")
+ (match_operand:V_noHI 2 "register_operand" "v")
+ (match_operand:V_noHI 3 "register_operand" "v")]
+ UNSPEC_FMADDSUB))]
This is a define_expand pattern that has a custom-code expansion with an
unconditional "DONE", so the actual RTL representation is irrelevant
here: it only needs to have the match_operand entries. The
UNSPEC_FMADDSUB is therefore dead (as in, it will never appear in the
IR). We can safely remove those, although I don't hate them for
readability purposes.
The UNSPEC_CMUL and UNSPEC_CMUL_CONJ are similarly "dead", but since you
use them for an iterator they're still useful in the machine description.
+(define_insn "fms<mode>4<exec>"
+ [(set (match_operand:V_FP 0 "register_operand" "= v, v")
+ (fma:V_FP
+ (match_operand:V_FP 1 "gcn_alu_operand" "% vA, vA")
+ (match_operand:V_FP 2 "gcn_alu_operand" " vA,vSvA")
+ (neg:V_FP
+ (match_operand:V_FP 3 "gcn_alu_operand" "vSvA, vA"))))]
+ ""
+ "v_fma%i0\t%0, %1, %2, -%3"
+ [(set_attr "type" "vop3a")
+ (set_attr "length" "8")])
Please ensure that the alternatives are vertically aligned in the same
style as the rest of the file.
+/* Generate DPP pairwise swap instruction.
+ The opcode is given by INSN. */
+
+char *
+gcn_expand_dpp_swap_pairs_insn (machine_mode mode, const char *insn,
+ int ARG_UNUSED (unspec))
....
+/* Generate DPP distribute even instruction.
+ The opcode is given by INSN. */
+
+char *
+gcn_expand_dpp_distribute_even_insn (machine_mode mode, const char *insn,
+ int ARG_UNUSED (unspec))
....
+/* Generate DPP distribute odd instruction.
+ The opcode is given by INSN. */
+
+char *
+gcn_expand_dpp_distribute_odd_insn (machine_mode mode, const char *insn,
+ int ARG_UNUSED (unspec))
Please add a comment that isn't just the function name in words. Explain
what operation happens here and maybe show an example of what it produces.
+++ b/gcc/testsuite/gcc.target/gcn/complex.c
@@ -0,0 +1,640 @@
+// { dg-do run }
+// { dg-options "-O -fopenmp-simd -ftree-loop-if-convert -fno-ssa-phiopt" }
Does the -fopenmp-simd option do anything here? There are no "omp
declare simd" directives.
+void cmulF(float *td, float *te, float *tf, float *tg, int tas)
+{
+ typedef _Complex float complexT;
+ int array_size = tas/2;
+ complexT *d = (complexT*)(td);
+ complexT *e = (complexT*)(te);
+ complexT *f = (complexT*)(tf);
+#pragma omp target teams distribute parallel for simd
+ for (int i = 0; i < array_size; i++)
+ {
+ d[i] = e[i] * f[i];
+ }
+}
Tests in gcc.target/gcn won't do anything with "omp target" directives.
I would expect the loop to vectorize without, at -O2 or above (or "-O1
-ftree-vectorize"), but you might find the output easier to read with
"__restrict" on the parameters as that will avoid emitting the runtime
alias check and scalar code implementation.
I'd also expect you to have to do something to avoid inlining.
+ td = (float*)omp_aligned_alloc(ALIGNMENT, sizeof(float)*array_size,
omp_default_mem_alloc);
+ te = (float*)omp_aligned_alloc(ALIGNMENT, sizeof(float)*array_size,
omp_default_mem_alloc);
+ tf = (float*)omp_aligned_alloc(ALIGNMENT, sizeof(float)*array_size,
omp_default_mem_alloc);
+ tg = (float*)omp_aligned_alloc(ALIGNMENT, sizeof(float)*array_size,
omp_default_mem_alloc);
There's no need to use libgomp to allocate memory on the device --
malloc works just fine -- and it doesn't need to be specifically aligned
unless you're wanting performance.
In general I'm confused by this testcase because it looks like it was
written for an offloading toolchain, but it's placed into the
bare-machine GCN testsuite.
Andrew