https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, both of the following patches should fix it IMHO, but I have no idea which
one, if any, is right.
With the first patch:
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md    2019-01-18 18:07:37.194899062 +0100
@@ -4356,7 +4356,9 @@
   ""
   [(const_int 0)]
 {
-  rtx hi = gen_highpart (DFmode, operands[1]);
+  rtx hi = (BYTES_BIG_ENDIAN
+           ? gen_highpart (DFmode, operands[1])
+           : gen_lowpart (DFmode, operands[1]));
   rtx lo = (GET_CODE (operands[2]) == SCRATCH)
            ? gen_reg_rtx (DFmode)
            : operands[2];

the assembly changes:
--- reduction-3.s1      2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s2      2019-01-18 18:10:20.617233358 +0100
@@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
        addi 9,9,16
        bdnz .L2
         # vec_extract to same register
-       lfd 12,-8(1)
+       lfd 12,-16(1)
        xsmaxdp 0,12,0
        stfd 0,0(10)
        blr
With the second patch:
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md    2019-01-18 18:16:30.680186709 +0100
@@ -4361,7 +4361,9 @@
            ? gen_reg_rtx (DFmode)
            : operands[2];

-  emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
+  emit_insn (gen_vsx_extract_v2df (lo, operands[1],
+                                  BYTES_BIG_ENDIAN
+                                  ? const1_rtx : const0_rtx));
   emit_insn (gen_<VEC_reduc_rtx>df3 (operands[0], hi, lo));
   DONE;
 }
the assembly changes:
--- reduction-3.s1      2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s3      2019-01-18 18:17:18.977397458 +0100
@@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
        xxpermdi 0,0,0,2
        addi 9,9,16
        bdnz .L2
-        # vec_extract to same register
+       xxpermdi 0,0,0,3
        lfd 12,-8(1)
        xsmaxdp 0,12,0
        stfd 0,0(10)

So, judging from this testcase alone, the first patch seems to be more
efficient, though I'm still unsure about that. Since it goes through memory in
either case, wouldn't it be better to emit an xxpermdi from 0 to 12 that swaps
the two elements instead of loading it from memory?
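
To make the element bookkeeping behind the two patches easier to follow, here
is a small standalone C model. It is only a sketch, not GCC code: the names
highpart_elt, lowpart_elt, red_ops, ops_unpatched, ops_patch1, ops_patch2 and
covers_both_elements are all made up for illustration. It assumes that the
index passed to gen_vsx_extract_v2df is the element number in GCC vector order,
and that the high-order half of the register holds element 0 on big-endian and
element 1 on little-endian.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not GCC code.  All names here are hypothetical.
   Elements are numbered in GCC vector order (element 0 at the lowest
   address).  Assumption: the high-order half of the register holds
   element 0 on big-endian and element 1 on little-endian.  */
static int
highpart_elt (bool big_endian)
{
  return big_endian ? 0 : 1;
}

static int
lowpart_elt (bool big_endian)
{
  return big_endian ? 1 : 0;
}

/* Which vector element each operand of the reduction refers to.  */
struct red_ops { int hi; int lo; };

/* Unpatched splitter: hi = gen_highpart, lo = extract of element 1.  */
static struct red_ops
ops_unpatched (bool big_endian)
{
  struct red_ops r = { highpart_elt (big_endian), 1 };
  return r;
}

/* First patch: hi = gen_lowpart on little-endian, lo unchanged.  */
static struct red_ops
ops_patch1 (bool big_endian)
{
  struct red_ops r = { big_endian ? highpart_elt (big_endian)
                                  : lowpart_elt (big_endian), 1 };
  return r;
}

/* Second patch: hi unchanged, lo = extract of element 0 on little-endian.  */
static struct red_ops
ops_patch2 (bool big_endian)
{
  struct red_ops r = { highpart_elt (big_endian), big_endian ? 1 : 0 };
  return r;
}

/* The reduction is only correct if hi and lo are different elements.  */
static bool
covers_both_elements (struct red_ops r)
{
  assert (r.hi >= 0 && r.hi <= 1 && r.lo >= 0 && r.lo <= 1);
  return r.hi != r.lo;
}
```

Under those assumptions, covers_both_elements (ops_unpatched (false)) is false,
because on little-endian both hi and lo refer to element 1 and element 0 never
takes part in the reduction. Both patched variants yield two distinct elements
for either endianness.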
