[Bug target/101296] Addition of x86 addsub SLP patterned slowed down 433.milc by 12% on znver2 with -Ofast -flto

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 06 Jul 2021 06:03:02 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so one interesting difference is the following (these are all of the
-fopt-info-vec differences):

-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
+m_mat_nn.c:90:17: optimized: basic block part vectorized using 16 byte vectors

The +m_mat_nn.c:90:17 is mult_su3_nn while the -s_m_a_mat.c:18:18 is
scalar_mult_add_su3_matrix, which is inlined at all call sites.  The missing
cases are all inlined into the function update_u.
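
For orientation, here is a minimal sketch of what the inlined helper appears
to compute, inferred only from the update_u statements quoted further down
(result = link element + t5 * temp1 element, separately for .real and .imag).
The type layout, parameter names and loop structure are assumptions for
illustration, not the 433.milc source:

```c
#include <assert.h>

/* Assumed layout: field names (.e, .real, .imag) follow the dump,
   the 3x3 shape and the type name complex_t are hypothetical. */
typedef struct { double real, imag; } complex_t;
typedef struct { complex_t e[3][3]; } su3_matrix;

/* c = a + s * b, element-wise on real and imaginary parts.
   In the dump, a corresponds to the link[dir] load, b to temp1,
   s to t5 and c to temp2. */
static void scalar_mult_add_su3_matrix(const su3_matrix *a,
                                       const su3_matrix *b,
                                       double s, su3_matrix *c)
{
    assert(a && b && c);
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            c->e[i][j].real = a->e[i][j].real + s * b->e[i][j].real;
            c->e[i][j].imag = a->e[i][j].imag + s * b->e[i][j].imag;
        }
    }
}
```

Once fully unrolled this gives 18 independent add/multiply pairs over adjacent
doubles, which is the shape basic-block SLP looks for.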

The odd thing is that we're seeing changes in .vect of update_u like

@@ -3426,46 +3334,40 @@
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   # DEBUG BEGIN_STMT
-  _918 = MEM <struct site> [(struct su3_matrix *)s_103].link[dir_67].e[0][0].real;
   _919 = temp1.e[0][0].real;
   _920 = t5_12 * _919;
-  _921 = _918 + _920;
+  _921 = _920 + _1023;
   temp2.e[0][0].real = _921;
   # DEBUG BEGIN_STMT
-  _923 = MEM <struct site> [(struct su3_matrix *)s_103].link[dir_67].e[0][0].imag;
   _924 = temp1.e[0][0].imag;
   _925 = t5_12 * _924;
-  _926 = _923 + _925;
+  _926 = _925 + _1028;
...

which in the end results in fewer DRs (data references) going into SLP and
thus a different outcome there.
This difference starts in the cunrolli dump!?  Dump differences are like

+ipa-modref: call stmt mult_su3_nn (&htemp, link_24, &temp1);
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: temp2 alias sets: 6->5
...
 Value numbering stmt = _938 = link_24->e[i_915][2].real;
-Setting value number of _938 to _938 (changed)
-Making available beyond BB152 _938 for value _938
+ipa-modref: call stmt mult_su3_nn (&htemp, &temp2, &temp1);
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM <struct site> [(struct su3_matrix *)s_5] alias sets: 6->5
+ipa-modref: call stmt mult_su3_nn (&htemp, link_24, &temp1);
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM <struct site> [(struct su3_matrix *)s_5] alias sets: 6->5
+Setting value number of _938 to _1043 (changed)
+_1043 is available for _1043
+Replaced link_24->e[i_915][2].real with _1043 in all uses of _938 = link_24->e[i_915][2].real;

It's really odd: the WPA and LTRANS modref dumps do not show any difference,
but the above looks as if the IPA summary is available in one case and not in
the other.  Ah, the late modref pass results spill over and it looks like we
"improve" here:

   loads:
     Limits: 32 bases, 16 refs
-      Base 0: alias set 6
+      Base 0: alias set 5
+        Ref 0: alias set 5
+          Every access
+      Base 1: alias set 6
         Ref 0: alias set 5
           Every access
   stores:
     Limits: 32 bases, 16 refs
-      Base 0: alias set 6
+      Base 0: alias set 5
         Ref 0: alias set 5
-          Every access
+          access: Parm 2 param offset:0 offset:0 size:128 max_size:128
+          access: Parm 2 param offset:16 offset:0 size:128 max_size:128
+          access: Parm 2 param offset:48 offset:0 size:128 max_size:128
+          access: Parm 2 param offset:64 offset:0 size:128 max_size:128
+          access: Parm 2 param offset:112 offset:0 size:128 max_size:128
+      Base 1: alias set 6
+        Ref 0: alias set 5
+          access: Parm 2 param offset:0 offset:256 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:320 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:640 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:704 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:768 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:832 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:1024 size:64 max_size:64
+          access: Parm 2 param offset:0 offset:1088 size:64 max_size:64
   parm 0 flags: nodirectescape
   parm 1 flags: nodirectescape
   parm 2 flags: direct noescape nodirectescape
 void mult_su3_nn (struct su3_matrix * a, struct su3_matrix * b, struct su3_matrix * c)

I'm not sure what "Every access" means, but I suppose it's "bad" here.  Maybe
it comes from

  - Analyzing load: b_10(D)->e[2][1].real
    - Recording base_set=6 ref_set=5 parm=1
---param param=modref-max-accesses limit reached
  - Analyzing load: b_10(D)->e[2][1].imag
    - Recording base_set=6 ref_set=5 parm=1
... (a lot) ...
+--param param=modref-max-accesses limit reached
  - Analyzing load: a_7(D)->e[1][1].imag
    - Recording base_set=6 ref_set=5 parm=0
  - ECF_CONST | ECF_NOVOPS, ignoring all stores and all loads except for args.

so possibly vectorizing helps reduce the number of accesses, and that is why
we run into this case?  Using --param modref-max-accesses=64 avoids the
differences in vectorizing, apart from the expected

+m_mat_nn.c:90:17: optimized: basic block part vectorized using 16 byte vectors

-fno-ipa-modref does the trick as well.  But unfortunately neither manages
to produce binaries that fix the runtime difference or make the perf
report any clearer :/
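
One possible reading of "Every access", stated as an assumption rather than
taken from the modref sources: a summary lists individual accesses only up to
a limit and then falls back to "this may touch anything through this base/ref
pair".  The toy model below shows why fewer, wider (vectorized) accesses could
keep a summary precise.  All names (toy_ref, toy_record, toy_may_access,
TOY_MAX_ACCESSES) are hypothetical and this is not GCC's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for --param modref-max-accesses. */
#define TOY_MAX_ACCESSES 4

/* One recorded access: parameter index plus bit offset and bit size. */
struct toy_access { int parm; long offset; long size; };

struct toy_ref {
    struct toy_access accesses[TOY_MAX_ACCESSES];
    size_t n;
    bool every_access;   /* set once the precise list was given up */
};

/* Record one access; returns false if the summary is (now) collapsed. */
static bool toy_record(struct toy_ref *ref, struct toy_access a)
{
    assert(ref != NULL);
    if (ref->every_access)
        return false;
    for (size_t i = 0; i < ref->n; i++)
        if (ref->accesses[i].parm == a.parm
            && ref->accesses[i].offset == a.offset
            && ref->accesses[i].size == a.size)
            return true;                 /* already known */
    if (ref->n == TOY_MAX_ACCESSES) {    /* limit reached: collapse */
        ref->n = 0;
        ref->every_access = true;
        return false;
    }
    ref->accesses[ref->n++] = a;
    return true;
}

/* May the summarized function touch bits [offset, offset + size) of parm? */
static bool toy_may_access(const struct toy_ref *ref, int parm,
                           long offset, long size)
{
    if (ref->every_access)
        return true;                     /* conservative answer */
    for (size_t i = 0; i < ref->n; i++) {
        const struct toy_access *a = &ref->accesses[i];
        if (a->parm == parm
            && a->offset < offset + size
            && offset < a->offset + a->size)
            return true;
    }
    return false;
}
```

In this model, four 64-bit scalar stores fill the list and a fifth collapses
it, while the same bytes written as two 128-bit vector stores plus one scalar
store stay below the limit and keep the summary queryable.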
