[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

rguenth at gcc dot gnu.org Thu, 25 Jan 2018 05:04:51 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #43084|0                           |1
        is obsolete|                            |

--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 43238
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43238&action=edit
updated patch for SLP costing

You are right, the patch contained several errors.  The multiple_of_p is
supposed to handle the case where we know all vectors will be equal, like when
group_size is two and const_nunits is 4.  The arguments were swapped.  Also the
loop over the elements was bogus.  I've corrected this with the attached
updated patch.
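
To illustrate the "all vectors will be equal" case, here is a minimal sketch in
plain C.  It is not the code from the patch: it assumes the group's scalars are
repeated cyclically to fill each vector, and source_index and all_vectors_equal
are hypothetical helper names.  Under that assumption the vectors are identical
when const_nunits is a multiple of group_size, which is the argument order the
check needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: lane LANE of vector VEC is filled from group element
   (VEC * nunits + LANE) % group_size, i.e. the group repeats cyclically.  */
static int
source_index (int vec, int lane, int nunits, int group_size)
{
  assert (nunits > 0 && group_size > 0);
  return (vec * nunits + lane) % group_size;
}

/* Return true if every vector draws the same group elements, lane by lane,
   as vector 0.  */
static bool
all_vectors_equal (int nvectors, int nunits, int group_size)
{
  for (int v = 1; v < nvectors; v++)
    for (int l = 0; l < nunits; l++)
      if (source_index (v, l, nunits, group_size)
          != source_index (0, l, nunits, group_size))
        return false;
  return true;
}
```

With group_size 2 and nunits 4, 4 % 2 == 0 and all_vectors_equal returns true,
so a single construction could be shared.  With the arguments swapped the test
would be 2 % 4, which is non-zero, and the case would be missed.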

This now costs two vector constructions for the testcase as expected but still:

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 32
  Vector prologue cost: 64
  Vector epilogue cost: 0
  Scalar cost of basic block: 192
t.c:32:12: note: Basic block will be vectorized using SLP

that's for two aligned stores and two 8 element vector constructions.  We're
offsetting 16 scalar stores after all...  They each seem to cost 12 while
an aligned vector store costs 16.  And the vector constructions cost 32 each
(8 times an SSE op costing 4, aka "element insert").
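
The arithmetic behind the dump can be reproduced with a small sketch.  The
per-operation costs are the ones quoted above, not values read from GCC, the
helper names are made up for illustration, and it assumes the two constructions
are what the dump reports as prologue cost.

```c
#include <assert.h>

/* Per-operation costs as quoted in the comment (assumed, not read from GCC).  */
enum
{
  SCALAR_STORE_COST = 12,
  ALIGNED_VECTOR_STORE_COST = 16,
  ELEMENT_INSERT_COST = 4
};

/* One vector built from scalars: one element insert per lane.  */
static int
vec_construct_cost (int nelts)
{
  assert (nelts > 0);
  return nelts * ELEMENT_INSERT_COST;
}

/* "Vector inside of basic block cost": the aligned vector stores.  */
static int
vector_inside_cost (int nstores)
{
  return nstores * ALIGNED_VECTOR_STORE_COST;
}

/* "Vector prologue cost": the vector constructions.  */
static int
vector_prologue_cost (int nvectors, int nelts)
{
  return nvectors * vec_construct_cost (nelts);
}

/* "Scalar cost of basic block": the scalar stores being replaced.  */
static int
scalar_block_cost (int nstores)
{
  return nstores * SCALAR_STORE_COST;
}
```

Two stores give 2 * 16 = 32, two 8 element constructions give 2 * 32 = 64, and
16 scalar stores give 16 * 12 = 192, so the vector total of 96 still compares
favorably against 192.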

The only thing I notice is that

ix86_vec_cost (machine_mode mode, int cost, bool parallel)
{
  if (!VECTOR_MODE_P (mode))
    return cost;

  if (!parallel)
    return cost * GET_MODE_NUNITS (mode);
  if (GET_MODE_BITSIZE (mode) == 128
      && TARGET_SSE_SPLIT_REGS)
    return cost * 2;
  if (GET_MODE_BITSIZE (mode) > 128
      && TARGET_AVX128_OPTIMAL)
    return cost * GET_MODE_BITSIZE (mode) / 128;
  return cost;
}

all the pessimizing for TARGET_SSE_SPLIT_REGS/TARGET_AVX128_OPTIMAL isn't
applied to the !parallel case.  But they wouldn't apply to AVX512 AFAICS.
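
A standalone model of the control flow shows the effect.  The structs and
model_vec_cost below are hypothetical stand-ins for machine_mode and the
TARGET_* tuning macros, so this is only a sketch of the listing above, not the
backend code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the properties queried from machine_mode.  */
struct mode_model
{
  bool is_vector;   /* VECTOR_MODE_P */
  int nunits;       /* GET_MODE_NUNITS */
  int bitsize;      /* GET_MODE_BITSIZE */
};

/* Hypothetical stand-in for the two tuning flags.  */
struct tuning_model
{
  bool sse_split_regs;   /* TARGET_SSE_SPLIT_REGS */
  bool avx128_optimal;   /* TARGET_AVX128_OPTIMAL */
};

static int
model_vec_cost (struct mode_model m, struct tuning_model t,
                int cost, bool parallel)
{
  assert (m.bitsize > 0);
  if (!m.is_vector)
    return cost;
  /* Returns here for !parallel, before either tuning flag is looked at.  */
  if (!parallel)
    return cost * m.nunits;
  if (m.bitsize == 128 && t.sse_split_regs)
    return cost * 2;
  if (m.bitsize > 128 && t.avx128_optimal)
    return cost * m.bitsize / 128;
  return cost;
}
```

For an 8 element mode and cost 4 the !parallel result is 32 whether or not the
tuning flags are set, while the parallel result for a 256 bit mode doubles from
4 to 8 once the avx128_optimal flag is on.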
