https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #3)
> (In reply to Tamar Christina from comment #2)
> > I thought the SLP algorithm was bottom up and stores were
> > already sinks?
> Yeah, they are.  But the point is that we're vectorising
> the stores in isolation, with no knowledge of what happens
> later.  The reason the code here is particularly bad is
> that the array is later loaded into a vector.  But the
> vectoriser doesn't know that.
> 

Ah right, you meant using the loads as the seeds. Yeah, that makes sense.

> > Ah, guess there are two problems.
> > 
> > 1. how did we end up with such poor scalar code, at least 5 instructions are
> > unneeded (separate issue)
> > 2. The costing of the above, I guess I'm still slightly confused how we got
> > to that cost
> The patch that introduced the regression uses an on-the-side costing
> scheme for store sequences.  If it thinks that the scalar code is
> better, it manipulates the vector body cost so that the body is twice
> as expensive as the scalar body.  The prologue cost (1 for the
> scalar_to_vec) is then added on top.

Ah, that makes sense.
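To make sure I follow the arithmetic: a minimal sketch of that on-the-side scheme as I understand it (the function name and inputs are mine, not the GCC internals; the numbers are the ones quoted in this thread):

```python
# Hedged sketch of the store-sequence costing described above.
# If the on-the-side model decides the scalar code is better, the vector
# body cost is overridden to be twice the scalar body cost, and the
# prologue cost (the scalar_to_vec) is then added on top.

def adjusted_vector_cost(scalar_body_cost, vector_body_cost,
                         scalar_to_vec_cost, scalar_wins):
    body = vector_body_cost
    if scalar_wins:
        # Manipulate the vector body so it is twice the scalar body.
        body = 2 * scalar_body_cost
    return body + scalar_to_vec_cost

# Generic tuning, scalar_to_vec = 1: 2*4 + 1 = 9
print(adjusted_vector_cost(4, 3, 1, True))  # -> 9
# neoverse-v1/-n2 tuning, scalar_to_vec = 4: 2*4 + 4 = 12
print(adjusted_vector_cost(4, 3, 4, True))  # -> 12
```

Assuming a scalar body cost of 4, that reproduces the 9-vs-12 totals mentioned below for generic versus neoverse tuning.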

> > If it's costing purely on latency then the two are equivalent, no? If you
> > take throughput into account the first would win, but the difference in
> > costs is still a lot higher than I would have expected.
> > 
> > In this case:
> > 
> > node 0x4f45480 1 times scalar_to_vec costs 4 in prologue
> > 
> > seems quite high, but I guess it doesn't know that there's no regfile
> > transfer?
> Which -mcpu/-mtune are you using?  For generic it's 1 rather than 4
> (so that the vector cost is 9 rather than 12, although still
> higher than the scalar cost).

I was using neoverse-v1, which looks like it matches neoverse-n2 with a cost of
4, but neoverse-n1 has 6.  That really seems excessive.