On Fri, Nov 24, 2023 at 05:41:02PM +0800, Kewen.Lin wrote:
> on 2023/11/20 16:56, Michael Meissner wrote:
> > On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
> >> I wouldn't expose the "fake" larger modes to the vectorizer but rather
> >> adjust m_suggested_unroll_factor (which you already do to some extent).
> > 
> > Thanks.  I figure I need to fix the shuffle bytes issue first and get a
> > clean test run (with the flag enabled by default) before delving into the
> > vectorization issues.
> > 
> > But testing has shown that, at least in the loop I was looking at, using
> > vector pair instructions (either through the built-ins I had previously
> > posted or with these patches) is still faster than unrolling the loop 4
> > times with vector types (i.e. auto vectorization), even if I turn off
> > unrolling completely for the vector pair case.  Note, of course, that the
> > margin is much smaller in this case.
> > 
> > vector double:           (a * b) + c, unroll 4         loop time: 0.55483
> > vector double:           (a * b) + c, unroll default   loop time: 0.55638
> > vector double:           (a * b) + c, unroll 0         loop time: 0.55686
> > vector double:           (a * b) + c, unroll 2         loop time: 0.55772
> > 
> > vector32, w/vector pair: (a * b) + c, unroll 4         loop time: 0.48257
> > vector32, w/vector pair: (a * b) + c, unroll 2         loop time: 0.50782
> > vector32, w/vector pair: (a * b) + c, unroll default   loop time: 0.50864
> > vector32, w/vector pair: (a * b) + c, unroll 0         loop time: 0.52224
> > 
> > Of course, these being micro-benchmarks, the results don't necessarily
> > translate to the behavior of actual code.
> > 
> > 
> 
> I noticed that Ajit posted a patch adding a new pass that replaces
> contiguous-address vector loads (lxv) with lxvp:
> 
> https://inbox.sourceware.org/gcc-patches/ef0c54a5-c35c-3519-f062-9ac78ee66...@linux.ibm.com/
> 
> How about making this kind of rs6000-specific pass pair both vector loads
> and stores?  Users can request more unrolling with parameters, and since
> the memory accesses from unrolling should be neat, I'd expect the pass
> could easily detect and pair the candidates.

Yes, I tend to think a combination of things will be needed.  In my tests with
a saxpy-type loop, I could not get the current built-ins for loading/storing
vector pairs to be fast enough.  The code Peter posted helped, but it was
still slower than using vector_size(32).  I will try out the patch and
compare it to my patches.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com
