On Fri, Nov 24, 2023 at 05:41:02PM +0800, Kewen.Lin wrote:
> on 2023/11/20 16:56, Michael Meissner wrote:
> > On Mon, Nov 20, 2023 at 08:24:35AM +0100, Richard Biener wrote:
> >> I wouldn't expose the "fake" larger modes to the vectorizer but rather
> >> adjust m_suggested_unroll_factor (which you already do to some extent).
> >
> > Thanks.  I figure I need to fix the shuffle bytes issue first and get a
> > clean test run (with the flag enabled by default), before delving into
> > the vectorization issues.
> >
> > But testing has shown that, at least in the loop I was looking at, using
> > vector pair instructions (either through the built-ins I had previously
> > posted or with these patches) is still faster than unrolling the loop 4
> > times using vector types (or auto vectorization), even if I turn off
> > unrolling completely for the vector pair case.  Note, of course, the
> > margin is much smaller in this case.
> >
> > vector double: (a * b) + c, unroll 4        loop time: 0.55483
> > vector double: (a * b) + c, unroll default  loop time: 0.55638
> > vector double: (a * b) + c, unroll 0        loop time: 0.55686
> > vector double: (a * b) + c, unroll 2        loop time: 0.55772
> >
> > vector32, w/vector pair: (a * b) + c, unroll 4        loop time: 0.48257
> > vector32, w/vector pair: (a * b) + c, unroll 2        loop time: 0.50782
> > vector32, w/vector pair: (a * b) + c, unroll default  loop time: 0.50864
> > vector32, w/vector pair: (a * b) + c, unroll 0        loop time: 0.52224
> >
> > Of course these being micro-benchmarks, it doesn't mean that this
> > translates to the behavior on actual code.
>
> I noticed that Ajit posted a patch for adding one new pass to replace
> contiguous-address vector loads (lxv) with lxvp:
>
> https://inbox.sourceware.org/gcc-patches/ef0c54a5-c35c-3519-f062-9ac78ee66...@linux.ibm.com/
>
> How about making this kind of rs6000-specific pass pair both vector loads
> and stores?
> Users can request more unrolling with parameters, and the memory accesses
> produced by unrolling should be regular, so I'd expect the pass could
> easily detect and pair the candidates.
Yes, I tend to think a combination of things will be needed.  In my tests
with a saxpy-type loop, I could not get the current built-ins that
load/store vector pairs to be fast enough.  Peter's code that he posted
helped, but ultimately it was still slower than adding vector_size(32).
I will try out the patch and compare it to my patches.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com