https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460
Bug ID: 91460
Summary: gcc -mprefer-vector-width=256 is slower than
-mprefer-vector-width=128 for some loops
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: skpgkp2 at gmail dot com
CC: hjl.tools at gmail dot com
Target Milestone: ---
 1 static inline void pixel_avg( uint8_t *dst, int i_dst_stride,
 2                               uint8_t *src1, int i_src1_stride,
 3                               uint8_t *src2, int i_src2_stride,
 4                               int i_width, int i_height )
 5 {
 6     for( int y = 0; y < i_height; y++ )
 7     {
 8         for( int x = 0; x < i_width; x++ )
 9             dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
10         dst += i_dst_stride;
11         src1 += i_src1_stride;
12         src2 += i_src2_stride;
13     }
14 }
Suppose the above code is in a hot loop.
If i_width is between 16 and 32, -mprefer-vector-width=128 can provide a
~6% performance improvement over -mprefer-vector-width=256.
i_width must be at least 16 (16 uint8_t elements fill one 128-bit vector) to
trigger 128-bit vectorization of the loop at line 8.
i_width must be at least 32 (32 uint8_t elements fill one 256-bit vector) to
trigger 256-bit vectorization of the loop at line 8.
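
A minimal driver for trying the comparison locally might look like the
sketch below. It is not the reporter's test case: the run_pixel_avg helper,
the MAX_W/MAX_H buffer bounds, the fill values and the PIXEL_AVG_DEMO_MAIN
guard are all assumptions made for illustration. It calls pixel_avg
repeatedly with a width of 24 (between 16 and 32) and returns a checksum so
the result can be verified.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { MAX_W = 64, MAX_H = 64 };   /* hypothetical buffer bounds */

/* Same function as in the report, without the listing's line numbers. */
static inline void pixel_avg( uint8_t *dst,  int i_dst_stride,
                              uint8_t *src1, int i_src1_stride,
                              uint8_t *src2, int i_src2_stride,
                              int i_width, int i_height )
{
    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
        dst  += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

/* Hypothetical driver: fills src1 with `a` and src2 with `b`, runs
 * pixel_avg `reps` times on a width x height block, and returns the sum
 * of all destination bytes. Returns 0 for out-of-range dimensions. */
static unsigned long run_pixel_avg( int width, int height, int reps,
                                    uint8_t a, uint8_t b )
{
    static uint8_t src1[MAX_W * MAX_H], src2[MAX_W * MAX_H], dst[MAX_W * MAX_H];

    if( width < 0 || width > MAX_W || height < 0 || height > MAX_H )
        return 0;

    memset( src1, a, sizeof src1 );
    memset( src2, b, sizeof src2 );
    memset( dst,  0, sizeof dst );

    for( int r = 0; r < reps; r++ )
    {
        pixel_avg( dst, MAX_W, src1, MAX_W, src2, MAX_W, width, height );
        /* GCC-style compiler barrier so repeated calls are not merged. */
        __asm__ volatile( "" ::: "memory" );
    }

    unsigned long sum = 0;
    for( int i = 0; i < MAX_W * MAX_H; i++ )
        sum += dst[i];
    return sum;
}

#ifdef PIXEL_AVG_DEMO_MAIN
int main( void )
{
    /* (10 + 13 + 1) >> 1 == 12 per pixel, 24 * 16 pixels written. */
    unsigned long sum = run_pixel_avg( 24, 16, 1000000, 10, 13 );
    assert( sum == 24UL * 16UL * 12UL );
    printf( "checksum: %lu\n", sum );
    return 0;
}
#endif
```

One way to build the two variants for timing (the -O3 and -march=haswell
choices and the file name bench.c are assumptions, the report does not state
the flags or target used):

    gcc -O3 -march=haswell -mprefer-vector-width=128 -DPIXEL_AVG_DEMO_MAIN bench.c -o avg128
    gcc -O3 -march=haswell -mprefer-vector-width=256 -DPIXEL_AVG_DEMO_MAIN bench.c -o avg256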