https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460
            Bug ID: 91460
           Summary: gcc -mprefer-vector-width=256 is slower than
                    -mprefer-vector-width=128 for some loops
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skpgkp2 at gmail dot com
                CC: hjl.tools at gmail dot com
  Target Milestone: ---

     1  static inline void pixel_avg( uint8_t *dst,  int i_dst_stride,
     2                                uint8_t *src1, int i_src1_stride,
     3                                uint8_t *src2, int i_src2_stride,
     4                                int i_width, int i_height )
     5  {
     6      for( int y = 0; y < i_height; y++ )
     7      {
     8          for( int x = 0; x < i_width; x++ )
     9              dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
    10          dst  += i_dst_stride;
    11          src1 += i_src1_stride;
    12          src2 += i_src2_stride;
    13      }
    14  }

If the above code is in a hot loop and the value of i_width is between 16 and
32, -mprefer-vector-width=128 can provide a ~6% performance improvement
compared to -mprefer-vector-width=256.

i_width must be at least 16 to trigger 128-bit vectorization at line 8.
i_width must be at least 32 to trigger 256-bit vectorization at line 8.
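A minimal sketch of a driver for trying the kernel above at widths on either side of the two thresholds. This is not part of the original report: the helper name pixel_avg_mismatches, the MAX_W/MAX_H buffer limits, the fill pattern, and the repeats parameter are all hypothetical, and the -O3 and -mavx2 flags mentioned below are an assumption about how such a comparison might be built.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical buffer limits for this sketch (not from the report). */
enum { MAX_W = 64, MAX_H = 16 };

/* Kernel from the report, with the listing's line numbers removed. */
static inline void pixel_avg( uint8_t *dst,  int i_dst_stride,
                              uint8_t *src1, int i_src1_stride,
                              uint8_t *src2, int i_src2_stride,
                              int i_width, int i_height )
{
    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
        dst  += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

/* Hypothetical driver: runs pixel_avg `repeats` times on a width x height
   block and returns how many output bytes differ from the rounded average
   of the two inputs. */
int pixel_avg_mismatches( int width, int height, int repeats )
{
    static uint8_t src1[MAX_W * MAX_H], src2[MAX_W * MAX_H], dst[MAX_W * MAX_H];

    assert( width > 0 && width <= MAX_W && height > 0 && height <= MAX_H );

    /* Arbitrary deterministic fill pattern (an assumption of this sketch). */
    for( int i = 0; i < MAX_W * MAX_H; i++ )
    {
        src1[i] = (uint8_t)( i * 7 );
        src2[i] = (uint8_t)( i * 13 + 5 );
        dst[i]  = 0;
    }

    /* Repeating the call makes the kernel dominate when the run is timed. */
    for( int r = 0; r < repeats; r++ )
        pixel_avg( dst, MAX_W, src1, MAX_W, src2, MAX_W, width, height );

    int mismatches = 0;
    for( int y = 0; y < height; y++ )
        for( int x = 0; x < width; x++ )
        {
            int i = y * MAX_W + x;
            int expected = ( src1[i] + src2[i] + 1 ) / 2;
            if( dst[i] != expected )
                mismatches++;
        }
    return mismatches;
}
```

One possible way to use it is to call pixel_avg_mismatches( 24, 8, N ) from a small main with a large N, build the program twice, for example with "gcc -O3 -mavx2 -mprefer-vector-width=128" and "gcc -O3 -mavx2 -mprefer-vector-width=256", and time both binaries. The width should reach the kernel as a runtime value (for instance read from argv), otherwise the compiler may specialize the loop for a constant width and the comparison would no longer exercise the thresholds described above.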