https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038
Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
The vectorizer saw 2 scalar loads + 2 bit_ops + 2 scalar stores vs 1
unaligned_load + 1 bit_op + 1 unaligned_store, so scaling only the cost of
bit_op doesn't help.

At the RTL level, we have

(note 3 14 4 2 NOTE_INSN_DELETED)
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 (set (reg:V2QI 87 [ vect__20.19 ])
        (mem:V2QI (reg:DI 91) [0 MEM <const vector(2) unsigned char> [(const uint8_t *)b_11(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:DI 91)
        (nil)))
(insn 8 7 9 2 (set (reg:V2QI 88 [ vect__18.16 ])
        (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (nil)))
(insn 9 8 10 2 (parallel [
            (set (reg:V2QI 89 [ vect__21.20 ])
                (xor:V2QI (reg:V2QI 87 [ vect__20.19 ])
                    (reg:V2QI 88 [ vect__18.16 ])))
            (clobber (reg:CC 17 flags))
        ]) "test.c":31:1 1627 {xorv2qi3}
     (expr_list:REG_DEAD (reg:V2QI 88 [ vect__18.16 ])
        (expr_list:REG_DEAD (reg:V2QI 87 [ vect__20.19 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
                    (nil))))))
(insn 10 9 0 2 (set (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (reg:V2QI 89 [ vect__21.20 ])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:V2QI 89 [ vect__21.20 ])

If RA can allocate 87/88/89 into GPRs, it would be the same as the
non-vectorized version.
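
For illustration only, here is a rough C sketch of the two shapes the cost comparison above is weighing. The actual test.c is not quoted in this comment, so the function names and bodies below (xor2_scalar, xor2_gpr, xor2_check) are hypothetical stand-ins, not the reporter's code. The second function spells out by hand what the V2QI sequence would amount to if pseudos 87/88/89 ended up in general-purpose registers: one 2-byte load per operand, one integer XOR, one 2-byte store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar form: two byte loads from each array, two XORs,
   two byte stores.  This is the "2 scalar loads + 2 bit_ops + 2 scalar
   stores" side of the comparison. */
void xor2_scalar(uint8_t *a, const uint8_t *b)
{
    a[0] ^= b[0];
    a[1] ^= b[1];
}

/* Hand-written equivalent of the V2QI form held in a GPR: each 2-byte
   group is treated as one 16-bit integer.  memcpy expresses the
   unaligned (A8) accesses without violating aliasing rules. */
void xor2_gpr(uint8_t *a, const uint8_t *b)
{
    uint16_t va, vb;

    memcpy(&va, a, sizeof va);   /* 1 unaligned load from a */
    memcpy(&vb, b, sizeof vb);   /* 1 unaligned load from b */
    va ^= vb;                    /* 1 bit_op on both bytes at once */
    memcpy(a, &va, sizeof va);   /* 1 unaligned store to a */
}

/* Self-check: XOR acts bytewise, so both forms must agree regardless
   of endianness. */
void xor2_check(void)
{
    const uint8_t b[2] = { 0xff, 0x0f };
    uint8_t x[2] = { 0x12, 0x34 };
    uint8_t y[2] = { 0x12, 0x34 };

    xor2_scalar(x, b);
    xor2_gpr(y, b);
    assert(memcmp(x, y, sizeof x) == 0);
}
```

Whether the compiler actually keeps the 2-byte vector in a GPR or moves it through an xmm register is decided by the register allocator, which is the point the comment makes; the sketch only shows that the two forms compute the same bytes.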