https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-09-16
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
Two questions in mind, need to dig into them further:

1) From the assembly of the scalar and vector code, I don't see any stores
into the temp array d (array diff in pixel_sub_wxh), but when modeling we do
count those stores. On Power, the two vector stores cost 2 while the 16
scalar stores cost 16, so it seems wrong for the cost model to count
something that is useless. Later on, the vector version needs 16 vector
halfword extractions from these two halfword vectors, while in the scalar
version the values are already in GPRs, so the vector version looks
inefficient.

2) On Power, the conversion from unsigned char to unsigned short is a nop
conversion, but it is counted when we compute the scalar cost, which adds 32
in total to the scalar cost. Meanwhile, the conversion from unsigned short to
signed short should be counted but is not (need to check why). The nop
conversion costing looks like something we can handle in the function
rs6000_adjust_vect_cost_per_stmt. I tried the generic function
tree_nop_conversion_p, but it only covers conversions with the same
mode/precision. Will look for something else.
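
For reference, a minimal C sketch of the pattern the two questions are about. This is not the actual source under discussion: the names pixel_sub_4x4 and sum_diff_4x4, the fixed 4x4 size, the stride parameters and the consumer loop are all assumptions made for illustration. It only shows a temp array d that is filled and read back right away, and the unsigned char -> unsigned short -> signed short conversion chain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for pixel_sub_wxh; the fixed 4x4 size, the
   names and the parameter list are assumptions, not the real code. */
static void pixel_sub_4x4(int16_t *diff,
                          const uint8_t *pix1, int stride1,
                          const uint8_t *pix2, int stride2)
{
    assert(diff && pix1 && pix2);
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            /* unsigned char -> unsigned short: a zero extension. */
            uint16_t a = pix1[x];
            uint16_t b = pix2[x];
            /* The subtraction wraps modulo 2^16 when narrowed to
               unsigned short; the final unsigned short -> signed short
               step is implementation-defined in ISO C, and GCC
               documents it as a modulo reduction. */
            diff[y * 4 + x] = (int16_t)(uint16_t)(a - b);
        }
        pix1 += stride1;
        pix2 += stride2;
    }
}

/* Hypothetical consumer: d is written and then read back immediately,
   so after inlining a compiler may keep all 16 values in registers
   and emit no stores to d at all. */
static int sum_diff_4x4(const uint8_t *pix1, int stride1,
                        const uint8_t *pix2, int stride2)
{
    int16_t d[16];
    int sum = 0;

    pixel_sub_4x4(d, pix1, stride1, pix2, stride2);
    for (int i = 0; i < 16; i++)
        sum += d[i];
    return sum;
}
```

Whether the stores to d really disappear depends on the compiler, options and target; the sketch only illustrates why costing 16 scalar stores against 2 vector stores may not match the generated code.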