[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

linkw at gcc dot gnu.org Wed, 16 Sep 2020 03:04:46 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-09-16
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1

--- Comment #7 from Kewen Lin <linkw at gcc dot gnu.org> ---
Two questions come to mind; both need further digging:
  1) From the assembly of the scalar and vector code, I don't see any stores
into the temp array d (array diff in pixel_sub_wxh), but the cost model still
counts those stores. On Power, two vector stores cost 2 while 16 scalar stores
cost 16; it seems wrong to cost stores that turn out to be unneeded. Later, the
vector version needs 16 vector halfword extractions from these two halfword
vectors, while in the scalar version the values are already in GPRs, so the
vector version looks inefficient.
  2) On Power, the conversion from unsigned char to unsigned short is a nop
conversion, but it is still counted when we compute the scalar cost, which adds
32 in total to the scalar cost. Meanwhile, the conversion from unsigned short
to signed short should be counted but is not (I need to check why). The nop
conversion costing looks like something we can handle in the function
rs6000_adjust_vect_cost_per_stmt. I tried the generic function
tree_nop_conversion_p, but it only covers conversions with the same
mode/precision. I will look for something else.
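
For readers without the x264 source at hand, the code shape behind both
questions can be sketched roughly as below. This is a simplified illustration,
not the actual x264 code: the names sketch_pixel_sub_4x4 and sketch_row_sum are
hypothetical, the block size and strides are fixed at 4, and the explicit casts
only spell out the two conversions mentioned in 2). The temporary d plays the
role of the array whose stores are costed in 1).

```c
#include <stdint.h>

/* Hypothetical stand-in for the pattern discussed above: subtract two 4x4
   blocks of 8-bit pixels into a 16-bit temporary array. */
static void sketch_pixel_sub_4x4(int16_t diff[16],
                                 const uint8_t *pix1, int stride1,
                                 const uint8_t *pix2, int stride2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            /* unsigned char -> unsigned short: a zero extension. */
            uint16_t a = pix1[x];
            uint16_t b = pix2[x];
            /* unsigned short -> signed short: same precision, only the
               signedness changes (GCC wraps modulo 2^16 here). */
            diff[x + y * 4] = (int16_t)(uint16_t)(a - b);
        }
        pix1 += stride1;
        pix2 += stride2;
    }
}

/* Hypothetical consumer: d is written once and read back immediately, so
   after inlining the scalar values can stay in registers and the stores
   into d need not appear in the final assembly. */
static int sketch_row_sum(const uint8_t *pix1, const uint8_t *pix2, int row)
{
    int16_t d[16];
    sketch_pixel_sub_4x4(d, pix1, 4, pix2, 4);
    return d[row * 4 + 0] + d[row * 4 + 1] + d[row * 4 + 2] + d[row * 4 + 3];
}
```

Compiling something of this shape with -O2 -fdump-tree-slp-details (or
-fopt-info-vec) is one way to see which statements the vectorizer costs; the
exact output depends on the GCC version and target.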
