[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #6 from Hongtao.liu  ---
> const __m128i in = _mm_setr_epi16(val_0, val_1, val_2, 0, 0, 0, 0, 0);

In ix86_expand_vector_init, we can generate asm like:

  vmovd val_0, %xmm0
  pinsrw $1, val_1, %xmm0
  pinsrw $2, val_2, %xmm0

and let the RTL optimizers "merge" val_0 and val_1, since they come from
contiguous memory.
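
As an illustration only (not the expander's code; build_vec and p are made-up
names, with p pointing at the packed source data): once val_0 and val_1 are
recognised as adjacent 16-bit loads, the whole init boils down to one 32-bit
load plus a single insert, e.g.

  #include <immintrin.h>
  #include <stdint.h>
  #include <string.h>

  /* Build {val_0, val_1, val_2, 0, 0, 0, 0, 0} from memory at p.  */
  static __m128i
  build_vec (const uint16_t *p)
  {
    int lo;
    memcpy (&lo, p, 4);                    /* val_0 and val_1 in one 32-bit load */
    __m128i v = _mm_cvtsi32_si128 (lo);    /* vmovd, upper 96 bits zeroed */
    return _mm_insert_epi16 (v, p[2], 2);  /* pinsrw $2 for val_2 */
  }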

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #5 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Hongtao.liu from comment #2)
> > for vec_init, if higher part is zero, we can use vmovd/vmovq instead of
> > vector concat.
> 
> That is related to PR 94680 if not the same.

Yes, but in a different phase.

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #4 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #2)
> for vec_init, if higher part is zero, we can use vmovd/vmovq instead of
> vector concat.

That is related to PR 94680 if not the same.

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #3 from Richard Biener  ---
Confirmed.  We fail to elide the 'pixel' temporary, that is, express

  memcpy (&pixel, src_33, 6);
  _1 = pixel.b;
  _2 = pixel.g;
  _3 = pixel.r;

in terms of loads from src.  Then the intrinsic, which expands to a target
builtin in the backend, of course does not help things.

For the above we'd need SRA-like analysis; while VN could rematerialize the
loads from src, it lacks the global costing that would tell it that all uses
of pixel, and thus the memcpy, go away.
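
In source terms the transform amounts to something like the following
(a hand-written sketch, not what any pass emits; load_u16 is a made-up helper):

  #include <stdint.h>
  #include <string.h>

  /* One 2-byte load from the packed source, replacing a read of pixel.X.  */
  static uint16_t
  load_u16 (const uint8_t *p)
  {
    uint16_t v;
    memcpy (&v, p, sizeof v);
    return v;
  }

  /* instead of:  memcpy (&pixel, src, 6);  ... pixel.r, pixel.g, pixel.b
     we want:     load_u16 (src), load_u16 (src + 2), load_u16 (src + 4)  */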

We can fold the memcpy to

  __MEM ((char * {ref-all})&pixel) = __MEM ((char * {ref-all})src_13);

with the following:

diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c
index 281839d4a73..750a5d884a7 100644
--- a/gcc/gimple-fold.c
+++ b/gcc/gimple-fold.c
@@ -1215,9 +1215,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
   if (TREE_CODE (dest) == ADDR_EXPR
  && var_decl_component_p (TREE_OPERAND (dest, 0))
  && tree_int_cst_equal (TYPE_SIZE_UNIT (desttype), len)
- && dest_align >= TYPE_ALIGN (desttype)
- && (is_gimple_reg_type (desttype)
- || src_align >= TYPE_ALIGN (desttype)))
+ && dest_align >= TYPE_ALIGN (desttype))
destvar = fold_build2 (MEM_REF, desttype, dest, off0);
   else if (TREE_CODE (src) == ADDR_EXPR
   && var_decl_component_p (TREE_OPERAND (src, 0))

and then end up with

  pixel$r_3 = MEM [(char * {ref-all})src_34];
  pixel$g_2 = MEM [(char * {ref-all})src_34 + 2B];
  pixel$b_1 = MEM [(char * {ref-all})src_34 + 4B];
  val_2.0_19 = (short int) pixel$b_1;
  val_1.1_20 = (short int) pixel$g_2;
  val_0.2_21 = (short int) pixel$r_3;
  _22 = {val_0.2_21, val_1.1_20, val_2.0_19, 0, 0, 0, 0, 0};
  _23 = __builtin_ia32_vcvtph2ps (_22);

but that doesn't help in the end.  It does help vectorizing the loop
when you avoid the intrinsic by doing

   for (unsigned x = 0; x < width; x += 1) {
      const struct util_format_r16g16b16_float pixel;
      memcpy(&pixel, src, sizeof pixel);

      struct float3 r; // = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b);
      r.f1 = pixel.r;
      r.f2 = pixel.g;
      r.f3 = pixel.b;
      dst[0] = r.f1; /* r */
      dst[1] = r.f2; /* g */
      dst[2] = r.f3; /* b */
      dst[3] = 1; /* a */

      src += 6;
      dst += 4;
   }

then we vectorize it as

  vect_pixel_r_25.30_283 = MEM [(char * {ref-all})vectp_src.29_278];
  vect_pixel_r_25.31_285 = MEM [(char * {ref-all})vectp_src.29_278 + 16B];
  vect_pixel_r_25.32_287 = MEM [(char * {ref-all})vectp_src.29_278 + 32B];
  vect__8.33_288 = [vec_unpack_float_lo_expr] vect_pixel_r_25.30_283;
  vect__8.33_289 = [vec_unpack_float_hi_expr] vect_pixel_r_25.30_283;
  vect__8.33_290 = [vec_unpack_float_lo_expr] vect_pixel_r_25.31_285;
  vect__8.33_291 = [vec_unpack_float_hi_expr] vect_pixel_r_25.31_285;
  vect__8.33_292 = [vec_unpack_float_lo_expr] vect_pixel_r_25.32_287;
  vect__8.33_293 = [vec_unpack_float_hi_expr] vect_pixel_r_25.32_287;

but then we somehow mess up the analysis of the stores, going for hybrid SLP
(we split the store group).  We could just leave the dst[3] stores
unvectorized ... but then we somehow decide to emit

vpmovzxwd   (%rdx), %ymm11
...
vcvtdq2ps   %ymm11, %ymm11

and thus use a different conversion path (not sure if it is worse in the
end, but ...).
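
For reference, the two conversion routes in play look roughly like this at the
intrinsics level (an illustrative sketch assuming F16C and AVX2; function names
are made up, this is not code from the testcase):

  #include <immintrin.h>

  /* F16C route (vcvtph2ps): the 16-bit lanes are half-precision floats.  */
  static __m256
  cvt_half_to_float (__m128i v)
  {
    return _mm256_cvtph_ps (v);
  }

  /* Integer route (vpmovzxwd + vcvtdq2ps): zero-extend, then int -> float.  */
  static __m256
  cvt_u16_to_float (__m128i v)
  {
    return _mm256_cvtepi32_ps (_mm256_cvtepu16_epi32 (v));
  }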

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

Hongtao.liu  changed:

                What|Removed                |Added
--------------------------------------------------------------------------
                  CC|                       |crazylht at gmail dot com

--- Comment #2 from Hongtao.liu  ---
For vec_init, if the higher part is zero, we can use vmovd/vmovq instead of a
vector concat.
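
As an illustration of the intended shape (a sketch, not a patch; the function
name is made up): when the upper 64 bits are known to be zero, a plain
zero-extending 64-bit load already yields the full 128-bit value, with no
concat against a zero vector needed:

  #include <immintrin.h>

  /* vmovq: loads the low 64 bits and zeroes the upper 64 bits.  */
  static __m128i
  load_low64_zero_high (const void *p)
  {
    return _mm_loadl_epi64 ((const __m128i *) p);
  }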

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

Andrew Pinski  changed:

                What|Removed                |Added
--------------------------------------------------------------------------
            Severity|normal                 |enhancement

[Bug target/100257] poor codegen with vcvtph2ps / stride of 6

2021-04-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

--- Comment #1 from Andrew Pinski  ---
Looks like a few missed optimizations at the tree level (and a target issue
with the store):
  memcpy (&pixel, src_33, 6);
  _1 = pixel.b;
  _2 = pixel.g;
  _3 = pixel.r;
  val_2.0_21 = (short int) _1;
  val_1.1_22 = (short int) _2;
  val_0.2_23 = (short int) _3;
  _24 = {val_0.2_23, val_1.1_22, val_2.0_21, 0, 0, 0, 0, 0};
  _25 = __builtin_ia32_vcvtph2ps (_24);
  _14 = BIT_FIELD_REF <_25, 64, 0>;
  _28 = BIT_FIELD_REF <_25, 32, 64>;
  MEM [(float *)dst_34] = _14;
  MEM[(float *)dst_34 + 8B] = _28;
  MEM[(float *)dst_34 + 12B] = 1.0e+0;


The store issue is now PR 100258.
This is more about the missed optimization of the first part, the conversion.