https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257
--- Comment #3 from Richard Biener ---
Confirmed. We fail to elide the 'pixel' temporary, that is, express
  memcpy (&pixel, src_33, 6);
_1 = pixel.b;
_2 = pixel.g;
_3 = pixel.r;
in terms of loads from src. Then the backend intrinsic expanding to
a target builtin of course does not help things.
For the above we'd need SRA-like analysis; while VN could rematerialize the
loads from src, it lacks the global costing that would tell it that all uses
of pixel, and thus the memcpy, go away.
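A minimal C sketch of the pattern in question (the names here are mine, not
from the testcase): a packed three-channel pixel copied out of a byte stream
with memcpy, where ideally the temporary would be elided and the field uses
expressed as loads from src directly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reduction of the pattern: three 16-bit channels,
   6 bytes total, no padding.  */
struct rgb16 {
  uint16_t r, g, b;
};

static unsigned channel_sum(const unsigned char *src)
{
  struct rgb16 pixel;
  memcpy(&pixel, src, 6);  /* 6 == sizeof pixel */
  /* These field uses are what we would like to see turned into
     loads from src, making the temporary and the memcpy dead.  */
  return pixel.r + pixel.g + pixel.b;
}
```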
We can fold the memcpy to
  __MEM <char[6]> ((char * {ref-all})&pixel) = __MEM <char[6]> ((char * {ref-all})src_13);
with the following:
diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c
index 281839d4a73..750a5d884a7 100644
--- a/gcc/gimple-fold.c
+++ b/gcc/gimple-fold.c
@@ -1215,9 +1215,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
if (TREE_CODE (dest) == ADDR_EXPR
&& var_decl_component_p (TREE_OPERAND (dest, 0))
&& tree_int_cst_equal (TYPE_SIZE_UNIT (desttype), len)
- && dest_align >= TYPE_ALIGN (desttype)
- && (is_gimple_reg_type (desttype)
- || src_align >= TYPE_ALIGN (desttype)))
+ && dest_align >= TYPE_ALIGN (desttype))
destvar = fold_build2 (MEM_REF, desttype, dest, off0);
else if (TREE_CODE (src) == ADDR_EXPR
&& var_decl_component_p (TREE_OPERAND (src, 0))
and then end up with
  pixel$r_3 = MEM <short unsigned int> [(char * {ref-all})src_34];
  pixel$g_2 = MEM <short unsigned int> [(char * {ref-all})src_34 + 2B];
  pixel$b_1 = MEM <short unsigned int> [(char * {ref-all})src_34 + 4B];
val_2.0_19 = (short int) pixel$b_1;
val_1.1_20 = (short int) pixel$g_2;
val_0.2_21 = (short int) pixel$r_3;
_22 = {val_0.2_21, val_1.1_20, val_2.0_19, 0, 0, 0, 0, 0};
_23 = __builtin_ia32_vcvtph2ps (_22);
but that doesn't help in the end. It does help vectorizing the loop
when you avoid the intrinsic by doing
   for (unsigned x = 0; x < width; x += 1) {
      struct util_format_r16g16b16_float pixel;
      memcpy(&pixel, src, sizeof pixel);
      struct float3 r; /* = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b); */
      r.f1 = pixel.r;
      r.f2 = pixel.g;
      r.f3 = pixel.b;
      dst[0] = r.f1; /* r */
      dst[1] = r.f2; /* g */
      dst[2] = r.f3; /* b */
      dst[3] = 1; /* a */
      src += 6;
      dst += 4;
   }
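For reference, a self-contained version of that rewritten loop; the struct
layouts are my assumption, modeled on Mesa's util_format code, and may differ
from the original testcase.  Note that r.f1 = pixel.r converts the raw 16-bit
value numerically; the half-float decode is commented out, as above, so the
vectorizer only sees plain integer-to-float conversions.

```c
#include <stdint.h>
#include <string.h>

/* Assumed definitions, modeled on Mesa's util_format code.  */
struct util_format_r16g16b16_float {
  uint16_t r, g, b; /* raw half-float bits */
};
struct float3 {
  float f1, f2, f3;
};

static void unpack_row(float *dst, const uint8_t *src, unsigned width)
{
  for (unsigned x = 0; x < width; x += 1) {
    struct util_format_r16g16b16_float pixel;
    memcpy(&pixel, src, sizeof pixel); /* 6 bytes */
    struct float3 r; /* = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b); */
    r.f1 = pixel.r;
    r.f2 = pixel.g;
    r.f3 = pixel.b;
    dst[0] = r.f1; /* r */
    dst[1] = r.f2; /* g */
    dst[2] = r.f3; /* b */
    dst[3] = 1;    /* a */
    src += 6;
    dst += 4;
  }
}
```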
then we vectorize it as
  vect_pixel_r_25.30_283 = MEM <vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278];
  vect_pixel_r_25.31_285 = MEM <vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 16B];
  vect_pixel_r_25.32_287 = MEM <vector(8) short unsigned int> [(char * {ref-all})vectp_src.29_278 + 32B];
vect__8.33_288 = [vec_unpack_float_lo_expr] vect_pixel_r_25.30_283;
vect__8.33_289 = [vec_unpack_float_hi_expr] vect_pixel_r_25.30_283;
vect__8.33_290 = [vec_unpack_float_lo_expr] vect_pixel_r_25.31_285;
vect__8.33_291 = [vec_unpack_float_hi_expr] vect_pixel_r_25.31_285;
vect__8.33_292 = [vec_unpack_float_lo_expr] vect_pixel_r_25.32_287;
vect__8.33_293 = [vec_unpack_float_hi_expr] vect_pixel_r_25.32_287;
but then we somehow mess up the analysis of the stores and go for hybrid SLP
(we split the store group).  We could just leave the dst[3] stores
unvectorized ... but then we somehow decide to emit
vpmovzxwd (%rdx), %ymm11
...
vcvtdq2ps %ymm11, %ymm11
and thus use a different conversion path (not sure if it is worse in the
end, but ...).
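A source-level experiment along the lines of leaving the dst[3] stores out of
the vectorized group (not a fix for the SLP issue, just an illustration):
store the constant alpha in a separate pass so each iteration's r/g/b stores
form a uniform group of three.  Struct layout is my assumption, as above.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: three 16-bit channels, 6 bytes per pixel.  */
struct rgb16 { uint16_t r, g, b; };

static void unpack_row_split(float *dst, const uint8_t *src, unsigned width)
{
  /* First pass: only the three converted channel stores per pixel.  */
  for (unsigned x = 0; x < width; x += 1) {
    struct rgb16 pixel;
    memcpy(&pixel, src + 6u * x, sizeof pixel);
    dst[4 * x + 0] = pixel.r;
    dst[4 * x + 1] = pixel.g;
    dst[4 * x + 2] = pixel.b;
  }
  /* Second pass: the constant alpha stores, kept out of the group.  */
  for (unsigned x = 0; x < width; x += 1)
    dst[4 * x + 3] = 1; /* a */
}
```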