> target_insn_cost is used to prevent rpad optimization to be restored by 
> late_combine1, looks like it's not sufficient for size_cost.
> 
> 21804static int
> 21805ix86_insn_cost (rtx_insn *insn, bool speed)
> 21806{
> 21807  int insn_cost = 0;
> 21808  /* Add extra cost to avoid post_reload late_combine revert
> 21809     the optimization did in pass_rpad.  */
> 21810  if (reload_completed
> 21811      && ix86_rpad_gate ()
> 21812      && recog_memoized (insn) >= 0
> 21813      && get_attr_avx_partial_xmm_update (insn)
> 21814      == AVX_PARTIAL_XMM_UPDATE_TRUE)
> 21815    insn_cost += COSTS_N_INSNS (3);
> 21816
> 21817  return insn_cost + pattern_cost (PATTERN (insn), speed);
> 21818}

This is well-hidden.  I was looking for that in xi86_rtl_costs but did
not notice this one.

Indeed with the patch instruction seem to cost 8, so 3 is too low.
> 
> And we can use vec_series_lowpart_p instead of only check 1 element.

This sadly does not work.

vec_series_lowpart_p calls can_cange_mode_class on ALL_REGS and this
return false since x87 does not support subregs at all.   So it is
constant false on x86 targets.  This should be fixed too, but
incrementally.  I added FIXME comment for that and will check if we can
restrict the early exit only to modes that we can store in x87
registers.

I also had to increase cost to 1.  While the argument that these
lowparts are essentially equivalent to subregs that cost 0 is OK, we are
special casing subregs in many parts of middle-end and we don't handle
lowparts that well. Cost of 0 disables CSE and that in turn leads to
worse code as described in the comment.

I also noticed that costing of VEC_* does not recurse to parameters
which makes it cost i.e. memory operand same as register operand. Also
something I plan to track incrementally.

Bootstrapped/regtested x86_64-linux, comitted.

gcc/ChangeLog:

        PRPR target/119900
        * config/i386/i386.cc (ix86_can_change_mode_class): Add TODO
        comment.
        (ix86_rtx_costs): Make VEC_SELECT equivalent to SUBREG cost 1.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 2f840338138..98c319cbd05 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20973,7 +20973,11 @@ ix86_can_change_mode_class (machine_mode from, 
machine_mode to,
     return true;
 
   /* x87 registers can't do subreg at all, as all values are reformatted
-     to extended precision.  */
+     to extended precision.
+
+     ??? middle-end queries mode changes for ALL_REGS and this makes
+     vec_series_lowpart_p to always return false.  We probably should
+     restrict this to modes supported by i387 and check if it is enabled.  */
   if (MAYBE_FLOAT_CLASS_P (regclass))
     return false;
 
@@ -22751,13 +22755,41 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
outer_code_i, int opno,
        }
       return false;
 
-    case VEC_SELECT:
     case VEC_CONCAT:
       /* ??? Assume all of these vector manipulation patterns are
         recognizable.  In which case they all pretty much have the
-        same cost.  */
+        same cost.
+        ??? We should still recruse when computing cost.  */
      *total = cost->sse_op;
      return true;
+
+    case VEC_SELECT:
+     /* Special case extracting lower part from the vector.
+       This by itself needs to code and most of SSE/AVX instructions have
+       packed and single forms where the single form may be represented
+       by such VEC_SELECT.
+
+       Use cost 1 (despite the fact that functionally equivalent SUBREG has
+       cost 0).  Making VEC_SELECT completely free, for example instructs CSE
+       to forward propagate VEC_SELECT into
+
+          (set (reg eax) (reg src))
+
+       which then prevents fwprop and combining. See i.e.
+       gcc.target/i386/pr91103-1.c.
+
+       ??? rtvec_series_p test should be, for valid patterns, equivalent to
+       vec_series_lowpart_p but is not, since the latter calls
+       can_cange_mode_class on ALL_REGS and this return false since x87 does
+       not support subregs at all.  */
+     if (rtvec_series_p (XVEC (XEXP (x, 1), 0), 0))
+       *total = rtx_cost (XEXP (x, 0), GET_MODE (XEXP (x, 0)),
+                         outer_code, opno, speed) + 1;
+     else
+       /* ??? We should still recruse when computing cost.  */
+       *total = cost->sse_op;
+     return true;
+
     case VEC_DUPLICATE:
       *total = rtx_cost (XEXP (x, 0),
                         GET_MODE (XEXP (x, 0)),
@@ -22775,6 +22807,7 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
outer_code_i, int opno,
       if (TARGET_AVX512F && register_operand (mask, GET_MODE (mask)))
        *total = rtx_cost (XEXP (x, 0), mode, outer_code, opno, speed);
       else
+       /* ??? We should still recruse when computing cost.  */
        *total = cost->sse_op;
       return true;
 

Reply via email to