On Thu, Apr 3, 2014 at 10:11 PM, Eric Botcazou <ebotca...@adacore.com> wrote:
>> I find that the GCC function simplify_subreg fails to simplify the rtx
>> (subreg:SI (and:DI (reg/v:DI 115 [ a ]) (const_int 4294967295 [0xffffffff])) 4)
>> to zero during the fwprop1 pass, even though the high 32-bit part of
>> (a & 0xFFFFFFFF) is known to be zero. This leads to unnecessary
>> multiplications for the high 32-bit part of the result of the AND
>> operation. The attached patch improves simplify_rtx to handle such cases.
>> Other targets such as x86 do not seem to have this issue because they
>> generate different RTX to handle 64-bit multiplication on a 32-bit machine.
>
> See http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00073.html for another try,
> which led to the simplification in combine.c:combine_simplify_rtx line 5448.
>
> Your variant is both more general, because it isn't restricted to the lowpart,
> and less general, because it is artificially restricted to AND.
>
> Some remarks:
>  - this needs to be restricted to non-paradoxical subregs,
>  - you need to test HWI_COMPUTABLE_MODE_P (innermode),
>  - you need to test !side_effects_p (op).
>
> I think we need to find a common ground between Jakub's patch and yours and
> put a single transformation in simplify_subreg.
>
> --
> Eric Botcazou

Thanks for your review. Even without Jakub's patch, the combine pass
can simplify such a subreg to zero, but the live range of the destination
register causes the combine pass to undo all attempts. Here is a detailed
explanation; please note that the AND operation (and:DI (reg/v:DI 115
[ a ]) (const_int 4294967295 [0xffffffff])) from the fwprop1 pass is
turned into a ZERO_EXTEND operation (zero_extend:DI (reg:SI 0 r0 [ a ]))
in the combine pass. The variable a is a function argument and occupies
registers r0 and r1. Right before the try_combine function we have the
three instructions below:

(insn 8 4 10 2 (set (reg:DI 118 [ D.4091 ])
        (zero_extend:DI (reg:SI 0 r0 [ a ]))) 000.c:4 170 {zero_extendsidi2}
     (expr_list:REG_DEAD (reg:SI 0 r0)
        (nil)))

(insn 10 8 13 2 (set (reg:SI 121)
        (mult:SI (reg/v:SI 116 [ b ])
            (subreg:SI (reg:DI 118 [ D.4091 ]) 4))) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

(insn 13 10 14 2 (set (reg:DI 117 [ D.4091 ])
        (mult:DI (zero_extend:DI (subreg:SI (reg:DI 118 [ D.4091 ]) 0))
            (zero_extend:DI (reg/v:SI 116 [ b ])))) 000.c:4 60 {*umulsidi3_v6}
     (expr_list:REG_DEAD (reg:DI 118 [ D.4091 ])
        (expr_list:REG_DEAD (reg/v:SI 116 [ b ])
            (nil))))
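
For reference, a test case along these lines reproduces the RTL above when
compiled for 32-bit ARM (this is a hypothetical reconstruction from the
dumps; the original 000.c is not shown):

/* Hypothetical test case reconstructed from the RTL dumps above.
   The masked 64-bit multiplication decomposes into a umulsidi3 for
   the low parts plus an SImode multiply for the high-part term,
   which is the multiply that should fold to zero.  */
unsigned long long
f (unsigned long long a, unsigned int b)
{
  return (a & 0xFFFFFFFFULL) * b;
}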

Without any changes, current GCC can simplify insn 10 to

(insn 10 8 13 2 (set (reg:SI 121)
        (const_int 0 [0])) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

This is what we want. Unfortunately, the live range of register
(reg:DI 118) extends to insn 13, so insn 8 cannot be removed. The
combine pass has to generate a PARALLEL rtx to handle such a case:

(parallel [
        (set (reg:SI 121)
            (const_int 0 [0]))
        (set (reg:DI 118 [ D.4091 ])
            (zero_extend:DI (reg:SI 0 r0 [ a ])))
    ])

There is no instruction pattern to match this PARALLEL rtx. This causes
the combine pass to undo all attempts and roll back the simplification of
the subreg operand. Without such a live range extension, everything works
fine. That's why this optimization can only happen in the fwprop1 pass.

I made a typo in my previous ChangeLog: my code is indeed for the
simplify_subreg function. I have updated my patch per your suggestions to
handle more general cases. Please review again. Thanks.

BR,
Terry
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 04af01e..1949edc 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -6099,6 +6099,22 @@ simplify_subreg (enum machine_mode outermode, rtx op,
        return CONST0_RTX (outermode);
     }
 
+  /* If the bits of the subreg selected by the byte offset are known
+     to be zero in the subreg operand, we can return zero.  Example:
+     (subreg:SI (and:DI (reg:DI X) (const_int 0xFFFFFFFF)) 4).  */
+  if (HWI_COMPUTABLE_MODE_P (innermode)
+      && SCALAR_INT_MODE_P (innermode)
+      && GET_MODE_SIZE (innermode) > GET_MODE_SIZE (outermode)
+      && !side_effects_p (op))
+    {
+      unsigned int bitpos = subreg_lsb_1 (outermode, innermode, byte);
+      unsigned HOST_WIDE_INT nzmask = nonzero_bits (op, innermode);
+      unsigned HOST_WIDE_INT smask = GET_MODE_MASK (outermode);
+
+      if (((smask << bitpos) & nzmask) == 0)
+       return CONST0_RTX (outermode);
+    }
+
   if (SCALAR_INT_MODE_P (outermode)
       && SCALAR_INT_MODE_P (innermode)
       && GET_MODE_PRECISION (outermode) < GET_MODE_PRECISION (innermode)

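To illustrate how the new check fires on the motivating example, here is a
standalone sketch of the arithmetic (values are assumed for outermode =
SImode, innermode = DImode, byte = 4 on little-endian ARM; this is not part
of the patch):

#include <stdio.h>

int
main (void)
{
  /* nonzero_bits proved that only the low 32 bits of OP can be nonzero.  */
  unsigned long long nzmask = 0xFFFFFFFFULL;
  /* GET_MODE_MASK (SImode).  */
  unsigned long long smask = 0xFFFFFFFFULL;
  /* subreg_lsb_1 (SImode, DImode, 4): the subreg reads bits 32..63.  */
  unsigned int bitpos = 32;

  if (((smask << bitpos) & nzmask) == 0)
    printf ("the subreg is known to be zero\n");
  return 0;
}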