On Thu, Apr 3, 2014 at 10:11 PM, Eric Botcazou <ebotca...@adacore.com> wrote:
>> I find that the GCC function simplify_subreg fails to simplify the rtx
>> (subreg:SI (and:DI (reg/v:DI 115 [ a ]) (const_int 4294967295 [0xffffffff])) 4)
>> to zero during the fwprop1 pass, even though the high 32-bit part of
>> (a & 0xFFFFFFFF) is zero.  This leads to some unnecessary multiplications
>> for the high 32-bit part of the result of the AND operation.  The attached
>> patch tries to improve simplify_rtx to handle such a case.  Other targets
>> like x86 don't seem to have this issue because they generate different RTX
>> to handle 64-bit multiplication on a 32-bit machine.
>
> See http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00073.html for another try,
> which led to the simplification in combine.c:combine_simplify_rtx line 5448.
>
> Your variant is both more general, because it isn't restricted to the lowpart,
> and less general, because it is artificially restricted to AND.
>
> Some remarks:
>  - this needs to be restricted to non-paradoxical subregs,
>  - you need to test HWI_COMPUTABLE_MODE_P (innermode),
>  - you need to test !side_effects_p (op).
>
> I think we need to find a common ground between Jakub's patch and yours and
> put a single transformation in simplify_subreg.
>
> --
> Eric Botcazou
Thanks for your review.  Even without Jakub's patch, the combine pass can
simplify such a subreg to zero, but the live range of the destination
register causes the combine pass to undo all attempts.  Here is a detailed
explanation; please note that the AND operation
(and:DI (reg/v:DI 115 [ a ]) (const_int 4294967295 [0xffffffff])) from the
fwprop1 pass is turned into the ZERO_EXTEND operation
(zero_extend:DI (reg:SI 0 r0 [ a ])) in the combine pass.  The variable a
is a function argument and occupies registers r0 and r1.

Right before the try_combine function we have the following three
instructions:

(insn 8 4 10 2 (set (reg:DI 118 [ D.4091 ])
        (zero_extend:DI (reg:SI 0 r0 [ a ]))) 000.c:4 170 {zero_extendsidi2}
     (expr_list:REG_DEAD (reg:SI 0 r0)
        (nil)))

(insn 10 8 13 2 (set (reg:SI 121)
        (mult:SI (reg/v:SI 116 [ b ])
            (subreg:SI (reg:DI 118 [ D.4091 ]) 4))) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

(insn 13 10 14 2 (set (reg:DI 117 [ D.4091 ])
        (mult:DI (zero_extend:DI (subreg:SI (reg:DI 118 [ D.4091 ]) 0))
            (zero_extend:DI (reg/v:SI 116 [ b ])))) 000.c:4 60 {*umulsidi3_v6}
     (expr_list:REG_DEAD (reg:DI 118 [ D.4091 ])
        (expr_list:REG_DEAD (reg/v:SI 116 [ b ])
            (nil))))

Without any changes, current gcc can simplify insn 10 to:

(insn 10 8 13 2 (set (reg:SI 121)
        (const_int 0 [0])) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

This is what we wanted.  Unfortunately, the live range of register
(reg:DI 118) extends to insn 13, so combine is unable to remove insn 8.
The combine pass has to generate a PARALLEL rtx to handle such a case:

(parallel [
        (set (reg:SI 121)
            (const_int 0 [0]))
        (set (reg:DI 118 [ D.4091 ])
            (zero_extend:DI (reg:SI 0 r0 [ a ])))
    ])

There is no instruction pattern to match this parallel rtx, which causes
the combine pass to undo all attempts and roll back the simplification of
the subreg operand.  Without such live range extension, everything works
fine.  That's why this optimization can only happen in the fwprop1 pass.

I made a typo in my previous changelog: my code is indeed for the
simplify_subreg function.
I updated my patch per your suggestions to handle more general cases.
Please review again.  Thanks.

BR,
Terry
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 04af01e..1949edc 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -6099,6 +6099,22 @@ simplify_subreg (enum machine_mode outermode, rtx op,
       return CONST0_RTX (outermode);
     }
 
+  /* If we can make sure the required bits of subreg are zero
+     in the subreg operand, we can return a zero.  Here is an example:
+     (subreg:SI (and:DI (reg:DI X) (const_int 0xFFFFFFFF)) 4).  */
+  if (HWI_COMPUTABLE_MODE_P (innermode)
+      && SCALAR_INT_MODE_P (innermode)
+      && GET_MODE_SIZE (innermode) > GET_MODE_SIZE (outermode)
+      && !side_effects_p (op))
+    {
+      unsigned int bitpos = subreg_lsb_1 (outermode, innermode, byte);
+      unsigned HOST_WIDE_INT nzmask = nonzero_bits (op, innermode);
+      unsigned HOST_WIDE_INT smask = GET_MODE_MASK (outermode);
+
+      if (((smask << bitpos) & nzmask) == 0)
+	return CONST0_RTX (outermode);
+    }
+
   if (SCALAR_INT_MODE_P (outermode)
       && SCALAR_INT_MODE_P (innermode)
       && GET_MODE_PRECISION (outermode) < GET_MODE_PRECISION (innermode)