[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #11 from pinskia at gcc dot gnu dot org 2007-06-18 05:28 ---
The trunk no longer produces a loop, so this has been fixed unless you can
get a testcase where we still produce worse code.

--
pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|4.0.3                       |4.0.3 4.3.0
            Summary|[4.1/4.2/4.3 Regression] -  |[4.1/4.2 Regression] -ftree-
                   |ftree-ch generates worse    |ch generates worse code
                   |code                        |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #10 from mmitchel at gcc dot gnu dot org 2006-05-25 02:34 ---
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.

--
mmitchel at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.1.1                       |4.1.2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #9 from pinskia at gcc dot gnu dot org 2006-05-04 21:25 ---
(In reply to comment #8)
> WRT this code generated by tree-ch:
>   D.1305_41 = Int_Loc_3 + 1;
>   if (Int_Loc_3 <= D.1305_41) goto <L0>; else goto <L2>;
> AFAICT there's exactly one value for which the comparison can be false,
> IMO it would be better to test directly that value instead of generating
> a new SSA name and another expression.

Well, CH should not do this, as it never sees two expressions together, only
the one COND_EXPR. If we do a VRP after CH, it will not fix it currently
either, because VRP does not record that many symbolic ranges (I forgot that
PR number; it was filed by me). If VRP did that and we added a VRP after CH
but before IV-OPTS, maybe this will fix itself.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #5 from dann at godzilla dot ics dot uci dot edu 2006-05-03 18:54 ---
IMO Comment #4 does not look close enough at what is actually happening.
IMO tree-ch is the root cause here.

The code looks like this before .ch:

  goto <bb 2> (<L1>);

<L0>:;
  D.1301_54 = Int_Loc.0_4 * 200;
  D.1302_55 = (int[50] *) D.1301_54;
  D.1303_56 = Arr_2_Par_Ref_30 + D.1302_55;
  (*D.1303_56)[Int_Index_1] = Int_Loc_3;
  Int_Index_58 = Int_Index_1 + 1;

  # Int_Index_1 = PHI <Int_Loc_3(0), Int_Index_58(1)>;
<L1>:;
  D.1305_26 = Int_Loc_3 + 1;
  if (Int_Index_1 <= D.1305_26) goto <L0>; else goto <L2>;

<L2>:;

after .ch it looks like this:

  D.1305_41 = Int_Loc_3 + 1;
  if (Int_Loc_3 <= D.1305_41) goto <L0>; else goto <L2>;
      -- this just complicates the CFG. Look below to see what are the
         effects of doing this in later passes. Plus just look at the
         comparison ...

  # Int_Index_37 = PHI <Int_Index_58(1), Int_Loc_3(0)>;
<L0>:;
  D.1301_54 = Int_Loc.0_4 * 200;
  D.1302_55 = (int[50] *) D.1301_54;
  D.1303_56 = Arr_2_Par_Ref_30 + D.1302_55;
  (*D.1303_56)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  D.1305_26 = Int_Loc_3 + 1;
  if (D.1305_26 >= Int_Index_58) goto <L0>; else goto <L2>;

<L2>:;

Given the above CFG, critical edge splitting transforms this into:

  D.1305_41 = Int_Loc_3 + 1;
  if (Int_Loc_3 <= D.1305_41) goto <L6>; else goto <L7>;

<L7>:;
  goto <bb 2> (<L2>);

<L6>:;

  # Int_Index_37 = PHI <Int_Index_58(5), Int_Loc_3(3)>;
<L0>:;
  D.1301_54 = Int_Loc.0_4 * 200;
  D.1302_55 = (int[50] *) D.1301_54;
  D.1303_56 = Arr_2_Par_Ref_30 + D.1302_55;
  (*D.1303_56)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1305_41 >= Int_Index_58) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 1> (<L0>);

<L9>:;

<L2>:;

Given the above CFG PRE will dutifully fill with code a lot of the empty
basic blocks:

after pre

  D.1305_41 = Int_Loc_3 + 1;
  if (Int_Loc_3 <= D.1305_41) goto <L6>; else goto <L7>;

<L7>:;
  pretmp.34_45 = Int_Loc.0_4 * 200;
  pretmp.36_57 = (int[50] *) pretmp.34_45;
  pretmp.38_25 = Arr_2_Par_Ref_30 + pretmp.36_57;
  goto <bb 2> (<L2>);

<L6>:;
  pretmp.30_26 = Int_Loc.0_4 * 200;
  pretmp.31_19 = (int[50] *) pretmp.30_26;
  pretmp.32_1 = pretmp.31_19 + Arr_2_Par_Ref_30;

  # Int_Index_37 = PHI <Int_Index_58(5), Int_Loc_3(3)>;
<L0>:;
  D.1301_54 = pretmp.30_26;
  D.1302_55 = pretmp.31_19;
  D.1303_56 = pretmp.32_1;
  (*D.1303_56)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1305_41 >= Int_Index_58) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 1> (<L0>);

<L9>:;

  # prephitmp.39_23 = PHI <D.1303_56(6), pretmp.38_25(4)>;
  # prephitmp.37_53 = PHI <D.1302_55(6), pretmp.36_57(4)>;
  # prephitmp.35_49 = PHI <D.1301_54(6), pretmp.34_45(4)>;
<L2>:;

Now when using -fno-tree-ch before critical edge splitting the code looks
like this:

  goto <bb 2> (<L1>);

<L0>:;
  D.1301_54 = Int_Loc.0_4 * 200;
  D.1302_55 = (int[50] *) D.1301_54;
  D.1303_56 = Arr_2_Par_Ref_30 + D.1302_55;
  (*D.1303_56)[Int_Index_1] = Int_Loc_3;
  Int_Index_58 = Int_Index_1 + 1;

  # Int_Index_1 = PHI <Int_Loc_3(0), Int_Index_58(1)>;
<L1>:;
  D.1305_26 = Int_Loc_3 + 1;
  if (Int_Index_1 <= D.1305_26) goto <L0>; else goto <L2>;

<L2>:;

after crited it looks like this: (i.e. no change)

  goto <bb 2> (<L1>);

<L0>:;
  D.1301_54 = Int_Loc.0_4 * 200;
  D.1302_55 = (int[50] *) D.1301_54;
  D.1303_56 = Arr_2_Par_Ref_30 + D.1302_55;
  (*D.1303_56)[Int_Index_1] = Int_Loc_3;
  Int_Index_58 = Int_Index_1 + 1;

  # Int_Index_1 = PHI <Int_Loc_3(0), Int_Index_58(1)>;
<L1>:;
  D.1305_26 = Int_Loc_3 + 1;
  if (Int_Index_1 <= D.1305_26) goto <L0>; else goto <L2>;

<L2>:;

and after PRE

  goto <bb 2> (<L1>);

<L0>:;
  D.1301_54 = pretmp.31_49;
  D.1302_55 = pretmp.32_45;
  D.1303_56 = pretmp.33_41;
  (*D.1303_56)[Int_Index_1] = Int_Loc_3;
  Int_Index_58 = Int_Index_1 + 1;

  # Int_Index_1 = PHI <Int_Loc_3(0), Int_Index_58(1)>;
<L1>:;
  D.1305_26 = pretmp.30_19;
  if (Int_Index_1 <= D.1305_26) goto <L0>; else goto <L2>;

<L2>:;

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
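For readers less used to GIMPLE dumps, the before/after .ch shapes in the comment above can be sketched at source level. This is only an illustration under assumptions: the testcase itself is not shown in the thread, so the function names, the `COLS` constant (taken from the `(int[50] *)` casts) and the helper `fills_agree` are hypothetical, and the loop bounds are passed in as plain parameters.

```c
#define COLS 50  /* row length, assumed from the (int[50] *) casts in the dump */

/* Loop as written: the exit test sits in the header, before the body. */
void fill_plain(int arr[][COLS], int row, int lo, int hi, int value)
{
    for (int i = lo; i <= hi; ++i)
        arr[row][i] = value;
}

/* Shape after loop header copying: one copy of the test guards entry,
   the other closes a do-while body. */
void fill_header_copied(int arr[][COLS], int row, int lo, int hi, int value)
{
    if (lo <= hi) {
        int i = lo;
        do {
            arr[row][i] = value;
            ++i;
        } while (i <= hi);
    }
}

/* Hypothetical helper: returns 1 when both variants write the same cells. */
int fills_agree(int row, int lo, int hi)
{
    static int a[4][COLS], b[4][COLS];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < COLS; ++c)
            a[r][c] = b[r][c] = 0;
    fill_plain(a, row, lo, hi, 7);
    fill_header_copied(b, row, lo, hi, 7);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < COLS; ++c)
            if (a[r][c] != b[r][c])
                return 0;
    return 1;
}
```

The guard in `fill_header_copied` corresponds to the extra `Int_Loc_3 <= D.1305_41` test that appears after .ch, and the two branches out of it are the edges that critical edge splitting and PRE later fill in.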
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #6 from pinskia at physics dot uc dot edu 2006-05-03 19:00 ---
Subject: Re: [4.1/4.2 Regression] -ftree-ch generates worse code

> --- Comment #5 from dann at godzilla dot ics dot uci dot edu 2006-05-03 18:54 ---
> IMO Comment #4 does not look close enough at what is actually happening.
> IMO tree-ch is the root cause here.
> Given the above CFG, critical edge splitting transforms this into:
> Given the above CFG PRE will dutifully fill with code a lot of the empty
> basic blocks:

None of the above issues is the real issue. TREE CH is doing the correct
thing by simplifying the loop. PRE is doing the correct thing by getting rid
of redundancies. The main issue is really the RA not being so good.

-- Pinski

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #7 from steven at gcc dot gnu dot org 2006-05-03 21:33 ---
Re. comment #5, user code could also have a CFG like that, so we should
handle this case properly (and we do, tree-ch is doing the right thing
afaict).

Re. comment #6, I don't see what the register allocator has to do with this
at all.

The bottom line is that for the case where we produce good code, IVOPTs
selects a simple addressing mode and produces a simple loop exit condition;
and for the complicated code, IVOPTs picks an addressing mode that requires
a lea and an extra register.

Look back at that loop for a moment. With tree-ch, ignoring dead code (the
sets to SSA names 5[456] are dead!), the .cunroll dump (i.e. just before
IVOPTs) looks like this:

  # Int_Index_37 = PHI <Int_Index_58(6), Int_Loc_3(4)>;
<L0>:;
  (*pretmp.28_49)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1563_41 >= Int_Index_58) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 5> (<L0>);

That looks rather nice to me. But just after IVOPTs (in the .ivopts dump) we
have turned that simple nice code into this mess:

  # ivtmp.38_26 = PHI <ivtmp.38_35(6), 0(4)>;
<L0>:;
  D.1622_34 = (int *) pretmp.28_49;
  D.1623_33 = (int *) Int_1_Par_Val_2;
  D.1624_22 = (int *) ivtmp.38_26;
  D.1625_21 = D.1623_33 + D.1624_22;
  MEM[base: D.1622_34, index: D.1625_21, step: 4B, offset: 20B] = Int_Loc_3;
  ivtmp.38_35 = ivtmp.38_26 + 1;
  D.1626_20 = (unsigned int) Int_1_Par_Val_2;
  D.1627_17 = D.1626_20 + ivtmp.38_35;
  D.1628_16 = D.1627_17 + 5;
  Int_Index_15 = (One_Fifty) D.1628_16;
  if (D.1563_41 >= Int_Index_15) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 5> (<L0>);

If this is caused by the register allocator, I'd like to know why you'd
think that. And if this is the doing of tree-ch, then I'd like to know what
you expect tree-ch to do instead. But as far as I can tell, this is just a
very poor choice by IVOPTs.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
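The two induction-variable choices contrasted in the comment above can be approximated in C. This is a rough sketch, not what IVOPTs emits: it assumes a 4-byte `int` (so `offset: 20B` is five elements), reads `Int_Loc = Int_1_Par_Val + 5` off the `+ 5` in the dump, and uses hypothetical names (`store_pointer_iv`, `store_counter_iv`, `variants_agree`).

```c
#define COLS 50  /* row length, assumed from the (int[50] *) casts in the dump */

/* Variant A: a pointer steps through the row and the original index is
   kept for the exit test (the shape shown for the good code). */
void store_pointer_iv(int *row, int par_val, int value)
{
    int index = par_val + 5;   /* Int_Loc, inferred from the "+ 5" in the dump */
    int limit = index + 1;
    int *p = row + par_val;    /* p[5] stands in for "offset: 20B" */
    do {
        p[5] = value;
        ++index;
        ++p;
    } while (limit >= index);
}

/* Variant B: a zero-based counter; both the address and the index used
   by the exit test are rebuilt from it on every iteration. */
void store_counter_iv(int *row, int par_val, int value)
{
    int limit = par_val + 5 + 1;
    unsigned int iv = 0;
    int index;
    do {
        row[par_val + (int)iv + 5] = value;
        ++iv;
        index = (int)((unsigned int)par_val + 5u + iv);
    } while (index <= limit);
}

/* Hypothetical helper: returns 1 when both variants store the same cells.
   Valid for 0 <= par_val and par_val + 6 < COLS. */
int variants_agree(int par_val)
{
    int a[COLS] = {0}, b[COLS] = {0};
    store_pointer_iv(a, par_val, 7);
    store_counter_iv(b, par_val, 7);
    for (int c = 0; c < COLS; ++c)
        if (a[c] != b[c])
            return 0;
    return a[par_val + 5] == 7 && a[par_val + 6] == 7;
}
```

Variant B needs an extra addition per iteration for the address and another for the exit test, which is roughly where the two `leal` instructions in the worse assembly come from.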
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #8 from dann at godzilla dot ics dot uci dot edu 2006-05-03 21:53 ---
WRT this code generated by tree-ch:

  D.1305_41 = Int_Loc_3 + 1;
  if (Int_Loc_3 <= D.1305_41) goto <L0>; else goto <L2>;

AFAICT there's exactly one value for which the comparison can be false, IMO
it would be better to test directly that value instead of generating a new
SSA name and another expression.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
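The "exactly one value" remark can be checked with a small C sketch. It assumes the addition wraps around at `INT_MAX`; the thread does not say which overflow semantics apply to the testcase, so the wraparound is written out explicitly to keep the C code itself free of signed overflow. All three function names are hypothetical.

```c
#include <limits.h>

/* Successor with explicit wraparound (an assumption made for this sketch). */
static int wrapping_succ(int x)
{
    return x == INT_MAX ? INT_MIN : x + 1;
}

/* Shape of the guard emitted after header copying: x <= x + 1. */
int guard_as_emitted(int x)
{
    int limit = wrapping_succ(x);
    return x <= limit;
}

/* The direct test suggested in the comment: compare against the single
   value for which the guard fails. */
int guard_direct(int x)
{
    return x != INT_MAX;
}
```

Under that assumption both guards agree for every input, and the second one needs neither the extra SSA name nor the addition.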
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #4 from steven at gcc dot gnu dot org 2006-05-02 17:38 ---
The inner loop in the .cunroll, .ivopts and .final_cleanup dumps with
GVN-PRE disabled looks like this:

  # Int_Index_37 = PHI <Int_Index_58(5), Int_Loc_3(3)>;
<L0>:;
  (*D.1561_56)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1563_41 >= Int_Index_58) goto <L8>; else goto <L2>;

<L8>:;
  goto <bb 4> (<L0>);

and

  # ivtmp.34_26 = PHI <ivtmp.34_19(5), ivtmp.34_1(3)>;
  # Int_Index_37 = PHI <Int_Index_58(5), Int_Loc_3(3)>;
<L0>:;
  D.1613_59 = (int *) ivtmp.34_26;
  MEM[base: D.1613_59, offset: 20B] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  ivtmp.34_19 = ivtmp.34_26 + 4B;
  if (D.1563_41 >= Int_Index_58) goto <L8>; else goto <L2>;

<L8>:;
  goto <bb 4> (<L0>);

and

<L0>:;
  MEM[base: (int *) ivtmp.34, offset: 20B] = Int_Loc;
  Int_Index = Int_Index + 1;
  ivtmp.34 = ivtmp.34 + 4B;
  if (D.1563 >= Int_Index) goto <L0>; else goto <L2>;

which compiles to:

.L4:
        addl    $1, %eax
        movl    %ecx, 20(%edx)
        addl    $4, %edx
        cmpl    %eax, %ebx
        jge     .L4

With PRE enabled, we get this:

  # Int_Index_37 = PHI <Int_Index_58(6), Int_Loc_3(4)>;
<L0>:;
  D.1559_54 = pretmp.27_59;
  D.1560_55 = pretmp.28_45;
  D.1561_56 = pretmp.28_49;
  (*pretmp.28_49)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1563_41 >= Int_Index_58) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 5> (<L0>);

and

  # ivtmp.38_26 = PHI <ivtmp.38_35(6), 0(4)>;
<L0>:;
  D.1559_54 = pretmp.27_59;
  D.1560_55 = pretmp.28_45;
  D.1561_56 = pretmp.28_49;
  D.1622_34 = (int *) pretmp.28_49;
  D.1623_33 = (int *) Int_1_Par_Val_2;
  D.1624_22 = (int *) ivtmp.38_26;
  D.1625_21 = D.1623_33 + D.1624_22;
  MEM[base: D.1622_34, index: D.1625_21, step: 4B, offset: 20B] = Int_Loc_3;
  ivtmp.38_35 = ivtmp.38_26 + 1;
  D.1626_20 = (unsigned int) Int_1_Par_Val_2;
  D.1627_17 = D.1626_20 + ivtmp.38_35;
  D.1628_16 = D.1627_17 + 5;
  Int_Index_15 = (One_Fifty) D.1628_16;
  if (D.1563_41 >= Int_Index_15) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 5> (<L0>);

and

<L0>:;
  MEM[base: (int *) prephitmp.33, index: (int *) Int_1_Par_Val + (int *) ivtmp.38, step: 4B, offset: 20B] = Int_Loc;
  ivtmp.38 = ivtmp.38 + 1;
  if ((One_Fifty) ((unsigned int) Int_1_Par_Val + 5 + ivtmp.38) <= D.1563) goto <L0>; else goto <L2>;

and from there:

.L5:
        leal    (%edi,%edx), %eax
        addl    $1, %edx
        movl    %ecx, 20(%ebx,%eax,4)
        leal    (%ecx,%edx), %eax
        cmpl    %esi, %eax
        jle     .L5

So it's a mix of PRE and IVOPTs that gives this strange code.

BTW regarding "It's strange that tree-ch messes up", please next time don't
blame random passes if you don't fully analyze the problem.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--
mmitchel at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--
pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |minor
   Target Milestone|---                         |4.1.1

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #3 from dberlin at gcc dot gnu dot org 2006-03-31 22:41 ---
Subject: Re: [4.1/4.2 Regression] -ftree-ch generates worse code

> Compare pretmp.28_49 with pretmp.32_11, why are the arguments in a
> different order? Is there something unstable in the PRE algorithm?

No, we just call fold on the expressions we build, and whatever it gives us,
we use :)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #1 from rguenth at gcc dot gnu dot org 2006-03-30 16:25 ---
Note that this may be also PRE confusing SCEV in presence of loop headers.
I.e. a sort of dup of PR26939. Confirmed though. A regression from 4.0.3,
which is also fine.

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |26939
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
 GCC target triplet|i686-pc-linux-gnu           |
           Keywords|                            |missed-optimization
      Known to work|                            |4.0.3
   Last reconfirmed|-00-00 00:00:00             |2006-03-30 16:25:17
               date|                            |
            Summary|-ftree-ch generates worse   |[4.1/4.2 Regression] -ftree-
                   |code                        |ch generates worse code

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944
[Bug tree-optimization/26944] [4.1/4.2 Regression] -ftree-ch generates worse code
--- Comment #2 from dann at godzilla dot ics dot uci dot edu 2006-03-30 16:43 ---
(In reply to comment #1)
> Note that this may be also PRE confusing SCEV in presence of loop headers.

Talking about PRE, here's a maybe interesting observation in the PRE dump:

<L7>:;
  pretmp.30_53 = Int_Loc.0_4 * 200;
  pretmp.32_23 = (int[50] *) pretmp.30_53;
  pretmp.32_11 = pretmp.32_23 + Arr_2_Par_Ref_30;
  goto <bb 4> (<L2>);

<L6>:;
  pretmp.27_59 = Int_Loc.0_4 * 200;
  pretmp.28_45 = (int[50] *) pretmp.27_59;
  pretmp.28_49 = Arr_2_Par_Ref_30 + pretmp.28_45;

  # Int_Index_37 = PHI <Int_Index_58(7), Int_Loc_3(5)>;
<L0>:;
  D.1544_54 = pretmp.27_59;
  D.1545_55 = pretmp.28_45;
  D.1546_56 = pretmp.28_49;
  (*D.1546_56)[Int_Index_37] = Int_Loc_3;
  Int_Index_58 = Int_Index_37 + 1;
  if (D.1548_41 >= Int_Index_58) goto <L8>; else goto <L9>;

<L8>:;
  goto <bb 3> (<L0>);

<L9>:;

  # prephitmp.33_40 = PHI <D.1546_56(8), pretmp.32_11(6)>;
  # prephitmp.33_18 = PHI <D.1545_55(8), pretmp.32_23(6)>;
  # prephitmp.31_25 = PHI <D.1544_54(8), pretmp.30_53(6)>;

Compare pretmp.28_49 with pretmp.32_11, why are the arguments in a different
order? Is there something unstable in the PRE algorithm?

One has to wonder what the effects of tree-ch are on more complex loops. It
might be interesting to test SPEC with and without tree-ch...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26944