https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102906
Richard Biener <rguenth at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Priority|P3 |P1 Last reconfirmed| |2021-11-09 Status|UNCONFIRMED |NEW Component|testsuite |tree-optimization Target| |arm Keywords| |missed-optimization Ever confirmed|0 |1 --- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> --- (In reply to Aldy Hernandez from comment #10) > (In reply to Christophe Lyon from comment #9) > > (In reply to Aldy Hernandez from comment #7) > > > > then size ivopts-4.o: > > > > text data bss dec hex filename > > > > 38 0 0 38 26 ivopts-4.o > > > > where the testcase expects text <= 36 > > > > > > Ohhhhh, this is an object size regression? This test seems very fragile. > > > Jump threading will alter code size, so any change in the threading rules > > > will likely have an effect on code size. I suggest you add > > > -fno-thread-jumps to the test and adjust the object-size test accordingly. > > > > I tried that, it doesn't change the generated code. > > The difference from before the commit til now is that there was a threading > path that was valid but is now disallowed. So adding -fno-thread-jumps > won't have any effect since it won't cause the disallowed threading path to > reappear. What I'm saying is that the test should be calibrated to the new > normal. > > Of course, it could be that another pass should pick up the slack here, or > that the restriction is too strict. Richi, do you have some insight here? > > At least for -Os -mthumb -mfloat-abi=hard -mfpu=neon -mtls-dialect=gnu > -mlibarch=armv7-a+mp+sec+neon-fp16 -march=armv7-a+mp+sec+neon-fp16, the > difference in the threader is that we used to thread 6->4->3 in DOM2, but we > no longer do so because doing so would rotate the loop: > > a.c.127t.dom2:Path rotates loop: Cancelling jump thread: (6, 4) incoming > edge; (4, 3) normal; > > I still think testing for some magic code size is fragile at best. So it seems we intentionally allowed some loop header copying done by the threader based on the idea that it knows when it can do so without a code size penalty. The loop header copying pass has static bool should_duplicate_loop_header_p (basic_block header, class loop *loop, int *limit) { gimple_stmt_iterator bsi; gcc_assert (!header->aux); /* Loop header copying usually increases size of the code. This used not to be true, since quite often it is possible to verify that the condition is satisfied in the first iteration and therefore to eliminate it. Jump threading handles these cases now. */ if (optimize_loop_for_size_p (loop) && !loop->force_vectorize) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, " Not duplicating bb %i: optimizing for size.\n", header->index); return false; and indeed for the testcase we have Analyzing loop 1 Loop 1 is not do-while loop: latch is not empty. Not duplicating bb 4: optimizing for size. and were probably expecting jump-threading to rotate the loop. Now, the forward threader is set up to eventually not destroy loop info but then loop-header copying might be able to either use ranger to tell the looping condition is always true [and the block to copy is empty] or loop header copying could use the jump threading path code which might be able to determine all this(?) That said, for GCC 12 we probably need to look to allow those threadings, but only once and for the forward threader? The early threader doesn't seem to discover this opportunity for some reason. At the point we run jt_path_registry::cancel_invalid_paths do we have any ideas about the size of the copying we do? So the pattern we'd allow is a threading with entry being the single entry edge into the loop and the exit being the other edge from the single_exit source block of it. I wonder why the late DOM threading doesn't catch this? We'd allow the threading there with if (cfun->curr_properties & PROP_loop_opts_done) return false; the late pass sees <bb 2> [local count: 200189151]: if (n_8(D) > 0) goto <bb 3>; [59.00%] else goto <bb 6>; [41.00%] <bb 3> [local count: 118111600]: ivtmp.9_16 = (unsigned int) array_9(D); _17 = (unsigned int) n_8(D); _18 = _17 * 4; _20 = ivtmp.9_16 + _18; goto <bb 5>; [100.00%] <bb 5> [local count: 1073741824]: # sum_5 = PHI <0(3), sum_11(4)> # ivtmp.9_14 = PHI <ivtmp.9_16(3), ivtmp.9_15(4)> if (ivtmp.9_14 != _20) goto <bb 4>; [89.00%] else goto <bb 6>; [11.00%] so appearantly with IVCANON introducing a canonical IV ranger cannot relate ivtmp.9_16 and _20 using the n_8(D) > 0 condition (OK, that looks mightly complex, but mostly because n_8(D) * 4 is involved which we don't know it doesn't overflow to zero). diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 8aac733ac25..3670ec5b8d0 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -2815,6 +2815,20 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path) && flow_loop_nested_p (exit->dest->loop_father, exit->src->loop_father)) return false; + /* Allow loop header copying when we know the entry condition is + optimized away. The loop header copying pass relies on those + being done here when optimizing for size. */ + edge e; + if (entry->dest == exit->src + && entry->dest->loop_father->header == entry->dest + /* Alternatively check for a single non-backedge pred of entry->dest. */ + && loops_state_satisfies_p (cfun, LOOPS_HAVE_PREHEADERS) + && (e = single_exit (entry->dest->loop_father)) + && exit != e + && exit->src == e->src + && optimize_loop_for_size_p (entry->dest->loop_father)) + return false; + if (cfun->curr_properties & PROP_loop_opts_done) return false; fixes the missing jump threading. Possibly the predicate should be put next to should_duplicate_loop_header_p in tree-ssa-loop-ch.c.