Hi, while looking into problems of current and pretty-ipa's inlining heuristics implementation, I noticed that we have relatively important problems with EH overhead that confuse inliner code size metrics.
Looking deeper into EH problems, I think main issues are: - Inliner tends to produce exponential amount of destructor calls on programs with high abstraction penalty (pooma/boost/DLV all seem to have this). Every scope block such as { class a a;.....} contains implicit cleanup. Destructor of class A is called twice, once for ned of block, once for cleanup. Now it is common for destructors to call destuctors of contained objects and local iterators, this all happens in same way causing EH cleanup code that often exceed main function body. - Not inlining destructors in cleanups cause otherwise fully inlined object to miss SRA/store sinking and other optimizations. - We tend to block optimization because variable along hot path has abnormal PHI. Interestingly this usually limits to creating more copies, PHIs, conditionals and BBs that gets almost completely cleaned up later on RTL land, but I don't think it is good excuse to ignore these ;) - Cleanups calling destructors are often fully removable but we don't do that very well. We remove completely dead destructor by means of new local-pureconst pass. As soon as destructor has some stores or loops, we realize the fact that they are dead very late in compilation. Also destructors calling delete() on some fields are not simplified. This is partly doable via Martin's IPA-SRA. I thus started to look into improvements in EH machinery and elimination of code in the cleanups What is already done on mainline: - New ehcleanup pass: when EH handler becomes empty by local pureconst+DCE+DSE+complette unrolling combo, we can safely eliminate it. This further eliminate local nothrow region handlers that are numberous. - There is new local pureconst pass done during early optimization. It tries to prove stuff nothrow and pure. This leads to DCEing the destructors before they are accounted in inline metricts. - Some of bits from RTL EH lowering are gone now. More should come We really have here partial transition from RTL EH code to gimple. This will hopefully simplify RTL EH code to basic part adding landing pads. What is already done on pretty-ipa: - Inliner herusitics can be bit smarter about code size estimates on how code will look after inlining. In particular MUST_NOT_THROW receivers are most likely optimized away or commonized and can be ignored. Also all the FILTER_EXPR/OBJ_REF_EXPR sets/reads are probably going to disappear after lowering. What I have implementation for and would like to push to pretty-ipa and later for mainline after some more evaulation: - EH edge redirection. This is conceptually easy thing to do. When EH edge needs to be redirected and it is only edge to BB, we just update label in tree of EH handlers. When it is not only edge, we walk from the throwing statement region up to the outermost handler region in the EH tree and duplicate everything on a way. This breaks some assumptions EH code have. In particular on EH handler can do RESX in other handler and handlers can share labels. These assumptions are made in except.c, but they don't have to be: EH tree is really just on-side representaiton of decision tree of what happens after expression and is lowered to such for for dwarf. There is no need to know what happens after EH is delivered. While I don't expect immediate improvements in C++ codegen. I benchmarked it in C++ testers and there is some improvment in libstdc++ tests, little improvement in boot's wave and cwcessboard. Partly it is because all testcases we have do pretty much no EH except for one done implicitly by cleanup regions. The testcases improving in libstdc++ do have try....catch constructs in them. We probably could try to find some new benchmarks for C++ testsuite to cover this area as well as cases where it is not best solution to inline everything completely. Probably something like mozilla would do the job, but I don't know if there is easy to compile benchmark aplication for this. Our C++ testsuite is pretty much testsuite for specific kind of C++ coding requiring inliner to do a lot of job. This is not surprising because it was constructed with inliner tunning in mind. I think this is important infrastructure bit. In particular it seems to me that since this leaves us with abnormal edges produced by setjmp/longjmp, nonlocal labels and computed goto and because of computed goto factoring, only longjmp/setjmp edges are really unsplittable and thus we can get rid of abnormal phi handling and teach out-of-ssa to insert conditionals into abnormal edge receiving BB to resolve the unlikely case we will make copy on abnormal edge neccesary. Redirecting EH edges also allows sinking code needed only on EH path down to the edges that is quite common because of inlining differences above. - EH region merging: after critical edge splitting we produce a lot of copies of EH regions by EH edge redirecition. It is simple to merge them again in ehcleanup by hashing them and proving equivalence. - It is possible to DCE FILTER_EXPR (and OBJ_REF_EXPR). I use simple code that try to verify that value stored to FILTER_EXPR is read in same BB (and thus provably don't change value as described bellow) and do not set the magic needed bit on it in that case. This kills most of FILTER_EXPR sets. Some more or less experimental stuff I have: - We should finish the transition and move as much as possible of EH lowering up to gimple world. Producing post-landing pads in gimple is easy (they are really just conditionals on FILTER_EXPR/OBJ_REF_EXPR) and lowering RESX to goto is easy too. This allows scalar optimizations on the produced code. There is problem that after lowering it is no longer easy to do dead cleanup elimination. So we probably should do this transform somewhere in half of our optimization queue. I also experimented with lowering only the trivial cases pre-inlining. Some remaining issues: - FILTER_EXPR/OBJ_REF_EXPR is currently handled in quite dangerous way. Original Rth's code made them quite 100% volatile. Now we can PRE them. The FILTER_EXPR/OBJ_REF_EXPR are really hard registers in RTL world that are set by EH runtime on EH edges from throwing statement to handler (not at RESX edges) and they are used by pre-landing pads and also by RESX code. It would be more precise if RESX instruction took FILTER_EXPR/OBJ_REF_EXPR value as argument (since it really uses the values) so we can kill magicness of the sets. Problem is that I don't think there is good SSA representation for register that implicitly change over edges. Take the example call (); EH edge to handler 1: .... receiver 1: tmp1 = filter_expr; tmp2 = obj_ref_expr; call2 (); EH edge to handler 2 label1: filter_expr = tmp1 obj_ref_expr = tmp2 resx (tmp1, tmp2) handler 2: tmp3 = filter_expr; tmp4 = obj_ref_expr; if (conditional) goto label1: else filter_expr = tmp3 obj_ref_expr = tmp4 resx (tmp3, tmp4); In this case tmp1 != tmp3 and tmp2 != tmp4 and thus it is invalid to optimize second resx to (tmp1, tmp3). There is nothing to model this. I wonder if FILTER_EXPR/OBJ_REF_EXPR can't be best handled by just being volatile to majority of optimizations (i.e. assumed to change value all the time) and handle this in copyprop/PRE as specal case invalidating the value across EH edges? - The nature of code duplication in between cleanup at end of block and cleanup in EH actually brings a lot of tail merging possibilities. I wonder if we can handle this somehow effectivly on SSA. In RTL world crossjumping would catch thse cases if it was not almost 100% ineffective by fact that we hardly re-use same register for temporaries. I wonder if there is resonable SSA optimization that would have similar effect as tail merging here or if we want to implement tail merging on gimple. - Can we somehow conclude that structure being desturcted dies after the destructors so all writes to it are dead already in early optimizations? That would allow a lot more DSE and cheaper inlining. - It is possible to make EH edges redirectable on RTL too. I wonder if it is worth the effort however. - We ought to be able to prove finitarity of simple loops so we can DCE more early and won't rely on full loop unrolling to get rid of empty loops originally initializing dead arrays. Honza