[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

Richard Biener changed:

           What       |Removed    |Added
           Status     |REOPENED   |RESOLVED
           Resolution |---        |INVALID

--- Comment #10 from Richard Biener ---
Please open new bugs.
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

mrs at gcc dot gnu.org changed:

           What             |Removed   |Added
           Status           |RESOLVED  |REOPENED
           Last reconfirmed |          |2014-09-10
           CC               |          |mrs at gcc dot gnu.org
           Resolution       |INVALID   |---
           Ever confirmed   |0         |1

--- Comment #9 from mrs at gcc dot gnu.org ---
Customer unhappy.
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #8 from anton at mips dot complang dot tuwien dot ac dot at 2005-12-09 13:46 ---
Subject: Re: pessimization of goto * ("computed goto")

rguenth at gcc dot gnu dot org wrote:
> --- Comment #7 from rguenth at gcc dot gnu dot org 2005-12-09 11:51 ---
> > 2) If you do reorder the blocks, you should not merge indirect
> > branches on CPUs with BTBs, for better branch prediction.
>
> I would rather say that you should not merge frequently executed
> indirect branches.

Yes, that's a refinement.

> There is certainly an upper bound of indirect
> branches after that merging is profitable again,

Yes, once the indirect branch gets evicted from the BTB between two
executions, you might just as well merge it with another indirect branch
of the same kind.  While not all rarely executed indirect branches will
have this property (the few executions might occur in a burst), many of
them will, and for the rest the number of executions and thus the
penalty of doing it wrong will be low.

OTOH, the potential speed benefit of this refinement is also low; so the
main benefit is probably in code size.

- anton

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #7 from rguenth at gcc dot gnu dot org 2005-12-09 11:51 ---
> 2) If you do reorder the blocks, you should not merge indirect
> branches on CPUs with BTBs, for better branch prediction.

I would rather say that you should not merge frequently executed
indirect branches.  There is certainly an upper bound on the number of
indirect branches above which merging becomes profitable again, and
disabling merging only for frequently executed (from profile feedback)
indirect branches, while retaining merging of the seldom executed ones,
will probably be the best idea.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #6 from anton at mips dot complang dot tuwien dot ac dot at 2005-12-08 21:31 ---
Subject: Re: pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> --- Comment #5 from pinskia at gcc dot gnu dot org 2005-12-06 21:58 ---
> (In reply to comment #4)
> > So, no, the code is not worse, but much better.  I hope this
> > workaround will continue to work in future versions.
>
> You are wrong in general since this is a conditional indirect jump.
> Since it is conditional, it is going to do a jump anyway, and the
> locality reasons are not as important as in the old days when there
> was a small code cache.  In fact, doing jne instead of je might cause
> the branch to be mispredicted.

Sorry, you lost me here.  Conditional branch predictors on current
general-purpose CPUs are history-based, and I would not expect any
difference in the accuracy of the conditional branch prediction.

However, for BTB-based indirect branch predictors (Pentium 3, Athlon 64,
and (modulo replication from the trace cache) Pentium 4), the branch
prediction accuracy suffers quite a lot if you combine several
well-predictable indirect branches with different targets into a single
indirect branch.  See [ertl&gregg03] for a deeper discussion, in
particular Section 3.  You can also read in Section 5.2 (towards the
end) why we don't want to have a jump to far-away places.

> Note if you were actually using a target which has conditional
> indirect jumps, this would be a bug (PPC for example, from either the
> lr or ctr register; see PR 25287 for a bug report about that).

Sure, having a conditional indirect jump in-line would be nice.  But if
the architecture does not have it (or if gcc does not utilize it), what
I would like to see in the resulting code is:

1) We compile with -fno-reorder-blocks, so the indirect branch should be
   in the place corresponding to the source code, not somewhere else.
2) If you do reorder the blocks, you should not merge indirect branches
   on CPUs with BTBs, for better branch prediction.

BTW, the __asm__("") workaround works nicely (for now), so I could
produce numbers for the slowdown for this bug, if you are interested.

@InProceedings{ertl&gregg03,
  author   = "M. Anton Ertl and David Gregg",
  title    = "Optimizing Indirect Branch Prediction Accuracy in Virtual
              Machine Interpreters",
  crossref = "sigplan03",
  url      = "http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
  abstract = "Interpreters designed for efficiency execute a huge number
              of indirect branches and can spend more than half of the
              execution time in indirect branch mispredictions.  Branch
              target buffers are the best widely available form of
              indirect branch prediction; however, their prediction
              accuracy for existing interpreters is only 2%--50%.  In
              this paper we investigate two methods for improving the
              prediction accuracy of BTBs for interpreters: replicating
              virtual machine (VM) instructions and combining sequences
              of VM instructions into superinstructions.  We investigate
              static (interpreter build-time) and dynamic (interpreter
              run-time) variants of these techniques and compare them
              and several combinations of these techniques.  These
              techniques can eliminate nearly all of the dispatch branch
              mispredictions, and have other benefits, resulting in
              speedups by a factor of up to 3.17 over efficient
              threaded-code interpreters, and speedups by a factor of up
              to 1.3 over techniques relying on superinstructions
              alone."
}

- anton

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #5 from pinskia at gcc dot gnu dot org 2005-12-06 21:58 ---
(In reply to comment #4)
> So, no, the code is not worse, but much better.  I hope this
> workaround will continue to work in future versions.

You are wrong in general since this is a conditional indirect jump.
Since it is conditional, it is going to do a jump anyway, and the
locality reasons are not as important as in the old days when there was
a small code cache.  In fact, doing jne instead of je might cause the
branch to be mispredicted.

Note if you were actually using a target which has conditional indirect
jumps, this would be a bug (PPC for example, from either the lr or ctr
register; see PR 25287 for a bug report about that).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #4 from anton at mips dot complang dot tuwien dot ac dot at 2005-12-06 21:46 ---
Subject: Re: pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> __asm__(""); makes stuff worse.

I just applied the following patch to engine1.i:

--- engine1.i   2005-12-06 22:34:08.0 +0100
+++ engine1.i~  2005-12-06 19:04:12.0 +0100
@@ -17562,7 +17562,7 @@
 {
 # 188 "./java.vmg"
 {
-  { if ((aArray) == ((void *)0)) { __asm__(""); goto *throw_nullpointerexception; } };
+  { if ((aArray) == ((void *)0)) { goto *throw_nullpointerexception; } };
   { if (( ((java_arrayheader*)(aArray))->size ) <= (u4) (iIndex)) { arrayindexoutofbounds_index = (iIndex); goto *throw_arrayindexoutofboundsexception; } };
   ;
   vResult = ((((java_intarray*)(aArray))->data)[iIndex]);

The result looks much better; in particular, instead of the je .L995 I
now get:

        jne     .L762
        movq    -136(%rbp), %rdx
.LBE46:
        .loc 2 231 0
        jmp     *%rdx
.L762:

I.e., a separate indirect jump for improved branch prediction (not
important in this case, because this indirect branch will rarely be
used).  More importantly, no direct branches to far-away places, which
did destroy the relocatability (and thus the suitability for "selective
inlining") of this code fragment.

So, no, the code is not worse, but much better.  I hope this workaround
will continue to work in future versions.

- anton

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #3 from pinskia at gcc dot gnu dot org 2005-12-06 20:57 ---
Since there are no conditional indirect jumps on x86, this is not a bug.

--
pinskia at gcc dot gnu dot org changed:

           What       |Removed      |Added
           Status     |UNCONFIRMED  |RESOLVED
           Resolution |             |INVALID

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")
--- Comment #2 from pinskia at gcc dot gnu dot org 2005-12-06 20:46 ---
__asm__(""); makes stuff worse.  Anyway, it looks like the code in GCC
only deals with unconditional jumps (unless I am missing something).
Is there a conditional form of an indirect jump for x86/x86_64?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285