[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2014-09-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

Richard Biener  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |INVALID

--- Comment #10 from Richard Biener  ---
Please open new bugs.


[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2014-09-10 Thread mrs at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

mrs at gcc dot gnu.org  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
   Last reconfirmed||2014-09-10
 CC||mrs at gcc dot gnu.org
 Resolution|INVALID |---
 Ever confirmed|0   |1

--- Comment #9 from mrs at gcc dot gnu.org  ---
Customer unhappy.


[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-09 Thread anton at mips dot complang dot tuwien dot ac dot at


--- Comment #8 from anton at mips dot complang dot tuwien dot ac dot at  
2005-12-09 13:46 ---
Subject: Re:  pessimization of goto * ("computed goto")

rguenth at gcc dot gnu dot org wrote:
> 
> 
> 
> --- Comment #7 from rguenth at gcc dot gnu dot org  2005-12-09 11:51 
> ---
> > 2) If you do reorder the blocks, you should not merge indirect
> > branches on CPUs with BTBs, for better branch prediction.
> 
> I would rather say that you should not merge frequently executed
> indirect branches.

Yes, that's a refinement.

> There is certainly an execution-frequency threshold below which
> merging is profitable again,

Yes, once the indirect branch gets evicted from the BTB between two
executions, you might just as well merge it with another indirect
branch of the same kind.  While not all rarely executed indirect
branches will have this property (the few executions might occur in a
burst), many of them will, and for the rest the number of executions
and thus the penalty of doing it wrong will be low.
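
To put rough numbers on that argument, here is a toy cost model (all
constants are assumptions, not measurements):

/* Worst-case extra cost of merging an indirect branch: every
 * execution mispredicts, at some fixed penalty per miss.  The point
 * is that for a cold branch even the worst case is cheap, while for
 * a hot branch it is enormous. */
#include <stdio.h>

int main(void)
{
  const double penalty_cycles = 20.0;   /* assumed misprediction penalty */
  const double cold_execs = 100.0;      /* rarely executed branch */
  const double hot_execs  = 1e8;        /* frequently executed branch */

  printf("cold branch, worst case: %.0f cycles\n",
         cold_execs * penalty_cycles);  /* 2000 cycles: negligible */
  printf("hot branch, worst case:  %.0f cycles\n",
         hot_execs * penalty_cycles);   /* 2e9 cycles: dominant */
  return 0;
}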

OTOH, the potential speed benefit of this refinement is also low; so
the main benefit is probably in code size.

- anton


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-09 Thread rguenth at gcc dot gnu dot org


--- Comment #7 from rguenth at gcc dot gnu dot org  2005-12-09 11:51 ---
> 2) If you do reorder the blocks, you should not merge indirect
> branches on CPUs with BTBs, for better branch prediction.

I would rather say that you should not merge frequently executed
indirect branches.  There is certainly an execution-frequency
threshold below which merging is profitable again, and the best
approach is probably to disable merging only for frequently executed
indirect branches (as determined by profile feedback) while retaining
merging of the seldom-executed ones.
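
In code, the suggested heuristic would amount to something like this
(hypothetical helper names and threshold, not actual GCC internals):

/* Merge an indirect branch into a shared dispatch site only when
 * profile feedback says it is cold; hot branches keep their own
 * branch (and thus their own BTB entry). */
#include <stdbool.h>

#define HOT_THRESHOLD 10000UL  /* executions; a tuning parameter */

struct ibranch {
  unsigned long exec_count;    /* from profile feedback */
};

static bool should_merge_indirect_branch(const struct ibranch *b)
{
  return b->exec_count < HOT_THRESHOLD;
}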


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-08 Thread anton at mips dot complang dot tuwien dot ac dot at


--- Comment #6 from anton at mips dot complang dot tuwien dot ac dot at  
2005-12-08 21:31 ---
Subject: Re:  pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> --- Comment #5 from pinskia at gcc dot gnu dot org  2005-12-06 21:58 
> ---
> (In reply to comment #4)
> > So, no, the code is not worse, but much better.  I hope this
> > workaround will continue to work in future versions.
> 
> You are wrong in general, since this is a conditional indirect jump.
> Since it is conditional, it is going to do a jump anyway, and the
> locality argument is not as important as in the old days, when code
> caches were small.  In fact, doing a jne instead of a je might cause
> the branch to be mispredicted.

Sorry, you lost me here.  Conditional branch predictors on current
general-purpose CPUs are history-based, and I would not expect any
difference in the accuracy of the conditional branch prediction.
However, for BTB-based indirect branch predictors (Pentium 3, Athlon
64, and (modulo replication from the trace cache) Pentium 4), the
branch prediction accuracy suffers quite a lot if you combine several
well-predictable indirect branches with different targets into a
single indirect branch.

See [ertl&gregg03] for a deeper discussion, in particular Section 3.
You can also read in Section 5.2 (towards the end) why we don't want
to have a jump to far-away places.
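
For a concrete picture of what the paper (and this bug) is about, here
is a toy threaded-code interpreter in GNU C; the opcode set and program
are made up for illustration:

/* Replicated dispatch: each VM instruction ends with its own
 * "goto *labels[*ip++]", so each opcode's indirect branch gets its
 * own BTB entry and predicts only its own typical successor.
 * Merging these branches into one shared jump (which is what this
 * bug report complains about) makes a single BTB entry predict the
 * successors of *all* opcodes, ruining the accuracy. */
#include <stdio.h>

static int run(const int *prog, int acc)
{
  static void *labels[] = { &&op_inc, &&op_dec, &&op_halt };
  const int *ip = prog;

  goto *labels[*ip++];          /* initial dispatch */
op_inc:
  acc++;
  goto *labels[*ip++];          /* this opcode's own indirect branch */
op_dec:
  acc--;
  goto *labels[*ip++];          /* this opcode's own indirect branch */
op_halt:
  return acc;
}

int main(void)
{
  const int prog[] = { 0, 0, 1, 2 };   /* inc, inc, dec, halt */
  printf("%d\n", run(prog, 0));        /* prints 1 */
  return 0;
}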

> Note that if you were actually using a target which has conditional
> indirect jumps, this would be a bug (PPC, for example, from either
> the lr or ctr register; see PR 25287 for a bug report about that).

Sure, having a conditional indirect jump in-line would be nice.

But if the architecture does not have it (or if gcc does not utilize
it), what I would like to see in the resulting code is:

1) We compile with -fno-reorder-blocks, so the indirect branch should
be in the place corresponding to the source code, not somewhere else.

2) If you do reorder the blocks, you should not merge indirect
branches on CPUs with BTBs, for better branch prediction.

BTW, the __asm__("") workaround works nicely (for now), so I could
produce numbers for the slowdown for this bug, if you are interested.

@InProceedings{ertl&gregg03,
  author =   "M. Anton Ertl and David Gregg",
  title ="Optimizing Indirect Branch Prediction Accuracy in Virtual
Machine Interpreters",
  crossref = "sigplan03",
  OPTpages = "",
  url =  "http://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
  abstract = "Interpreters designed for efficiency execute a huge
  number of indirect branches and can spend more than
  half of the execution time in indirect branch
  mispredictions.  Branch target buffers are the best
  widely available\mn{on all recent general-purpose
  machines?} form of indirect branch prediction;
  however, their prediction accuracy for existing
  interpreters is only 2\%--50\%.  In this paper we
  investigate two methods for improving the prediction
  accuracy of BTBs for interpreters: replicating
  virtual machine (VM) instructions and combining
  sequences of VM instructions into superinstructions.
  We investigate static (interpreter build-time) and
  dynamic (interpreter run-time) variants of these
  techniques and compare them and several combinations
  of these techniques.  These techniques can eliminate
  nearly all of the dispatch branch mispredictions,
  and have other benefits, resulting in speedups by a
  factor of up to 3.17 over efficient threaded-code
  interpreters, and speedups by a factor of up to 1.3
  over techniques relying on superinstructions alone."
}

- anton


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-06 Thread pinskia at gcc dot gnu dot org


--- Comment #5 from pinskia at gcc dot gnu dot org  2005-12-06 21:58 ---
(In reply to comment #4)
> So, no, the code is not worse, but much better.  I hope this
> workaround will continue to work in future versions.

You are wrong in general, since this is a conditional indirect jump.  Since
it is conditional, it is going to do a jump anyway, and the locality
argument is not as important as in the old days, when code caches were
small.  In fact, doing a jne instead of a je might cause the branch to be
mispredicted.
Note that if you were actually using a target which has conditional
indirect jumps, this would be a bug (PPC, for example, from either the lr
or ctr register; see PR 25287 for a bug report about that).
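
To illustrate what a conditional indirect jump looks like at the source
level, here is a small GNU C sketch (toy function, made-up names); on
x86 the marked line has to become a conditional branch plus a separate
"jmp *reg", while PPC can branch conditionally through the ctr register
in a single instruction:

/* A conditional computed goto: "if (...) goto *p;". */
int classify(int x)
{
  void *where = (x == 0) ? &&zero : &&nonzero;
  if (x >= 0)
    goto *where;   /* the conditional indirect jump in question */
  return -1;       /* negative values fall through */
zero:
  return 0;
nonzero:
  return 1;
}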


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-06 Thread anton at mips dot complang dot tuwien dot ac dot at


--- Comment #4 from anton at mips dot complang dot tuwien dot ac dot at  
2005-12-06 21:46 ---
Subject: Re:  pessimization of goto * ("computed goto")

pinskia at gcc dot gnu dot org wrote:
> __asm__(""); makes stuff worse.

I just applied the following patch to engine1.i:

--- engine1.i   2005-12-06 22:34:08.0 +0100
+++ engine1.i~  2005-12-06 19:04:12.0 +0100
@@ -17562,7 +17562,7 @@
 {
 # 188 "./java.vmg"
 {
-  { if ((aArray) == ((void *)0)) { __asm__(""); goto *throw_nullpointerexception; } };
+  { if ((aArray) == ((void *)0)) { goto *throw_nullpointerexception; } };
   { if (( ((java_arrayheader*)(aArray))->size ) <= (u4) (iIndex)) { arrayindexoutofbounds_index = (iIndex); goto *throw_arrayindexoutofboundsexception; } };
   ;
   vResult = (( ((java_intarray*)(aArray))->data )[iIndex]);

The result looks much better; in particular, instead of the

je  .L995

I now get:

jne .L762
movq    -136(%rbp), %rdx
.LBE46:
.loc 2 231 0
jmp *%rdx
.L762:

I.e., there is now a separate indirect jump, which improves branch
prediction (not important in this case, because this indirect branch
will rarely be executed).  More importantly, there are no longer any
direct branches to far-away places; those were what destroyed the
relocatability (and thus the suitability for "selective inlining") of
this code fragment.

So, no, the code is not worse, but much better.  I hope this
workaround will continue to work in future versions.
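
For reference, a minimal self-contained version of the workaround (toy
code, made-up names): the empty asm acts as an optimization barrier, so
the compiler cannot prove the two dispatch sequences identical and, at
least in the versions discussed here, keeps them as separate indirect
branches instead of merging them:

#include <stdio.h>

static int f(int x)
{
  void *next = (x & 1) ? &&odd : &&even;
  if (x < 0) { __asm__(""); goto *next; }  /* stays a separate branch */
  goto *next;                              /* the other dispatch site */
even:
  return 0;
odd:
  return 1;
}

int main(void)
{
  printf("%d %d %d\n", f(2), f(3), f(-1));  /* prints: 0 1 1 */
  return 0;
}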

- anton


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-06 Thread pinskia at gcc dot gnu dot org


--- Comment #3 from pinskia at gcc dot gnu dot org  2005-12-06 20:57 ---
Since there are no conditional indirect jumps on x86, this is not a bug.


-- 

pinskia at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution||INVALID


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285



[Bug rtl-optimization/25285] pessimization of goto * ("computed goto")

2005-12-06 Thread pinskia at gcc dot gnu dot org


--- Comment #2 from pinskia at gcc dot gnu dot org  2005-12-06 20:46 ---
__asm__(""); makes stuff worse.

Anyway, it looks like the code in GCC only deals with unconditional jumps
(unless I am missing something).
Is there a conditional form of indirect jump for x86/x86_64?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285