On Thu, Aug 11, 2016 at 01:35:16PM +0200, Jan Hubicka wrote:
> > On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <[email protected]> wrote:
> > > Hi,
> > > while looking into profile mismatches introduced by the backward
> > > threading pass I noticed that the heuristic seems quite simplistic.
> > > First, it should be profile sensitive and disallow duplication when
> > > optimizing cold paths. Second, it should use estimate_num_insns,
> > > because the gimple statement count is not a very realistic estimate
> > > of the final code size effect. Third, there seems to be no reason
> > > to disable the pass for functions optimized for size.
> > >
> > > If we block duplication of more than 1 insn on size-optimized paths,
> > > the pass is still able to make the majority of threading decisions
> > > that are for free and improve codegen. The code size benefit was
> > > between 0.5% and 2.7% on the testcases I tried (tramp3d, GCC modules,
> > > xalancbmk and some other stuff from around my hard drive).
> > >
> > > Bootstrapped/regtested x86_64-linux, seems sane?
> > >
> > > The pass should also avoid calling cleanup_cfg when no threading was
> > > done, and I do not see why it is guarded by expensive_optimizations.
> > > What are the main compile-time complexity limitations?
> >
> > This patch caused a huge regression (~11%) on CoreMark on ThunderX.
> > I assume it affects other targets too. Basically, it looks like the
> > path is no longer jump threaded.
>
> Sorry for the late reply. I checked our periodic testers and the patch
> seems more or less performance neutral, with some code size improvements.
> Can you point me to the path that is no longer threaded? I added
> diagnostic output, so you should see the reason why the path was
> considered unprofitable - either it was cold or we exceeded the maximal
> size. The size limit is largely untuned, so perhaps we can just adjust it.
Yes, it is the "cold" edge heuristic that rejects these paths:

  FSM jump-thread path not considered: duplication of 18 insns is needed
  and optimizing for size.
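
For context, the check that emits this message has roughly the following
shape (a sketch only, not the exact GCC source; profitable_path_p is a
hypothetical wrapper, n_insns is assumed to be the estimate_num_insns
total over the path, and optimize_edge_for_speed_p is the real predicate
from predict.h). Note that the predicate returns false both at -Os and
for cold edges, which is why a cold edge produces an "optimizing for
size" message even at -O2:

  /* Sketch of the backward threader's profitability check (hypothetical
     fragment against GCC internals, not the committed code).  */
  static bool
  profitable_path_p (edge taken_edge, int n_insns)
  {
    /* Cold paths and -Os: only thread when at most one instruction
       would need to be duplicated.  */
    if (!optimize_edge_for_speed_p (taken_edge) && n_insns > 1)
      {
	if (dump_file && (dump_flags & TDF_DETAILS))
	  fprintf (dump_file,
		   "FSM jump-thread path not considered: duplication of "
		   "%i insns is needed and optimizing for size.\n", n_insns);
	return false;
      }
    return true;
  }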
For whatever reason, if I dump EDGE_FREQUENCY in thread2 for the rejected
edges that we used to accept, they have frequency 0 (!). This happens
because the source block of the edges (the controlling switch) has
frequency 0:
;; basic block 7, loop depth 3, count 0, freq 0
;; prev block 6, next block 8, flags: (NEW, REACHABLE)
;; pred: 5 [95.0%] (FALSE_VALUE,EXECUTABLE)
switch (state_504) <default: <L31>, case 0: <L24>, case 2: <L25>, case 3:
<L28>, case 4: <L26>, case 5: <L27>, case 6: <L29>, case 7: <L30>>
;; succ: 36 [12.5%] (EXECUTABLE)
;; 8 [12.5%] (EXECUTABLE)
;; 18 [12.5%] (EXECUTABLE)
;; 30 [12.5%] (EXECUTABLE)
;; 22 [12.5%] (EXECUTABLE)
;; 26 [12.5%] (EXECUTABLE)
;; 32 [12.5%] (EXECUTABLE)
;; 34 [12.5%] (EXECUTABLE)
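
That makes the arithmetic inevitable: GCC derives an edge's frequency
from its source block's frequency scaled by the edge's branch probability
(the EDGE_FREQUENCY macro), so a frequency-0 block makes every outgoing
edge look cold no matter what its probability is. A minimal
self-contained illustration (edge_frequency is a hypothetical stand-in
for the macro; REG_BR_PROB_BASE really is GCC's fixed-point probability
base, 10000):

  #include <stdio.h>

  #define REG_BR_PROB_BASE 10000  /* 100% == 10000 in GCC's fixed point.  */

  /* Stand-in for GCC's EDGE_FREQUENCY: source block frequency scaled
     by the edge's branch probability.  */
  static int
  edge_frequency (int src_frequency, int probability)
  {
    return src_frequency * probability / REG_BR_PROB_BASE;
  }

  int
  main (void)
  {
    /* The switch block above: frequency 0, successors at 12.5% each.
       Every outgoing edge computes to 0, so the paths look cold.  */
    printf ("%d\n", edge_frequency (0, 1250));     /* prints 0 */

    /* With a sane source frequency the same edge is clearly hot.  */
    printf ("%d\n", edge_frequency (8000, 1250));  /* prints 1000 */
    return 0;
  }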
It looks like it is thread1 that creates these blocks with zeroed
frequency information.
I haven't managed to find a reduced testcase that shows this issue.
Thanks,
James