On Thu, Aug 11, 2016 at 01:35:16PM +0200, Jan Hubicka wrote:
> > On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> > > Hi,
> > > while looking into profile mismatches introduced by the backward
> > > threading pass, I noticed that the heuristics seem quite simplistic.
> > > First, they should be profile-sensitive and disallow duplication when
> > > optimizing cold paths. Second, they should use estimate_num_insns,
> > > because the gimple statement count is not a very realistic estimate of
> > > the final code size effect. Third, there seems to be no reason to
> > > disable the pass for functions optimized for size.
> > >
> > > If we block duplication of more than 1 insn on size-optimized paths,
> > > the pass is able to make the majority of threading decisions that are
> > > for free and improve codegen.
> > > The code size benefit was between 0.5% and 2.7% on the testcases I
> > > tried (tramp3d, GCC modules, xalancbmk and some other stuff around
> > > my hd).
> > >
> > > Bootstrapped/regtested x86_64-linux, seems sane?
> > >
> > > The pass should also avoid calling cleanup_cfg when no threading was
> > > done, and I do not see why it is guarded by expensive_optimizations.
> > > What are the main compile-time complexity limitations?
> >
> > This patch caused a huge regression (~11%) on coremarks on ThunderX.
> > I assume other targets too.
> > Basically it looks like the path is no longer thread jumped.
>
> Sorry for the late reply. I checked our periodic testers and the patch
> seems more or less performance neutral, with some code size improvements.
> Can you point me to the path that is no longer crossjumped? I added diag
> output, so you should see the reason why the path was considered
> unprofitable - either it was cold or we exceeded the maximal size. The
> size limit is largely untuned, so perhaps we can just adjust it.
Yes, it is going to be the "cold" edge heuristic that does it:

  FSM jump-thread path not considered: duplication of 18 insns is needed
  and optimizing for size.

For whatever reason, in the case of the edges in this benchmark, if I dump
EDGE_FREQUENCY for the rejected edges in thread2 that we used to accept,
they have frequency 0 (!). This happens because the source block of the
edge (the controlling switch) has frequency 0:

;; basic block 7, loop depth 3, count 0, freq 0
;; prev block 6, next block 8, flags: (NEW, REACHABLE)
;;  pred:       5 [95.0%]  (FALSE_VALUE,EXECUTABLE)
switch (state_504) <default: <L31>, case 0: <L24>, case 2: <L25>, case 3: <L28>, case 4: <L26>, case 5: <L27>, case 6: <L29>, case 7: <L30>>
;;  succ:       36 [12.5%]  (EXECUTABLE)
;;              8 [12.5%]  (EXECUTABLE)
;;              18 [12.5%]  (EXECUTABLE)
;;              30 [12.5%]  (EXECUTABLE)
;;              22 [12.5%]  (EXECUTABLE)
;;              26 [12.5%]  (EXECUTABLE)
;;              32 [12.5%]  (EXECUTABLE)
;;              34 [12.5%]  (EXECUTABLE)

It looks like it is thread1 that creates these blocks with empty frequency
information. I haven't managed to find a reduced testcase that shows this
issue.

Thanks,
James