[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3
--- Comment #26 from hubicka at gcc dot gnu dot org 2008-09-06 12:00 ---
IRA seems to fix the remaining problem with the spill in the internal loop on
32bit nicely, so we produce good scores for gzip compared to older GCC
versions:
http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-32bit/164_gzip_big.png
and with profile feedback
http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-FDO/164_gzip_big.png
we get close to ICC scores.

We now output the comparison loop as:

.L98:
        movzbl  1(%eax), %edx
        leal    1(%eax), %edi   # scan
        cmpb    1(%ecx), %dl
        jne     .L161
        movzbl  2(%eax), %edx
        leal    2(%eax), %edi   # scan
        cmpb    2(%ecx), %dl
        jne     .L161
        movzbl  3(%eax), %edx
        leal    3(%eax), %edi   # scan
        cmpb    3(%ecx), %dl
        jne     .L161
        movzbl  4(%eax), %edx
        leal    4(%eax), %edi   # scan
        cmpb    4(%ecx), %dl
        jne     .L161
        movzbl  5(%eax), %edx
        leal    5(%eax), %edi   # scan
        cmpb    5(%ecx), %dl
        jne     .L161
        movzbl  6(%eax), %edx
        leal    6(%eax), %edi   # scan
        cmpb    6(%ecx), %dl
        jne     .L161
        movzbl  7(%eax), %edx
        leal    7(%eax), %edi   # scan
        cmpb    7(%ecx), %dl
        jne     .L161

There is still room for improvement, however. The remaining problem is that
we still miss coalescing of scan_end and scan_end1 (so -fno-tree-dominator-opts
-fno-tree-copyrename still helps). Vladimir, perhaps this can be solved in IRA
too?

Honza

--
hubicka at gcc dot gnu dot org changed:
        CC: added vmakarov at redhat dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #27 from hubicka at gcc dot gnu dot org 2008-09-06 12:02 ---
Also just noticed that the offline copy of longest_match gets an extra move:

.L15:
        movzbl  2(%eax), %edi   # tmp87
        leal    2(%eax), %ecx   # scan.158
        movl    %edi, %edx      # tmp87
        cmpb    2(%ebx), %dl
        jne     .L6
        movzbl  3(%eax), %edi   # tmp88
        leal    3(%eax), %ecx   # scan.158
        movl    %edi, %edx      # tmp88
        cmpb    3(%ebx), %dl
        jne     .L6
        movzbl  4(%eax), %edi   # tmp89
        leal    4(%eax), %ecx   # scan.158
        movl    %edi, %edx      # tmp89
        cmpb    4(%ebx), %dl
        jne     .L6
        movzbl  5(%eax), %edi   # tmp90
        leal    5(%eax), %ecx   # scan.158
        movl    %edi, %edx      # tmp90
        cmpb    5(%ebx), %dl
        jne     .L6

while the inlined copy is fine:

.L98:
        movzbl  1(%eax), %edx
        leal    1(%eax), %edi   # scan
        cmpb    1(%ecx), %dl
        jne     .L161
        movzbl  2(%eax), %edx
        leal    2(%eax), %edi   # scan
        cmpb    2(%ecx), %dl
        jne     .L161
        movzbl  3(%eax), %edx
        leal    3(%eax), %edi   # scan
        cmpb    3(%ecx), %dl
        jne     .L161
        movzbl  4(%eax), %edx
        leal    4(%eax), %edi   # scan
        cmpb    4(%ecx), %dl
        jne     .L161

interesting :)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #25 from hubicka at gcc dot gnu dot org 2008-02-08 15:39 ---
-fno-tree-dominator-opts -fno-tree-copyrename solves the coalescing problem
(the name is introduced by the second pass, the actual problematic pattern by
the first), saving roughly 1s at -O2 and 2s at -O3; -O3 is still worse,
however. The internal loop no longer spills, it just reads the value of
scan_end stored in memory. I will play with it more later and make a simple
testcase for this.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #24 from hubicka at gcc dot gnu dot org 2008-02-08 15:11 ---
Hi,
the tonight runs with the continue heuristics show again improvements on 64bit
scores, but degradation on 32bit scores. Looking into the loop, the real
trouble seems to be that the main loop has 6 loop-carried variables: scan_end,
scan_end1, best_len, scan, chain_length, cur_match; a few temporaries are
needed too. Obviously we can't fit into registers on i386. Making the profile
more realistic sometimes helps and sometimes hurts, pretty much at random.

One case where I think register pressure is increased is the fact that the
different SSA names of both the scan_end and scan_end1 variables are actually
not fully coalesced in out-of-SSA. This is a result of optimizing:

        if (match[best_len]   != scan_end  ||
            match[best_len-1] != scan_end1 ||
            *match            != *scan     ||
            *++match          != scan[1])  continue;

...the later code sometimes modifies scan_end, turning the computation of
match[best_len] into the SSA name of scan_end that is assigned in the later
code on the path not modifying scan_end. As a result we have two scan_ends
live at once. I wonder if we can avoid this behaviour; though it looks all
right in SSA form, avoiding it would save 2 global registers: there is no need
at all to cache match[best_len]/match[best_len-1] in a register, unless I
missed something. Those two vars are manipulated on the hot paths through the
loop.

Now the RA is driven by frequencies (a bit confused by the fact that two of
the loop-carried vars are split) and by live ranges, which are actually the
number of instructions between the first and last occurrence. Since we are a
bit careless about BB ordering, moving some code to the very end of the
function, this heuristic is not realistic at all. It would probably make more
sense to replace it by the number of insns the variable is live across, but
this is probably ninsns*npseudos to compute. Another idea would be the degree
in the conflict graph, but I am not sure we want to start such experiments in
parallel with YARA.

I tested YARA and it does not handle this situation much better. Perhaps
Vladimir can help?

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #23 from hubicka at gcc dot gnu dot org 2008-02-07 12:30 ---
Created an attachment (id=15115)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15115&action=view)
Annotated profile

I am attaching a dump with the profile read in. It shows the hot spots in
longest_match at least.

(this is the first conditional of the continue guard)

  # BLOCK 27 freq:1 count:1346119696
  # PRED: 6 [100.0%] count:112241556 (fallthru) 25 [99.5%] count:1233878140 (true,exec)
  # scan_end_13 = PHI <scan_end_106(6), scan_end_14(25)>
  # scan_end1_11 = PHI <scan_end1_93(6), scan_end1_12(25)>
  # best_len_8 = PHI <best_len_25(6), best_len_9(25)>
  # scan_3 = PHI <scan_24(6), scan_6(25)>
  # chain_length_2 = PHI <chain_length_108(6), chain_length_105(25)>
  # cur_match_1 = PHI <cur_match_109(6), cur_match_104(25)>
  match_40 = window + cur_match_1;
  best_len.31_41 = (unsigned int) best_len_8;
  D.2379_42 = match_40 + best_len.31_41;
  D.2380_43 = *D.2379_42;
  if (D.2380_43 != scan_end_13)
    goto <bb 10>;
  else
    goto <bb 7>;
  # SUCC: 10 [0.1%] count:33977 (true,exec) 11 [99.9%] count:48979565 (false,exec)

  # BLOCK 10 freq:9636 count:1297140131
  # PRED: 27 [87.5%] count:1177665163 (true,exec) 7 [55.2%] count:93018627 (true,exec) 8 [35.0%] count:26422364 (true,exec) 9 [0.1%] count:33977 (true,exec)
  goto <bb 24>;

(this is the continue statement)

  D.2391_102 = cur_match_1 & 32767;
  D.2392_103 = prev[D.2391_102];
  cur_match_104 = (IPos) D.2392_103;
  if (limit_15 >= cur_match_104)
    goto <bb 26>;
  else
    goto <bb 25>;
  # SUCC: 26 [7.7%] count:104056913 (true,exec) 25 [92.3%] count:1240391903 (false,exec)

  # BLOCK 25 freq:9215 count:1240391903
  # PRED: 24 [92.3%] count:1240391903 (false,exec)
  chain_length_105 = chain_length_2 + 0x0;
  if (chain_length_105 != 0)
    goto <bb 27>;
  else
    goto <bb 26>;

(this is the end of the outer loop)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #17 from hubicka at gcc dot gnu dot org 2008-02-06 13:28 ---
One problem is the following:

    do {
        ;
        match = window + cur_match;
        if (match[best_len]   != scan_end  ||
            match[best_len-1] != scan_end1 ||
            *match            != *scan     ||
            *++match          != scan[1])  continue;
        scan += 2, match++;
        do {
        } while (*++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 scan < strend);

The internal loop is the string comparison thingy, while the branch prediction
logic completely misses it: the continue statement looks like it is forming 4
nested loops, so it concludes that this is the internal loop. We used to have
a prediction heuristic guessing that a continue statement is not used to form
a loop. This was killed when gimplification was introduced. Perhaps we should
bring it back, since this is a reasonably common scenario.

Looking at longest_match in the non-unrolled version, the loops formed by the
continue statement have frequencies 298, 961, 2139, 3100, 6900, 1000, so every
loop is predicted to iterate about twice. The outer real loop now gets
frequency 92, i.e. small enough to be predicted as cold. The string comparison
loop now gets frequency 344, predicted to iterate 3 times (quite
realistically). But because the frequency is so small we end up allocating one
of the two pointers in memory:

.L9:
        leal    1(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  1(%ecx), %eax
        cmpb    1(%edx), %al
        jne     .L8
        leal    2(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  2(%ecx), %eax
        cmpb    2(%edx), %al
        jne     .L8
        leal    3(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  3(%ecx), %eax
        cmpb    3(%edx), %al
        jne     .L8
        leal    4(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  4(%ecx), %eax
        cmpb    4(%edx), %al
        jne     .L8
        leal    5(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  5(%ecx), %eax
        cmpb    5(%edx), %al
        jne     .L8

This happens in the offline copy of longest_match. The inlined copy gets this
detail right, but the frequencies of the deflate functions are all crazy,
naturally.

I guess I should revive the patch for language-scope branch predictors.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #18 from hubicka at gcc dot gnu dot org 2008-02-06 16:44 ---
Created an attachment (id=15107)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15107&action=view)
Patch to predict_paths_leading_to

Hi,
I've revived the continue heuristic patch. By itself it does not help because
of a bug in predict_paths_leading_to. The code looks as follows:

      if (test1) goto continue_block;
      if (test2) goto continue_block;
      if (test3) goto continue_block;
      if (test4) goto continue_block;
      goto real_loop_body;
    continue_block:
      goto loop_header;

We call predict_paths_leading_to on the continue_block and expect that the
continue_block will not be very likely. What the function does is find the
dominator of continue_block, which is the if(test1) block, and predict the
edge from that first block. This is however not quite enough, as all the other
paths remain likely. It seems to me that we need to walk the whole set of BBs
postdominated by the BB and mark all edges forming the edge cut defined by
this set.

I am testing the attached patch. It makes the function linear (so we are
overall quadratic) for a very deep postdominator tree. If this turns out to be
a problem, I think we can just cut the computation after some specified number
of BBs has been walked. Zdenek, does this seem sane?

With this change and the continue prediction patch I get a sort of sane
prediction for the longest_match function. The profile is still quite
unrealistic, but I am testing whether it makes a noticeable difference.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #19 from hubicka at gcc dot gnu dot org 2008-02-06 16:56 ---
Created an attachment (id=15108)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15108&action=view)
Complete continue heuristic patch

Hi,
this is the complete patch. With this patch we produce a profile sane enough
that the internal loops are not marked cold. I will benchmark it probably
tomorrow (I want to wait for the FP changes to show separately). It fixes the
offline copy of longest_match, so we no longer have one of the IV variables on
the stack:

.L15:
        movzbl  2(%edx), %eax
        leal    2(%edx), %esi
        cmpb    2(%ecx), %al
        jne     .L8
        movzbl  3(%edx), %eax
        leal    3(%edx), %esi
        cmpb    3(%ecx), %al
        jne     .L8
        movzbl  4(%edx), %eax
        leal    4(%edx), %esi
        cmpb    4(%ecx), %al
        jne     .L8
        movzbl  5(%edx), %eax
        leal    5(%edx), %esi
        cmpb    5(%ecx), %al
        jne     .L8
        movzbl  6(%edx), %eax
        leal    6(%edx), %esi
        cmpb    6(%ecx), %al
        jne     .L8
        movzbl  7(%edx), %eax
        leal    7(%edx), %esi
        cmpb    7(%ecx), %al
        jne     .L8
        leal    8(%ecx), %eax
        movl    %eax, %ecx
        movzbl  8(%edx), %eax
        cmpb    (%ecx), %al
        leal    8(%edx), %ebx
        movl    %ebx, %esi
        jne     .L8
        cmpl    %ebx, -20(%ebp)
        jbe     .L8
        movl    %ebx, %edx
        movzbl  1(%edx), %eax
        leal    1(%edx), %esi
        cmpb    1(%ecx), %al
        je      .L15

Ironically this can further widen the gap between -O2 and -O3, since the
inlined copy in deflate was always allocated reasonably. Deflate codegen
changes quite a lot, and because the function body is big I will wait for
benchmarks before trying to analyze further.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #20 from ubizjak at gmail dot com 2008-02-06 18:42 ---
Whoa, adding -fomit-frame-pointer brings us from (gcc -O3 -m32)

user    0m41.031s

to (gcc -O3 -m32 -fomit-frame-pointer)

user    0m30.006s

Since -fomit-frame-pointer adds another free reg, it looks like, since
inlining increases register pressure, some unlucky heavily-used variable gets
allocated to a stack slot.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #21 from ubizjak at gmail dot com 2008-02-06 19:10 ---
(In reply to comment #20)
> Since -fomit-frame-pointer adds another free reg, it looks like, since
> inlining increases register pressure, some unlucky heavily-used variable
> gets allocated to a stack slot.

It is best_len (and probably some others, too):

[EMAIL PROTECTED] gzip-1.2.4]$ grep best_len fp.s
        movl    %edx, -68(%ebp)   #, best_len
        movl    -68(%ebp), %edx   # best_len, best_len.494
        movl    %edx, -68(%ebp)   # best_len.494, best_len
        movl    -68(%ebp), %edx   # best_len,
        movl    -68(%ebp), %edx   # best_len,
        movl    -68(%ebp), %edx   # best_len, best_len.494
        cmpl    %esi, %edx        # lookahead, best_len.494
        movl    %edx, -108(%ebp)  # best_len.494, match_length
        movl    -68(%ebp), %edx   # best_len, best_len.494
        movl    %edx, -88(%ebp)   # prev_length.28, best_len
        movl    -88(%ebp), %edx   # best_len, best_len.457
        movl    %edx, -88(%ebp)   # best_len.457, best_len
        movl    -88(%ebp), %eax   # best_len,
        movl    -88(%ebp), %edx   # best_len,
        movl    -88(%ebp), %edx   # best_len, best_len.457
        cmpl    %esi, %edx        # lookahead, best_len.457
        movl    %edx, -40(%ebp)   # best_len.457, match_length.404
        movl    -88(%ebp), %edx   # best_len, best_len.457
        leal    (%ecx,%eax), %edx #, best_len.457
        cmpl    %edx, -88(%ebp)   # best_len.457, best_len
        cmpl    -96(%ebp), %edx   # nice_match.34, best_len.457
        leal    (%ecx,%eax), %edx #, best_len.494
        cmpl    %edx, -68(%ebp)   # best_len.494, best_len
        cmpl    -76(%ebp), %edx   # nice_match.34, best_len.494

[EMAIL PROTECTED] gzip-1.2.4]$ grep best_len no-fp.s
        movl    %edx, 76(%esp)    #, best_len
        movl    76(%esp), %edx    # best_len,
        movl    76(%esp), %edx    # best_len, best_len.494
        movl    %edx, 76(%esp)    # best_len.494, best_len
        movl    76(%esp), %eax    # best_len,
        movl    76(%esp), %edx    # best_len, best_len.494
        movl    %edx, %ebp        # best_len.494, match_length
        movl    76(%esp), %edx    # best_len, best_len.494
        movl    %edx, %ebp        # prev_length.28, best_len
        movl    %ebp, %edx        # best_len, best_len.457
        movl    %edx, %ebp        # best_len.457, best_len
        movl    %ebp, %edx        # best_len, best_len.457
        cmpl    %esi, %edx        # lookahead, best_len.457
        movl    %ebp, %edx        # best_len, best_len.457
        leal    (%ecx,%eax), %edx #, best_len.494
        cmpl    %edx, 76(%esp)    # best_len.494, best_len
        cmpl    68(%esp), %edx    # nice_match.34, best_len.494
        leal    (%ecx,%eax), %edx #, best_len.457
        cmpl    %edx, %ebp        # best_len.457, best_len
        cmpl    52(%esp), %edx    # nice_match.34, best_len.457

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #22 from hubicka at gcc dot gnu dot org 2008-02-06 19:22 ---
Yes, there are a number of unlucky variables. However, the real source here
seems to be the always-wrong profile guiding regalloc to optimize for the cold
portions of the function, rather than a real increase of register pressure due
to inlining. In general, the inlining operation itself only decreases register
pressure: you don't fix function parameters/return value to particular
registers, and you know precisely which registers survive the body, so you
don't need to save caller-saved registers when not needed. The losses from
inlining with our regalloc are partly due to callee-saved registers sometimes
being more effective, sort of imitating live range splitting. Increased
register pressure is an effect of propagating from the function body to the
rest of the program, but it is not that bad either: at least all the inlining
heuristic/RA bugs turned out to be something else.

The high speedup from the forwprop patch in 64bit mode (and slowdown in 32bit)
is actually also register allocation related: the internal loop consisting of
a sequence of ++ operations ends up with extra copy instructions without the
forwprop patch, while with the patch we produce a normal induction variable.
On 32bit, however, this results in regalloc putting the variable on the stack
because its liverange heuristic then gives it lower priority.

For 32bit data, the britten 32-bit SPEC tester peaked at 760, while we now get
620 on peak with -fomit-frame-pointer. A 20% regression on a rather simple,
commonly used codebase definitely makes us look stupid; more so given that ICC
7.x did 820 on the same machine. The 64bit tester is at 830 versus 740,
approximately.

Honza

--
hubicka at gcc dot gnu dot org changed:
        CC: added hubicka at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #15 from hubicka at gcc dot gnu dot org 2008-02-05 13:36 ---
Thanks, this looks comparable to the K8 scores, except that -O3 is actually
not that much worse there. So it looks like there is more than just a random
effect of code layout involved; I will try to look into the produced assembly
more.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #13 from hubicka at gcc dot gnu dot org 2008-02-03 13:39 ---
The tonight runs on haydn with the patch in show a regression on gzip:
950->901 in 32bit. FDO 64bit runs are not affected. This is the same score as
we had in December; we have improved a bit since then, but not enough to match
the score we used to have. It looks like codegen of the string compare loop is
very unstable here. Uros, would it be possible to give it a try on Core? That
would help to figure out whether it is a code layout problem of K8.

Honza

--
hubicka at gcc dot gnu dot org changed:
        Last reconfirmed: 2007-12-10 10:14:39 -> 2008-02-03 13:39:42

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #14 from ubizjak at gmail dot com 2008-02-03 17:35 ---
(In reply to comment #13)
> Uros, would it be possible to give it a try on Core? That would help to
> figure out whether it is a code layout problem of K8.

Hm, the patch doesn't seem to help:

-m32 -O2:           32.434
-m32 -O2 (patched): 32.586
-m32 -O3:           40.723
-m32 -O3 (patched): 41.059

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #12 from hubicka at gcc dot gnu dot org 2008-02-02 16:22 ---
Created an attachment (id=15079)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15079&action=view)
address accumulation patch

While working on PR17863 I wrote the attached patch to make fwprop combine
code like:

  a = base;
  *a = something; a++;
  *a = something; a++;
  *a = something;
  ...

into:

  *base = something;
  a = base + 1;
  *a = something;
  a = base + 2;
  *a = something;

I dropped it to vangelis and the nightly tester shows a gzip improvement from
815 to 880. The gzip internal loop is hand-unrolled into a similar form as
shown above. (The tester peaked in Jul 2005 with scores somewhat above 900.)
Since gzip results tend to be unstable, it would be nice to know how this
reproduces on other targets/setups.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #11 from hubicka at gcc dot gnu dot org 2008-01-16 16:46 ---
Last time I looked into it, it was code alignment affected by inlining in the
string matching loop (longest_match). This code is very atypical, since the
internal loop comparing strings is hand-unrolled but almost never rolls, as
the compressed strings tend to all be different. GCC mispredicts this, moving
some stuff out of the loop, and bb-reorder aligns the code in a way that the
default path not entering the loop jumps pretty far, hurting decode bandwidth
of K8, especially because the jumps are hard to predict.

I don't see anything direct in the code that heuristics could use to realize
that the loop is not rolling, except for special-casing the particular
benchmark. FDO scores of gzip are not doing that badly, but there is still a
gap relative to ICC (even an archaic version of it running 32bit, compared to
64bit GCC):
http://www.suse.de/~gcctest/SPEC-britten/CINT/sandbox-britten-FDO/index.html

It would be nice to convince the gzip/zlibc/bzip2 people to use profiling by
default in the build process - those packages are ideal targets. But since
Core is not as sensitive to code alignment and number of jumps as K8, perhaps
there are extra problems demonstrated by this.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #3 from rguenth at gcc dot gnu dot org 2007-12-10 10:52 ---
I don't think this qualifies as a 4.3 regression -
http://www.suse.de/~gcctest/SPEC/CINT/sb-haydn-head-64-32o-32bit/index.html
shows that while there were jumps, the numbers close to the 4.2 release are
actually quite similar to what we have now. So, unless somebody produces
numbers with 4.2 or earlier, this is not a 'regression', but a
missed-optimization only.

--
rguenth at gcc dot gnu dot org changed:
        CC: added rguenth at gcc dot gnu dot org
        Component: target -> tree-optimization
        Keywords: added missed-optimization
        Summary: "[4.3 regression] non-optimal inlining heuristics pessimizes
                  gzip SPEC score at -O3" -> "non-optimal inlining heuristics
                  pessimizes gzip SPEC score at -O3"

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #4 from ubizjak at gmail dot com 2007-12-10 12:31 ---
(In reply to comment #3)
> I don't think this qualifies as a 4.3 regression -

Fair enough. It looks like this problem is specific to Core2.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #5 from ubizjak at gmail dot com 2007-12-10 17:12 ---
(In reply to comment #4)
> Fair enough. It looks like this problem is specific to Core2.

Here are timings with 'gcc version 4.3.0 20071201 (experimental) [trunk
revision 130554] (GCC)' on:

vendor_id  : GenuineIntel
cpu family : 6
model      : 15
model name : Intel(R) Core(TM)2 CPU X6800 @ 2.93GHz
stepping   : 5
cpu MHz    : 2933.389
cache size : 4096 KB

-mtune=generic -m32 -O3: 40.763s [*]
-mtune=generic -m32 -O2: 32.170s
-mtune=core2   -m32 -O3: 36.850s
-mtune=core2   -m32 -O2: 32.170s
-mtune=generic -m64 -O3: 28.550s
-mtune=generic -m64 -O2: 28.682s
-mtune=core2   -m64 -O3: 28.670s
-mtune=core2   -m64 -O2: 28.714s

With __attribute__((noinline)) on longest_match():

-mtune=generic -m32 -O3: 30.658s
-mtune=generic -m32 -O2: 32.154s
-mtune=core2   -m32 -O3: 30.690s
-mtune=core2   -m32 -O2: 32.247s

And with the FC6 system compiler 'gcc version 4.1.1 20061011 (Red Hat
4.1.1-30)':

-mtune=generic -m32 -O3: 30.154s [**]
-mtune=generic -m32 -O2: 30.275s

Comparing [*] to [**], it _is_ a regression, at least on Core2.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #6 from rguenther at suse dot de 2007-12-10 17:13 ---
Subject: Re: non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

On Mon, 10 Dec 2007, ubizjak at gmail dot com wrote:
> And with the FC6 system compiler 'gcc version 4.1.1 20061011 (Red Hat
> 4.1.1-30)':
>
> -mtune=generic -m32 -O3: 30.154s [**]
> -mtune=generic -m32 -O2: 30.275s
>
> Comparing [*] to [**], it _is_ a regression, at least on Core2.

FSF GCC 4.1 does not have -mtune=generic.

Richard.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761
--- Comment #7 from ubizjak at gmail dot com 2007-12-10 17:26 ---
(In reply to comment #6)
> FSF GCC 4.1 does not have -mtune=generic.

OK, OK. Now with 'gcc version 4.1.3 20070716 (prerelease)':

-m32 -O2: 29.306s
-m32 -O3: 29.582s

I don't have 4.2 here.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761