[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-09-06 Thread hubicka at gcc dot gnu dot org


--- Comment #26 from hubicka at gcc dot gnu dot org  2008-09-06 12:00 
---
IRA seems to nicely fix the remaining problem with the spill in the internal
loop on 32bit, so we now produce good gzip scores compared to older GCC versions.
http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-32bit/164_gzip_big.png
and with profile feedback
http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-FDO/164_gzip_big.png
we get close to ICC scores.

We now output the comparison loop as:
.L98:
movzbl  1(%eax), %edx   #,
leal    1(%eax), %edi   #, scan
cmpb    1(%ecx), %dl    #,
jne     .L161   #,
movzbl  2(%eax), %edx   #,
leal    2(%eax), %edi   #, scan
cmpb    2(%ecx), %dl    #,
jne     .L161   #,
movzbl  3(%eax), %edx   #,
leal    3(%eax), %edi   #, scan
cmpb    3(%ecx), %dl    #,
jne     .L161   #,
movzbl  4(%eax), %edx   #,
leal    4(%eax), %edi   #, scan
cmpb    4(%ecx), %dl    #,
jne     .L161   #,
movzbl  5(%eax), %edx   #,
leal    5(%eax), %edi   #, scan
cmpb    5(%ecx), %dl    #,
jne     .L161   #,
movzbl  6(%eax), %edx   #,
leal    6(%eax), %edi   #, scan
cmpb    6(%ecx), %dl    #,
jne     .L161   #,
movzbl  7(%eax), %edx   #,
leal    7(%eax), %edi   #, scan
cmpb    7(%ecx), %dl    #,
jne     .L161   #,

There is still room for improvement, however.

The remaining problem is that we still miss coalescing of scan_end and scan_end1
(so -fno-tree-dominator-opts -fno-tree-copyrename still helps).

Vladimir, perhaps this can be solved in IRA too?

Honza


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-09-06 Thread hubicka at gcc dot gnu dot org


--- Comment #27 from hubicka at gcc dot gnu dot org  2008-09-06 12:02 
---
Also just noticed that the offline copy of longest_match gets an extra move:
.L15:
movzbl  2(%eax), %edi   #, tmp87
leal    2(%eax), %ecx   #, scan.158
movl    %edi, %edx      # tmp87,
cmpb    2(%ebx), %dl    #,
jne     .L6     #,
movzbl  3(%eax), %edi   #, tmp88
leal    3(%eax), %ecx   #, scan.158
movl    %edi, %edx      # tmp88,
cmpb    3(%ebx), %dl    #,
jne     .L6     #,
movzbl  4(%eax), %edi   #, tmp89
leal    4(%eax), %ecx   #, scan.158
movl    %edi, %edx      # tmp89,
cmpb    4(%ebx), %dl    #,
jne     .L6     #,
movzbl  5(%eax), %edi   #, tmp90
leal    5(%eax), %ecx   #, scan.158
movl    %edi, %edx      # tmp90,
cmpb    5(%ebx), %dl    #,
jne     .L6     #,

while inlined copy is fine:
.L98:
movzbl  1(%eax), %edx   #,
leal    1(%eax), %edi   #, scan
cmpb    1(%ecx), %dl    #,
jne     .L161   #,
movzbl  2(%eax), %edx   #,
leal    2(%eax), %edi   #, scan
cmpb    2(%ecx), %dl    #,
jne     .L161   #,
movzbl  3(%eax), %edx   #,
leal    3(%eax), %edi   #, scan
cmpb    3(%ecx), %dl    #,
jne     .L161   #,
movzbl  4(%eax), %edx   #,
leal    4(%eax), %edi   #, scan
cmpb    4(%ecx), %dl    #,
jne     .L161   #,
interesting :)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-08 Thread hubicka at gcc dot gnu dot org


--- Comment #25 from hubicka at gcc dot gnu dot org  2008-02-08 15:39 
---
-fno-tree-dominator-opts -fno-tree-copyrename solves the coalescing problem
(the name is introduced by the second pass, the actual problematic pattern by
the first), saving roughly 1s at -O2 and 2s at -O3; -O3 is still worse, however.
The internal loop no longer spills, it just reads the value of scan_end stored
in memory.

I will play with it more later and make simple testcase for this.
Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-08 Thread hubicka at gcc dot gnu dot org


--- Comment #24 from hubicka at gcc dot gnu dot org  2008-02-08 15:11 
---
Hi,
last night's runs with the continue heuristics again show improvements on 64bit
scores, but degradation on 32bit scores.  Looking into the loop, the real
trouble seems to be that the main loop has 6 loop-carried variables:

scan_end, scan_end1, best_len, scan, chain_length, cur_match

plus a few temporaries are needed too.  Obviously we can't fit those in
registers on i386.  Making the profile more realistic sometimes helps and
sometimes hurts, pretty much at random.

One case where I think register pressure is increased is the fact that different
SSA names of both scan_end and scan_end1 variables are actually not fully
coalesced in out-of-SSA.  This is the result of optimizing:

if (match[best_len] != scan_end ||
match[best_len-1] != scan_end1 ||
*match != *scan ||
*++match != scan[1]) continue;
   ...later code sometimes modifying scan_end

into computing match[best_len] into an SSA name of scan_end that is sometimes
assigned in the later code on the path not modifying scan_end.  As a result we
have two scan_ends live at once.  I wonder if we can avoid this behaviour:
though it looks all right in SSA form, avoiding it would save 2 global
registers, since there is no need at all to cache
match[best_len]/match[best_len-1] in a register unless I missed something.
Those two vars are manipulated on the hot paths through the loop.

Now the RA is driven by frequencies (a bit confused by the fact that two of the
loop-carried vars are split) and by their live ranges, which are actually the
number of instructions in between the first and last occurrence.  Since we are
a bit careless about BB ordering, moving some code to the very end of the
function, this heuristic is not realistic at all.  It would probably make more
sense to replace it by the number of insns the pseudo is live across, but this
is probably ninsn*npseudos to compute.  Another idea would be the degree in the
conflict graph, but I am not sure we want to start such experiments in parallel
with YARA.
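
To make the difference between the two live range measures concrete, here is a
minimal C sketch.  It is not GCC code: the arrays `mentions` and `live` and
both function names are hypothetical, and they only model a single pseudo in a
linear instruction stream.

```c
#include <stdbool.h>
#include <stddef.h>

/* Span measure: number of instructions between the first and the last
   instruction that mentions the pseudo, in layout order.  Returns 0 if
   the pseudo is mentioned at most once. */
size_t span_length(const bool *mentions, size_t ninsn)
{
    size_t first = ninsn, last = 0;
    for (size_t i = 0; i < ninsn; i++) {
        if (mentions[i]) {
            if (first == ninsn)
                first = i;
            last = i;
        }
    }
    return first == ninsn ? 0 : last - first;
}

/* Alternative measure: number of instructions the pseudo is live across.
   live[i] is assumed to come from a separate liveness analysis. */
size_t live_across(const bool *live, size_t ninsn)
{
    size_t count = 0;
    for (size_t i = 0; i < ninsn; i++)
        if (live[i])
            count++;
    return count;
}
```

For a pseudo mentioned at instructions 1 and 2 and once more in a cold block
laid out at instruction 9, span_length reports 8, while live_across can be much
smaller if the pseudo is dead in the instructions placed in between.  Doing the
second computation for every pseudo is where the ninsn*npseudos cost comes from.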

I tested YARA and it does not handle this situation much better. Perhaps
Vladimir can help?

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-07 Thread hubicka at gcc dot gnu dot org


--- Comment #23 from hubicka at gcc dot gnu dot org  2008-02-07 12:30 
---
Created an attachment (id=15115)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15115&action=view)
Annotated profile

I am attaching dump with profile read in.  It shows the hot spots in
longest_match at least:

(this is first conditional of the continue guard)
  # BLOCK 27 freq:1 count:1346119696
  # PRED: 6 [100.0%]  count:112241556 (fallthru) 25 [99.5%]  count:1233878140
(true,exec)
  # scan_end_13 = PHI <scan_end_106(6), scan_end_14(25)>
  # scan_end1_11 = PHI <scan_end1_93(6), scan_end1_12(25)>
  # best_len_8 = PHI <best_len_25(6), best_len_9(25)>
  # scan_3 = PHI <scan_24(6), scan_6(25)>
  # chain_length_2 = PHI <chain_length_108(6), chain_length_105(25)>
  # cur_match_1 = PHI <cur_match_109(6), cur_match_104(25)>
  match_40 = window + cur_match_1;
  best_len.31_41 = (unsigned int) best_len_8;
  D.2379_42 = match_40 + best_len.31_41;
  D.2380_43 = *D.2379_42;
  if (D.2380_43 != scan_end_13)
    goto <bb 10>;
  else
    goto <bb 7>;

  # SUCC: 10 [0.1%]  count:33977 (true,exec) 11 [99.9%]  count:48979565
(false,exec)

  # BLOCK 10 freq:9636 count:1297140131
  # PRED: 27 [87.5%]  count:1177665163 (true,exec) 7 [55.2%]  count:93018627
(true,exec) 8 [35.0%]  count:26422364 (true,exec) 9 [0.1%]  count:33977
(true,exec)
  goto <bb 24>;

(this is the continue statement)

  D.2391_102 = cur_match_1 & 32767;
  D.2392_103 = prev[D.2391_102];
  cur_match_104 = (IPos) D.2392_103;
  if (limit_15 >= cur_match_104)
    goto <bb 26>;
  else
    goto <bb 25>;
  # SUCC: 26 [7.7%]  count:104056913 (true,exec) 25 [92.3%]  count:1240391903
(false,exec)

  # BLOCK 25 freq:9215 count:1240391903
  # PRED: 24 [92.3%]  count:1240391903 (false,exec)
  chain_length_105 = chain_length_2 + 0x0;
  if (chain_length_105 != 0)
    goto <bb 27>;
  else
    goto <bb 26>;


(this is end of outer loop)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread hubicka at gcc dot gnu dot org


--- Comment #17 from hubicka at gcc dot gnu dot org  2008-02-06 13:28 
---
One problem is the following:
  do {
      ;
      match = window + cur_match;
      if (match[best_len] != scan_end ||
          match[best_len-1] != scan_end1 ||
          *match != *scan ||
          *++match != scan[1]) continue;
      scan += 2, match++;
      do {
      } while (*++scan == *++match && *++scan == *++match &&
               *++scan == *++match && *++scan == *++match &&
               *++scan == *++match && *++scan == *++match &&
               *++scan == *++match && *++scan == *++match &&
               scan < strend);



The internal loop is the string comparison, while the branch prediction logic
completely misses it: the continue statement looks like it is forming 4 nested
loops, so the predictor concludes that this is the internal loop.
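
For experimenting with the string comparison loop in isolation, it can be
pulled out into a small standalone function.  This is only a sketch: the name
match_run and its return value are assumptions made for illustration, and like
gzip's window it assumes both buffers stay readable up to 8 bytes past strend.

```c
#include <stddef.h>

/* Hand-unrolled comparison in the style of gzip's longest_match inner loop.
   Compares scan[1], scan[2], ... with match[1], match[2], ... and returns the
   offset from the start of scan at which the loop stopped: either the first
   mismatch or the first multiple of 8 that is not below strend. */
size_t match_run(const unsigned char *scan, const unsigned char *match,
                 const unsigned char *strend)
{
    const unsigned char *start = scan;

    do {
    } while (*++scan == *++match && *++scan == *++match &&
             *++scan == *++match && *++scan == *++match &&
             *++scan == *++match && *++scan == *++match &&
             *++scan == *++match && *++scan == *++match &&
             scan < strend);

    return (size_t)(scan - start);
}
```

The loop body is empty on purpose: all the work happens in the condition, and
the only real back edge is the one taken after eight successful comparisons.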

We used to have a prediction heuristic guessing that a continue statement is not
used to form a loop.  This was killed when gimplification was introduced.
Perhaps we should bring it back, since this is a reasonably common scenario.

Looking at longest_match in the not unrolled version, the loops formed by the
continue statement have frequencies: 298, 961, 2139, 3100, 6900, 1000
so every loop is predicted to iterate about twice.

The outer real loop now gets frequency 92, i.e. small enough to be predicted as
cold.  The string comparison loop now gets frequency 344, predicted to iterate 3
times (quite realistically).  But because the frequency is so small we end up
allocating one of the two pointers in memory:
.L9:
leal    1(%ecx), %eax
movl    %eax, -16(%ebp)
movzbl  1(%ecx), %eax
cmpb    1(%edx), %al
jne     .L8
leal    2(%ecx), %eax
movl    %eax, -16(%ebp)
movzbl  2(%ecx), %eax
cmpb    2(%edx), %al
jne     .L8
leal    3(%ecx), %eax
movl    %eax, -16(%ebp)
movzbl  3(%ecx), %eax
cmpb    3(%edx), %al
jne     .L8
leal    4(%ecx), %eax
movl    %eax, -16(%ebp)
movzbl  4(%ecx), %eax
cmpb    4(%edx), %al
jne     .L8
leal    5(%ecx), %eax
movl    %eax, -16(%ebp)
movzbl  5(%ecx), %eax
cmpb    5(%edx), %al
jne     .L8

This happens in the offline copy of longest_match.  The inlined copy gets this
detail right, but frequencies of the deflate functions are all crazy, naturally.

I guess I should revive the patch for language scope branch predictors.
Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread hubicka at gcc dot gnu dot org


--- Comment #18 from hubicka at gcc dot gnu dot org  2008-02-06 16:44 
---
Created an attachment (id=15107)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15107&action=view)
Patch to predict_paths_leading_to

Hi,
I've revived the continue heuristic patch.  By itself it does not help because
of a bug in predict_paths_leading_to.

The code looks as follows:

if (test1)
   goto continue_block;
if (test2)
   goto continue_block;
if (test3)
   goto continue_block;
if (test4)
   goto continue_block;
goto real_loop_body;
continue_block:
   goto loop_header;

We call predict_paths_leading_to on the continue_block and expect that the
continue_block will not be very likely.

What the function does is find the dominator of continue_block, which is the
if(test1) block, and predict the edge from that first block.  This is however
not quite enough, as all the other paths remain likely.

It seems to me that we need to walk the whole set of BBs postdominated by the
BB and mark all edges forming edge cut defined by this set.

I am testing the attached patch.  It makes the function linear (so we are
overall quadratic) for a very deep postdominator tree.  If this turns out to be
a problem, I think we can just cut the computation after some specified number
of BBs is walked.
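
The edge cut idea can be sketched on the small CFG above.  This is an
illustration only, not the attached patch: the block numbering, the `struct
edge` type, the `ipdom` array (immediate postdominator per block, -1 for the
exit) and both function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

struct edge { int src; int dst; };

/* True if `target` postdominates block `b` (a block postdominates itself).
   ipdom[i] is the immediate postdominator of block i, or -1 for the exit. */
bool postdominated_by(const int *ipdom, int b, int target)
{
    while (b != -1) {
        if (b == target)
            return true;
        b = ipdom[b];
    }
    return false;
}

/* Mark every edge that enters the set of blocks postdominated by `target`
   from a block outside that set.  These edges form the cut that has to be
   predicted as unlikely.  Returns the number of marked edges. */
size_t mark_edge_cut(const struct edge *edges, size_t nedges,
                     const int *ipdom, int target, bool *marked)
{
    size_t count = 0;
    for (size_t i = 0; i < nedges; i++) {
        bool src_in = postdominated_by(ipdom, edges[i].src, target);
        bool dst_in = postdominated_by(ipdom, edges[i].dst, target);
        marked[i] = dst_in && !src_in;
        if (marked[i])
            count++;
    }
    return count;
}
```

With blocks 0..3 for test1..test4, 4 for real_loop_body, 5 for continue_block
and 6 for the common successor, only block 5 is postdominated by
continue_block, so all four edges from the tests into block 5 get marked,
rather than just the one leaving the if(test1) block.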

Zdenek, does this seem sane?

With this change and continue prediction patch I get sort of sane prediction
for longest_match function.  Profile is still quite unrealistic, but I am
testing if it makes noticeable difference.

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread hubicka at gcc dot gnu dot org


--- Comment #19 from hubicka at gcc dot gnu dot org  2008-02-06 16:56 
---
Created an attachment (id=15108)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15108&action=view)
Complete continue heuristic patch

Hi,
this is the complete patch.  With this patch we produce a profile sane enough
that the internal loops are not marked cold.  I will benchmark it probably
tomorrow (I want to wait for the FP changes to show separately).

It fixes the offline copy of longest_match, so we no longer have one of the IV
variables on the stack:
.L15:
movzbl  2(%edx), %eax
leal    2(%edx), %esi
cmpb    2(%ecx), %al
jne     .L8
movzbl  3(%edx), %eax
leal    3(%edx), %esi
cmpb    3(%ecx), %al
jne     .L8
movzbl  4(%edx), %eax
leal    4(%edx), %esi
cmpb    4(%ecx), %al
jne     .L8
movzbl  5(%edx), %eax
leal    5(%edx), %esi
cmpb    5(%ecx), %al
jne     .L8
movzbl  6(%edx), %eax
leal    6(%edx), %esi
cmpb    6(%ecx), %al
jne     .L8
movzbl  7(%edx), %eax
leal    7(%edx), %esi
cmpb    7(%ecx), %al
jne     .L8
leal    8(%ecx), %eax
movl    %eax, %ecx
movzbl  8(%edx), %eax
cmpb    (%ecx), %al
leal    8(%edx), %ebx
movl    %ebx, %esi
jne     .L8
cmpl    %ebx, -20(%ebp)
jbe     .L8
movl    %ebx, %edx
movzbl  1(%edx), %eax
leal    1(%edx), %esi
cmpb    1(%ecx), %al
je      .L15

Ironically this can further widen the gap between -O2 and -O3, since the
inline copy in deflate was always allocated reasonably.
Deflate codegen changes quite a lot, and because the function body is big I will
wait for benchmarks before trying to analyze further.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread ubizjak at gmail dot com


--- Comment #20 from ubizjak at gmail dot com  2008-02-06 18:42 ---
Whoa, adding -fomit-frame-pointer brings us from

(gcc -O3 -m32)
user    0m41.031s

to

(gcc -O3 -m32 -fomit-frame-pointer)
user    0m30.006s

Since -fo-f-p adds another free reg, it looks that since inlining increases
register pressure some unlucky heavy-used variable gets allocated to the stack
slot.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread ubizjak at gmail dot com


--- Comment #21 from ubizjak at gmail dot com  2008-02-06 19:10 ---
(In reply to comment #20)

> Since -fo-f-p adds another free reg, it looks that since inlining increases
> register pressure some unlucky heavy-used variable gets allocated to the stack
> slot.


It is best_len (and probably some others, too):

[gzip-1.2.4]$ grep best_len fp.s
movl    %edx, -68(%ebp) #, best_len
movl    -68(%ebp), %edx # best_len, best_len.494
movl    %edx, -68(%ebp) # best_len.494, best_len
movl    -68(%ebp), %edx # best_len,
movl    -68(%ebp), %edx # best_len,
movl    -68(%ebp), %edx # best_len, best_len.494
cmpl    %esi, %edx      # lookahead, best_len.494
movl    %edx, -108(%ebp) # best_len.494, match_length
movl    -68(%ebp), %edx # best_len, best_len.494
movl    %edx, -88(%ebp) # prev_length.28, best_len
movl    -88(%ebp), %edx # best_len, best_len.457
movl    %edx, -88(%ebp) # best_len.457, best_len
movl    -88(%ebp), %eax # best_len,
movl    -88(%ebp), %edx # best_len,
movl    -88(%ebp), %edx # best_len, best_len.457
cmpl    %esi, %edx      # lookahead, best_len.457
movl    %edx, -40(%ebp) # best_len.457, match_length.404
movl    -88(%ebp), %edx # best_len, best_len.457
leal    (%ecx,%eax), %edx       #, best_len.457
cmpl    %edx, -88(%ebp) # best_len.457, best_len
cmpl    -96(%ebp), %edx # nice_match.34, best_len.457
leal    (%ecx,%eax), %edx       #, best_len.494
cmpl    %edx, -68(%ebp) # best_len.494, best_len
cmpl    -76(%ebp), %edx # nice_match.34, best_len.494

[gzip-1.2.4]$ grep best_len no-fp.s
movl    %edx, 76(%esp)  #, best_len
movl    76(%esp), %edx  # best_len,
movl    76(%esp), %edx  # best_len, best_len.494
movl    %edx, 76(%esp)  # best_len.494, best_len
movl    76(%esp), %eax  # best_len,
movl    76(%esp), %edx  # best_len, best_len.494
movl    %edx, %ebp      # best_len.494, match_length
movl    76(%esp), %edx  # best_len, best_len.494
movl    %edx, %ebp      # prev_length.28, best_len
movl    %ebp, %edx      # best_len, best_len.457
movl    %edx, %ebp      # best_len.457, best_len
movl    %ebp, %edx      # best_len, best_len.457
cmpl    %esi, %edx      # lookahead, best_len.457
movl    %ebp, %edx      # best_len, best_len.457
leal    (%ecx,%eax), %edx       #, best_len.494
cmpl    %edx, 76(%esp)  # best_len.494, best_len
cmpl    68(%esp), %edx  # nice_match.34, best_len.494
leal    (%ecx,%eax), %edx       #, best_len.457
cmpl    %edx, %ebp      # best_len.457, best_len
cmpl    52(%esp), %edx  # nice_match.34, best_len.457


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-06 Thread hubicka at gcc dot gnu dot org


--- Comment #22 from hubicka at gcc dot gnu dot org  2008-02-06 19:22 
---
Yes, there are a number of unlucky variables.  However, the real source here
seems to be the always wrong profile guiding regalloc to optimize for cold
portions of the function, rather than a real increase of register pressure due
to inlining.

In general, the inlining operation itself only decreases register pressure: you
don't pin function parameters/return value to fixed registers and you know
precisely what registers survive the body, so you don't need to save caller
saved registers when not needed.

The losses from inlining with our regalloc are partly due to callee saved
registers sometimes being more effective, sort of imitating live range
splitting.  Increased register pressure is an effect of propagating from the
function body to the rest of the program, but it is not that bad either: at
least all the inlining heuristic/RA bugs turned out to be something else.

The high speedup by the forwprop patch in 64bit mode (and slowdown in 32bit) is
actually also register allocation related: the internal loop consisting of a
sequence of ++ operations ends up with extra copy instructions without the
forwprop patch, while with the patch we produce a normal induction variable.  On
32bit it however results in regalloc putting this variable on the stack because
its live range heuristics gives it lower priority then.

For 32bit data, the britten 32-bit SPEC tester peaked at 760, while we now get
620 at peak with -fomit-frame-pointer.  A 20% regression on a rather simple,
commonly used codebase definitely makes us look stupid, even more so given that
ICC 7.x did 820 on the same machine.  The 64bit tester is 830 versus 740
approximately.

Honza


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-05 Thread hubicka at gcc dot gnu dot org


--- Comment #15 from hubicka at gcc dot gnu dot org  2008-02-05 13:36 
---
Thanks, this looks comparable to the K8 scores, except that -O3 is not actually
that much worse there.  So it looks like there is more than just a random effect
of code layout involved; I will try to look into the produced assembly more.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-03 Thread hubicka at gcc dot gnu dot org


--- Comment #13 from hubicka at gcc dot gnu dot org  2008-02-03 13:39 
---
Tonight's runs on haydn with the patch in show a regression on gzip: 950->901
in 32bit.  FDO 64bit runs are not affected.

This is the same score as we had in December; we improved a bit since then, but
not enough to match the score we used to have.
Looks like codegen of the string compare loop is very unstable here.
Uros, would be possible to give it a try on Core?  That would help to figure
out if it is code layout problem of K8.

Honza


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2007-12-10 10:14:39         |2008-02-03 13:39:42
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-03 Thread ubizjak at gmail dot com


--- Comment #14 from ubizjak at gmail dot com  2008-02-03 17:35 ---
(In reply to comment #13)

> Uros, would be possible to give it a try on Core?  That would help to figure
> out if it is code layout problem of K8.

Hm, the patch doesn't seem to help:

-m32 -O2: 32.434
-m32 -O2 (patched): 32.586

-m32 -O3: 40.723
-m32 -O3 (patched): 41.059


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-02-02 Thread hubicka at gcc dot gnu dot org


--- Comment #12 from hubicka at gcc dot gnu dot org  2008-02-02 16:22 
---
Created an attachment (id=15079)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15079&action=view)
address accumulation patch

While working on PR17863 I wrote the attached patch to make fwprop combine
code like:

a=base;
*a=something;
a++;
*a=something;
a++;
*a=something;
...

into

*base=something;
a=base+1;
*a=something;
a=base+2;
*a=something;
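
Written out as compilable C, the two forms look like this.  This is only a
sketch of the source-level equivalence, not of what fwprop does on RTL; the
function names and the `int` element type are assumptions for illustration.

```c
/* Original form: the pointer is advanced between stores, so every address
   depends on the previous increment. */
void store_incremental(int *base, int something)
{
    int *a = base;
    *a = something;
    a++;
    *a = something;
    a++;
    *a = something;
}

/* Combined form: every address is computed directly from base, so each
   store can use a base+offset addressing mode without a dependency chain. */
void store_from_base(int *base, int something)
{
    int *a;
    *base = something;
    a = base + 1;
    *a = something;
    a = base + 2;
    *a = something;
}
```

Both functions write the same three elements; the second form is what allows
instructions such as `movzbl 2(%eax), %edx` with a constant displacement.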


I dropped it on vangelis and the nightly tester shows a gzip improvement
815->880.  The gzip internal loop is hand unrolled into a similar form as shown
above (the tester peaks in Jul 2005 with scores somewhat above 900).  Since gzip
results tend to be unstable, it would be nice to know how this reproduces on
other targets/setups.

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2008-01-16 Thread hubicka at gcc dot gnu dot org


--- Comment #11 from hubicka at gcc dot gnu dot org  2008-01-16 16:46 
---
Last time I looked into it, it was code alignment affected by inlining in the
string matching loop (longest_match).  This code is very atypical, since the
internal loop comparing strings is hand unrolled but it almost never rolls,
since the compressed strings tend to be all different.  GCC mispredicts this,
moving some stuff out of the loop, and bb-reorder aligns the code in a way that
the default path not doing the loop is jumping pretty far, hurting decode
bandwidth of K8, especially because the jumps are hard to predict.

I don't see any direct things in the code that heuristics can use to realize
that the loop is not rolling, except for special casing the particular
benchmark.

FDO scores of gzip are not doing that bad, but there is still a gap relative to
ICC (even an archaic version of it running 32bit compared to 64bit GCC).
http://www.suse.de/~gcctest/SPEC-britten/CINT/sandbox-britten-FDO/index.html
It would be nice to convince gzip/zlibc/bzip2 people to use profiling by
default in the build process - those packages are ideal targets.

But since Core is not as sensitive to code alignment and number of jumps as K8,
perhaps there are extra problems demonstrated by this.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2007-12-10 Thread rguenth at gcc dot gnu dot org


--- Comment #3 from rguenth at gcc dot gnu dot org  2007-12-10 10:52 ---
I don't think this qualifies as a 4.3 regression -
http://www.suse.de/~gcctest/SPEC/CINT/sb-haydn-head-64-32o-32bit/index.html
shows that while there were jumps, the numbers close to the 4.2 release are
actually quite similar to what we have now.  So, unless somebody produces
numbers with 4.2 or earlier, this is not a 'regression', but a
missed-optimization only.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org
          Component|target                      |tree-optimization
           Keywords|                            |missed-optimization
            Summary|[4.3 regression] non-optimal|non-optimal inlining
                   |inlining heuristics         |heuristics pessimizes gzip
                   |pessimizes gzip SPEC score  |SPEC score at -O3
                   |at -O3                      |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2007-12-10 Thread ubizjak at gmail dot com


--- Comment #4 from ubizjak at gmail dot com  2007-12-10 12:31 ---
(In reply to comment #3)
> I don't think this qualifies as a 4.3 regression -

Fair enough. It looks that this problem is specific to Core2.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2007-12-10 Thread ubizjak at gmail dot com


--- Comment #5 from ubizjak at gmail dot com  2007-12-10 17:12 ---
(In reply to comment #4)

> Fair enough. It looks that this problem is specific to Core2.

Here are timings with 'gcc version 4.3.0 20071201 (experimental) [trunk
revision 130554] (GCC)' on

vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 CPU X6800  @ 2.93GHz
stepping: 5
cpu MHz : 2933.389
cache size  : 4096 KB

-mtune=generic -m32 -O3: 40.763s   [*]
-mtune=generic -m32 -O2: 32.170s
-mtune=core2 -m32 -O3  : 36.850s
-mtune=core2 -m32 -O2  : 32.170s

-mtune=generic -m64 -O3: 28.550s
-mtune=generic -m64 -O2: 28.682s
-mtune=core2 -m64 -O3  : 28.670s
-mtune=core2 -m64 -O2  : 28.714s

With __attribute__((noinline)) to longest_match():

-mtune=generic -m32 -O3: 30.658s
-mtune=generic -m32 -O2: 32.154s
-mtune=core2 -m32 -O3  : 30.690s
-mtune=core2 -m32 -O2  : 32.247s

And with FC6 system compiler 'gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)':

-mtune=generic -m32 -O3: 30.154s   [**]
-mtune=generic -m32 -O2: 30.275s

Comparing [*] to [**], it _is_ a regression, at least on Core2.
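
The size of the regression can be read off the timings directly.  A minimal
sketch of the arithmetic follows; the helper name slowdown_percent is
hypothetical and only used here.

```c
/* Relative slowdown of `slow` versus `fast`, in percent of `fast`. */
double slowdown_percent(double slow, double fast)
{
    return (slow - fast) / fast * 100.0;
}
```

With the numbers above, slowdown_percent(40.763, 30.154) is about 35, i.e. the
4.3.0 -mtune=generic -m32 -O3 binary is roughly 35% slower than the 4.1.1 one,
and slowdown_percent(40.763, 30.658) is about 33 for inlined versus noinline
longest_match() with the same compiler.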


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2007-12-10 Thread rguenther at suse dot de


--- Comment #6 from rguenther at suse dot de  2007-12-10 17:13 ---
Subject: Re:  non-optimal inlining heuristics
 pessimizes gzip SPEC score at -O3

On Mon, 10 Dec 2007, ubizjak at gmail dot com wrote:

> (In reply to comment #4)
>
> > Fair enough. It looks that this problem is specific to Core2.
>
> Here are timings with 'gcc version 4.3.0 20071201 (experimental) [trunk
> revision 130554] (GCC)' on
>
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 15
> model name  : Intel(R) Core(TM)2 CPU X6800  @ 2.93GHz
> stepping: 5
> cpu MHz : 2933.389
> cache size  : 4096 KB
>
> -mtune=generic -m32 -O3: 40.763s   [*]
> -mtune=generic -m32 -O2: 32.170s
> -mtune=core2 -m32 -O3  : 36.850s
> -mtune=core2 -m32 -O2  : 32.170s
>
> -mtune=generic -m64 -O3: 28.550s
> -mtune=generic -m64 -O2: 28.682s
> -mtune=core2 -m64 -O3  : 28.670s
> -mtune=core2 -m64 -O2  : 28.714s
>
> With __attribute__((noinline)) to longest_match():
>
> -mtune=generic -m32 -O3: 30.658s
> -mtune=generic -m32 -O2: 32.154s
> -mtune=core2 -m32 -O3  : 30.690s
> -mtune=core2 -m32 -O2  : 32.247s
>
> And with FC6 system compiler 'gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)':
>
> -mtune=generic -m32 -O3: 30.154s   [**]
> -mtune=generic -m32 -O2: 30.275s
>
> Comparing [*] to [**], it _is_ a regression, at least on Core2.

FSF GCC 4.1 does not have -mtune=generic.

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761



[Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

2007-12-10 Thread ubizjak at gmail dot com


--- Comment #7 from ubizjak at gmail dot com  2007-12-10 17:26 ---
(In reply to comment #6)

> FSF GCC 4.1 does not have -mtune=generic.

OK, OK. Now with 'gcc version 4.1.3 20070716 (prerelease)':

-m32 -O2: 29.306s
-m32 -O3: 29.582s

I don't have 4.2 here.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761