gcc cache misses [was: Re: OT: How is memory latency important on AMD64 box while compiling large C/C++ sources]

Karel Gardas Tue, 12 Apr 2005 12:59:14 -0700

On Tue, 12 Apr 2005, Mike Stump wrote:

> On Tuesday, April 12, 2005, at 06:38  AM, Karel Gardas wrote:
> > Especially: ``Currently gcc takes a cache miss every 20 instructions, or
> > some ungodly number, and that really saps performance.''
> >
> > but I don't know if this is just a 1st April fool's joke
>
> Nope, no joke.  The exact number will vary from machine to machine, and
> testcase to testcase, but it is much lower than most workloads.
>
> > or the reality, and if I understand "cache miss" right, and if this is
> > an L1 or L2 cache miss.
>
> D3 miss as I recall.
>
> cachegrind can also be used to estimate the number (though, not sure
> how accurate it is, possibly not very).  I use Shark to actually get
> the real number.


It's possible that cachegrind is wrong, or that cache misses differ from
platform to platform, but I would say that I get very good numbers for
gcc running on an x86 platform:

==2634== I   refs:      6,930,091,914
==2634== I1  misses:       21,057,598
==2634== L2i misses:        1,748,958
==2634== I1  miss rate:          0.30%
==2634== L2i miss rate:           0.2%
==2634==
==2634== D   refs:      3,547,549,621  (2,283,456,901 rd + 1,264,092,720 wr)
==2634== D1  misses:       27,091,035  (   24,245,031 rd +     2,846,004 wr)
==2634== L2d misses:        9,560,838  (    7,447,877 rd +     2,112,961 wr)
==2634== D1  miss rate:           0.7% (          1.0%   +           0.2%  )
==2634== L2d miss rate:           0.2% (          0.3%   +           0.1%  )
==2634==
==2634== L2 refs:          48,148,633  (   45,302,629 rd +     2,846,004 wr)
==2634== L2 misses:        11,309,796  (    9,196,835 rd +     2,112,961 wr)
==2634== L2 miss rate:            0.1% (          0.0%   +           0.1%  )
--2634--
--2634-- Distinct files:   161
--2634-- Distinct fns:     2988
--2634-- BB lookups:       296865
--2634-- With full      debug info: 96% (286724)
--2634-- With file/line debug info:  0% (0)
--2634-- With fn name   debug info:  2% (7538)
--2634-- With no        debug info:  0% (2603)
--2634-- BBs Retranslated: 0
--2634-- Distinct instrs:  243399
--2634--     TT/TC: 0 tc sectors discarded.
--2634--            56030 chainings, 0 unchainings.
--2634-- translate: new     53466 (817978 -> 7795557; ratio 95:10)
--2634--            discard 0 (0 -> 0; ratio 0:10).
--2634--  dispatch: 1408000000 jumps (bb entries), of which 229222501 (16%) were unchained.
--2634--            28161/1069704 major/minor sched events.  540307 tt_fast misses.
--2634-- reg-alloc: 398 t-req-spill, 1509791+1072 orig+spill uis, 196410 total-reg-r.
--2634--    sanity: 28162 cheap, 1127 expensive checks.
--2634--    ccalls: 243735 C calls, 81% saves+restores avoided (1183722 bytes)
--2634--            379914 args, avg 0.71 setup instrs each (214802 bytes)
--2634--            0% clear the stack (731205 bytes)
--2634--            0 retvals, 100% of reg-reg movs avoided (0 bytes)


That's with valgrind 1.9.8 "simulating" an AMD64 processor with 512KB of L2 cache:

==2634== Startup, with flags:
==2634==    --suppressions=/opt/valgrind/lib/valgrind/default.supp
==2634==    --I1=65536,2,64
==2634==    --D1=65536,2,64
==2634==    --L2=524288,8,64
==2634==    -v
==2634== Cache configuration used:
==2634==   I1: 65536B, 2-way, 64B lines
==2634==   D1: 65536B, 2-way, 64B lines
==2634==   L2: 524288B, 8-way, 64B lines
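
For reference, each of the --I1/--D1/--L2 triples above is size, associativity and line size, in bytes. A small sketch of what that configuration implies in cache lines and sets (the function name cache_geometry is only an illustration, it is not part of valgrind):

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets

# Values copied from the "Cache configuration used" block above.
CONFIG = {
    "I1": (65536, 2, 64),
    "D1": (65536, 2, 64),
    "L2": (524288, 8, 64),
}

for name, (size, ways, line) in CONFIG.items():
    lines, sets = cache_geometry(size, ways, line)
    print(f"{name}: {size // 1024} KB, {lines} lines, {sets} sets of {ways} ways")
```

So the simulated L1 caches hold 1024 lines each and the L2 holds 8192 lines, which is what the miss counts are measured against.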


The program being run is gcc 3.4.2 compiling one of the MICO demos (i.e. quite a
load of C++ headers). Just to be sure that my valgrind is reporting
"correct" numbers, I also tested compiling a simple C++ hello world
(iostream-based) on a real Opteron, and the numbers (obtained from valgrind
2.2.0 on FC3) were also quite optimistic (the gcc running here is actually
3.3.x):


==4107== I   refs:      568,524,260
==4107== I1  misses:      3,448,484
==4107== L2i misses:         60,065
==4107== I1  miss rate:        0.60%
==4107== L2i miss rate:         0.1%
==4107==
==4107== D   refs:      303,765,394  (187,496,999 rd + 116,268,395 wr)
==4107== D1  misses:      2,397,678  (  1,937,986 rd +     459,692 wr)
==4107== L2d misses:        462,261  (    141,702 rd +     320,559 wr)
==4107== D1  miss rate:         0.7% (        1.0%   +         0.3%  )
==4107== L2d miss rate:         0.1% (        0.0%   +         0.2%  )
==4107==
==4107== L2 refs:         5,846,162  (  5,386,470 rd +     459,692 wr)
==4107== L2 misses:         522,326  (    201,767 rd +     320,559 wr)
==4107== L2 miss rate:          0.0% (        0.0%   +         0.2%  )
--4107--
--4107-- Distinct files:   1
--4107-- Distinct fns:     221
--4107-- Distinct lines:   221
--4107-- Distinct instrs:  43222
--4107-- BB lookups:       192038
--4107-- With full      debug info:  0% (0)
--4107-- With file/line debug info:  0% (0)
--4107-- With fn name   debug info:  9% (18970)
--4107-- With no        debug info: 90% (173068)
--4107-- BBs Retranslated: 0
--4107--     TT/TC: 0 tc sectors discarded.
--4107--            115853 tt_fast misses.
--4107-- translate: new     43222 (637547 -> 6316429; ratio 99:10)
--4107--            discard 0 (0 -> 0; ratio 0:10).
--4107-- chainings: 42936 chainings, 0 unchainings.
--4107--  dispatch: 117400000 jumps (bb entries); of them 16886447 (14%) unchained.
--4107--            3015/119664 major/minor sched events.
--4107-- reg-alloc: 494 t-req-spill, 1187932+1406 orig+spill uis,
--4107--            158280 total-reg-rank
--4107--    sanity: 3016 cheap, 121 expensive checks.
--4107--    ccalls: 192139 C calls, 81% saves+restores avoided (926196 bytes)
--4107--            301769 args, avg 0.72 setup instrs each (167792 bytes)
--4107--            0% clear the stack (576114 bytes)
--4107--            101 retvals, 40% of reg-reg movs avoided (80 bytes)


Both compilations used the -O0 optimization level.
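
To put these counts on the same footing as the "cache miss every 20 instructions" claim, one can divide instruction references by misses. A rough sketch using only the totals from the two cachegrind summaries above (the helper name instrs_per_miss and the run labels are made up for illustration, and "I refs" is taken here as the number of instructions executed):

```python
def instrs_per_miss(i_refs, misses):
    """Average number of executed instructions between two misses."""
    return i_refs / misses

# Totals copied from the two cachegrind summaries above.
RUNS = {
    "gcc 3.4.2, MICO demo": {
        "i_refs": 6_930_091_914,
        "i1_misses": 21_057_598,
        "d1_misses": 27_091_035,
        "l2_misses": 11_309_796,
    },
    "gcc 3.3.x, hello world": {
        "i_refs": 568_524_260,
        "i1_misses": 3_448_484,
        "d1_misses": 2_397_678,
        "l2_misses": 522_326,
    },
}

for name, r in RUNS.items():
    # L1 misses are instruction-side plus data-side misses.
    l1 = instrs_per_miss(r["i_refs"], r["i1_misses"] + r["d1_misses"])
    l2 = instrs_per_miss(r["i_refs"], r["l2_misses"])
    print(f"{name}: one L1 miss per {l1:.0f} instrs, one L2 miss per {l2:.0f} instrs")
```

If the simulation can be trusted, that works out to an L1 miss roughly every 100 to 150 instructions and an L2 miss roughly every 600 to 1100 instructions, which is far from one miss per 20 instructions.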


> If you can get the SPEC ratings of the machine, you can then just pull
> out the gcc specint number, and have a rough guess what type of compile
> time performance you would get.  An open mosix cluster with 4 cheap
> machines I suspect will compile faster (price/performance) than one
> big, expensive box (rough guess).
>
> We talked about this before, see:
>
> http://gcc.gnu.org/ml/gcc/2002-08/msg00853.html
> http://gcc.gnu.org/ml/gcc/2002-08/msg00886.html
> http://gcc.gnu.org/ml/gcc/2002-08/msg01174.html
> http://gcc.gnu.org/ml/gcc/2002-08/msg00763.html
>
> for examples...

Either cachegrind is wrong, or gcc has become much better since that time? Or
am I interpreting the data cachegrind provides in the wrong way? What do you
think?

Thanks,
Karel
--
Karel Gardas                  [EMAIL PROTECTED]
ObjectSecurity Ltd.           http://www.objectsecurity.com
