On Tue, 12 Apr 2005, Mike Stump wrote:

> On Tuesday, April 12, 2005, at 06:38 AM, Karel Gardas wrote:
> > Especially: ``Currently gcc takes a cache miss every 20 instructions, or
> > some ungodly number, and that really saps performance.''
> >
> > but I don't know if this is just an 1st April fool joke
>
> Nope, no joke.  The exact number will vary from machine to machine, and
> testcase to testcase, but it is much lower than most workloads.
>
> > or the reality and if I understand "cache miss" right and if this is
> > L1 or L2 cache miss.
>
> D3 miss as I recall.
>
> cachegrind can also be used to estimate the number (though, not sure
> how accurate it is, possibly not very).  I use Shark to actually get
> the real number.

Perhaps cachegrind is wrong, or cache misses differ from platform to
platform, but I would say that I get very good numbers for gcc running
on an x86 platform:

==2634== I   refs:      6,930,091,914
==2634== I1  misses:       21,057,598
==2634== L2i misses:        1,748,958
==2634== I1  miss rate:          0.30%
==2634== L2i miss rate:           0.2%
==2634==
==2634== D   refs:      3,547,549,621  (2,283,456,901 rd + 1,264,092,720 wr)
==2634== D1  misses:       27,091,035  (   24,245,031 rd +     2,846,004 wr)
==2634== L2d misses:        9,560,838  (    7,447,877 rd +     2,112,961 wr)
==2634== D1  miss rate:           0.7% (          1.0%   +           0.2%  )
==2634== L2d miss rate:           0.2% (          0.3%   +           0.1%  )
==2634==
==2634== L2 refs:          48,148,633  (   45,302,629 rd +     2,846,004 wr)
==2634== L2 misses:        11,309,796  (    9,196,835 rd +     2,112,961 wr)
==2634== L2 miss rate:            0.1% (          0.0%   +           0.1%  )
--2634--
--2634-- Distinct files:   161
--2634-- Distinct fns:     2988
--2634-- BB lookups:       296865
--2634-- With full      debug info: 96% (286724)
--2634-- With file/line debug info:  0% (0)
--2634-- With fn name   debug info:  2% (7538)
--2634-- With no        debug info:  0% (2603)
--2634-- BBs Retranslated: 0
--2634-- Distinct instrs:  243399
--2634-- TT/TC: 0 tc sectors discarded.
--2634--        56030 chainings, 0 unchainings.
--2634-- translate: new 53466 (817978 -> 7795557; ratio 95:10)
--2634--            discard 0 (0 -> 0; ratio 0:10).
--2634--  dispatch: 1408000000 jumps (bb entries), of which 229222501 (16%) were unchained.
--2634--            28161/1069704 major/minor sched events.  540307 tt_fast misses.
--2634-- reg-alloc: 398 t-req-spill, 1509791+1072 orig+spill uis, 196410 total-reg-r.
--2634--    sanity: 28162 cheap, 1127 expensive checks.
--2634--    ccalls: 243735 C calls, 81% saves+restores avoided (1183722 bytes)
--2634--            379914 args, avg 0.71 setup instrs each (214802 bytes)
--2634--            0% clear the stack (731205 bytes)
--2634--            0 retvals, 100% of reg-reg movs avoided (0 bytes)

That's with valgrind 1.9.8 "simulating" an AMD64 processor with 512KB of
L2 cache:

==2634== Startup, with flags:
==2634==    --suppressions=/opt/valgrind/lib/valgrind/default.supp
==2634==    --I1=65536,2,64
==2634==    --D1=65536,2,64
==2634==    --L2=524288,8,64
==2634==    -v
==2634== Cache configuration used:
==2634==   I1: 65536B, 2-way, 64B lines
==2634==   D1: 65536B, 2-way, 64B lines
==2634==   L2: 524288B, 8-way, 64B lines

The program being run is gcc 3.4.2 compiling one of the MICO demos (i.e.
quite a load of C++ headers).

Just to be sure that my valgrind is reporting "correct" numbers, I also
tested compiling a simple C++ hello world (iostream-based) on a real
Opteron, and the numbers (obtained from valgrind 2.2.0 on FC3) were also
quite optimistic (here the gcc being run is actually 3.3.x):

==4107== I   refs:      568,524,260
==4107== I1  misses:      3,448,484
==4107== L2i misses:         60,065
==4107== I1  miss rate:        0.60%
==4107== L2i miss rate:         0.1%
==4107==
==4107== D   refs:      303,765,394  (187,496,999 rd + 116,268,395 wr)
==4107== D1  misses:      2,397,678  (  1,937,986 rd +     459,692 wr)
==4107== L2d misses:        462,261  (    141,702 rd +     320,559 wr)
==4107== D1  miss rate:         0.7% (        1.0%   +         0.3%  )
==4107== L2d miss rate:         0.1% (        0.0%   +         0.2%  )
==4107==
==4107== L2 refs:         5,846,162  (  5,386,470 rd +     459,692 wr)
==4107== L2 misses:         522,326  (    201,767 rd +     320,559 wr)
==4107== L2 miss rate:          0.0% (        0.0%   +         0.2%  )
--4107--
--4107-- Distinct files:   1
--4107-- Distinct fns:     221
--4107-- Distinct lines:   221
--4107-- Distinct instrs:  43222
--4107-- BB lookups:       192038
--4107-- With full      debug info:  0% (0)
--4107-- With file/line debug info:  0% (0)
--4107-- With fn name   debug info:  9% (18970)
--4107-- With no        debug info: 90% (173068)
--4107-- BBs Retranslated: 0
--4107-- TT/TC: 0 tc sectors discarded.
--4107--        115853 tt_fast misses.
--4107-- translate: new 43222 (637547 -> 6316429; ratio 99:10)
--4107--            discard 0 (0 -> 0; ratio 0:10).
--4107-- chainings: 42936 chainings, 0 unchainings.
--4107--  dispatch: 117400000 jumps (bb entries); of them 16886447 (14%) unchained.
--4107--            3015/119664 major/minor sched events.
--4107-- reg-alloc: 494 t-req-spill, 1187932+1406 orig+spill uis,
--4107--            158280 total-reg-rank
--4107--    sanity: 3016 cheap, 121 expensive checks.
--4107--    ccalls: 192139 C calls, 81% saves+restores avoided (926196 bytes)
--4107--            301769 args, avg 0.72 setup instrs each (167792 bytes)
--4107--            0% clear the stack (576114 bytes)
--4107--            101 retvals, 40% of reg-reg movs avoided (80 bytes)

Both compilations used the -O0 optimization level.

> If you can get the SPEC ratings of the machine, you can then just pull
> out the gcc specint number, and have a rough guess what type of compile
> time performance you would get.  A open mosix cluster with 4 cheap
> machines I suspect will compile faster (price/performance) than one
> big, expensive box (rough guess).
>
> We talked about this before, see:
>
>   http://gcc.gnu.org/ml/gcc/2002-08/msg00853.html
>   http://gcc.gnu.org/ml/gcc/2002-08/msg00886.html
>   http://gcc.gnu.org/ml/gcc/2002-08/msg01174.html
>   http://gcc.gnu.org/ml/gcc/2002-08/msg00763.html
>
> for examples...

Either cachegrind is wrong, or gcc has gotten much better since that
time? Or am I interpreting the data cachegrind provides in the wrong way?
What do you think?

Thanks,
Karel
--
Karel Gardas                  [EMAIL PROTECTED]
ObjectSecurity Ltd.           http://www.objectsecurity.com
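
To compare the quoted "cache miss every 20 instructions" figure with the cachegrind totals above, here is a rough Python sketch that converts the reported counts into "instructions per miss". It rests on two assumptions that are mine, not cachegrind's: "I refs" is taken as the number of instructions executed, and I1 and D1 misses are summed to give first-level misses. The dictionary keys and function names are made up for illustration. The counts come from a simulated cache, so the result is only an estimate.

```python
# Totals copied from the two cachegrind summaries in this message.
RUNS = {
    "gcc 3.4.2, MICO demo": {
        "i_refs": 6_930_091_914,
        "i1_misses": 21_057_598,
        "d1_misses": 27_091_035,
        "l2_misses": 11_309_796,
    },
    "gcc 3.3.x, hello world": {
        "i_refs": 568_524_260,
        "i1_misses": 3_448_484,
        "d1_misses": 2_397_678,
        "l2_misses": 522_326,
    },
}


def instructions_per_miss(i_refs: int, misses: int) -> float:
    """Average number of instructions executed between two misses."""
    return i_refs / misses


def summarize(run: dict) -> dict:
    """Instructions per first-level miss and per L2 miss for one run.

    Assumption: first-level misses = I1 misses + D1 misses.
    """
    l1_misses = run["i1_misses"] + run["d1_misses"]
    return {
        "per_l1_miss": instructions_per_miss(run["i_refs"], l1_misses),
        "per_l2_miss": instructions_per_miss(run["i_refs"], run["l2_misses"]),
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        s = summarize(run)
        print(f"{name}: one L1 miss per {s['per_l1_miss']:.0f} instructions, "
              f"one L2 miss per {s['per_l2_miss']:.0f} instructions")
```

Under those assumptions, both runs come out at roughly a hundred instructions or more per first-level miss, and several hundred or more per L2 miss. That is well above 20, which is why these simulated numbers look so optimistic next to the quoted claim.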