Some instructions cause the hardware thread to stall for a few cycles; in your simple test, these are probably the taken branches and the loads (even those that hit in L1$). Disassemble your loop and the cause will probably be obvious.
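As a rough illustration, here is a guess at what such a test loop might look like in C, together with the arithmetic behind the numbers you reported. The function names count_to and cycles_per_instruction are made up for this sketch, and the volatile qualifier is an assumption added so the compiler keeps the load and store in every iteration; your actual loop and its disassembly may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the test loop. The volatile counter forces
 * a load and a store on every iteration, and the loop itself ends in a
 * compare and a taken branch back to the top. */
uint64_t count_to(uint64_t n)
{
    volatile uint64_t counter = 0;
    for (uint64_t i = 0; i < n; i++) {
        counter++;              /* load, add, store */
    }
    return counter;
}

/* Average cycles per instruction, given the clock rate and the measured
 * instruction rate (for example from Instr_cnt over elapsed time). */
double cycles_per_instruction(double clock_hz, double instr_per_sec)
{
    assert(instr_per_sec > 0.0);
    return clock_hz / instr_per_sec;
}
```

With the figures from your message, cycles_per_instruction(1e9, 250e6) comes out to 4, meaning the single thread retires one instruction every four cycles on average. The other three cycles are the stalls described above.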
When multiple threads run on the core, the stall cycles for one thread are consumed by useful work performed by another thread.

- Steve Sistare

Elad Lahav wrote:
> I am toying around with a T1000 machine (T1 1GHz processor, 8 cores,
> 4-threads per core, 8GB RAM). I was unable to saturate a single Gigabit NIC
> with netperf, so I started investigating with the help of performance
> counters. It turns out that even a simple for loop that only increments a
> counter can do at most 250 million instructions per second (hardly any
> cache/TLB misses, as expected). From my understanding of the Niagara
> architecture, a single thread executing on a core should be able to fully
> utilise it (1 billion instructions per second in my case).
>
> What am I missing?
>
> Thanks,
> Elad
>
> P.S., I am tracking performance with cputrack -c Instr_cnt,sys
>
> This message posted from opensolaris.org
> _______________________________________________
> perf-discuss mailing list
> perf-discuss@opensolaris.org