Ken,
On Fri, Jul 3, 2009 at 10:01 PM, Kenneth Hoste<kenneth.ho...@ugent.be> wrote:
>
> On Jul 1, 2009, at 07:03 , stephane eranian wrote:
>
>> Kenneth,
>>
>> Let me check on this with Intel.
>
Unfortunately, no news on this front.

I don't know how you've configured the Core i7, but I think it could be
interesting to see what happens when you:
  - disable threads
  - disable hardware prefetch, adjacent prefetch via IA32_MISC_ENABLE MSR

In other words, try to recreate simpler conditions.

> Thanks! Any news yet?
>
>> It does not appear like there is a bug in either Core 2 or Core i7 event
>> tables based on existing documentation.
>
> Just today, we made another strange observation. We were seeing a similar
> pattern for L1 data cache misses as we were seeing for the L1 instruction
> cache, i.e. a huge overcount for several SPEC CPU2000 benchmarks.
>
> A small plot illustrating the issue is attached to this mail.
>
> On the Core2 system, I'm using the MEM_LOAD_RETIRED.L1D_MISS event (blue
> bars), while on the Core i7 system, I was using the L1D_CACHE_LD:I_STATE
> event (red bars), counting the number of cache lines accessed which were
> in the invalid state, which should be the same as cache misses.
> Although the latter event probably also includes speculative data
> accesses, the large overcounts don't make sense because the L1-D cache on
> both systems is identical to our knowledge (both in size and
> associativity).
>
> One noticeable example is swim, for which 10x more misses are counted on
> the Core i7 system. Even with speculative accesses (e.g. prefetches and
> accesses on a mispredicted code path), this doesn't make sense to us.
>
> By combining four different event masks from the MEM_LOAD_RETIRED event
> (see graph), we were able to obtain a count for the L1 data cache misses
> which made sense, i.e. which was equal to or lower than the Core2 counts.
> Lower is possible because of the new victim cache and loop buffer on the
> Core i7, besides a possibly improved prefetching algorithm.
>
> I don't know if we're missing something obvious here, or if this is
> related to the L1 instruction cache issue, but we felt like we needed to
> share this...
>
> greetings,
>
> Kenneth
>
>>
>>
>> On Tue, Jun 30, 2009 at 2:45 PM, Kenneth Hoste<kenneth.ho...@ugent.be>
>> wrote:
>>>
>>> Hello,
>>>
>>> Just now, Stijn (in CC), a colleague of mine, and I have been seeing
>>> some weird counts on a Core i7 machine for the SPEC CPU2000 and CPU2006
>>> workloads, more specifically for the L1 instruction cache misses.
>>>
>>> Comparing the counts on Core i7 with those obtained on a Core 2, Stijn
>>> noticed unexpected differences, i.e. large overcounts for the Core i7.
>>> This is strange, because the L1 instruction caches on both types of
>>> processors are equally big (32k), and the more recent Core i7 has
>>> additional features such as a victim cache and a stream buffer cache.
>>> So, the counts should be (slightly?) lower instead of higher...
>>>
>>> I'm using the perfex tool that comes with the perfctr kernel patch on
>>> both systems, and also the pfmon tool on the Core i7 system to validate
>>> the counts. On the Core2, I'm using the L1I_MISSES event (event code
>>> 81h), on the Core i7 I'm using the L1I.MISSES event (event code 80h
>>> with mask 02h).
>>> More specifically:
>>>
>>> *) Core 2:
>>>
>>> perfex -e 0x410081 ./gcc 200.i -o 200.s
>>>
>>> *) Core i7:
>>>
>>> perfex -e 0x410280 ./gcc 200.i -o 200.s
>>> and
>>> pfmon -e L1I:MISSES ./gcc 200.i -o 200.s
>>>
>>> One example is CPU2000's gcc with the 200.i reference input set.
>>>
>>> On the Core 2 we counted ~76M (million) L1-I misses.
>>> Also counting the cycles during which the instruction decoder is
>>> stalled due to the misses leads to an estimate of roughly 19 cycles of
>>> penalty for each L1-I miss, which makes perfect sense, because the
>>> latency of the L2 cache is about 19 cycles.
>>>
>>> On the Core i7 system we counted ~292M L1-I misses, thus a lot more
>>> than on the Core 2 system with the same L1-I cache size. Also counting
>>> cycles during which the decoder is stalled yields a penalty of ~2.1
>>> cycles/miss, a surprisingly low number because the L2 cache latency is
>>> significantly higher.
>>>
>>> So, our conclusion is that the L1-I misses event on the Core i7 isn't
>>> counting what is claimed. The documentation says that the L1I.MISSES
>>> event also includes streaming buffer and victim cache misses, but to
>>> our knowledge those are only looked at if the request already misses
>>> the L1-I cache. And it says explicitly that every L1-I miss is only
>>> counted once...
>>>
>>> Does anyone have suggestions on what we might be seeing here? Is it
>>> a problem with the event, or are we misinterpreting what the event is
>>> actually counting?
>>>
>>> Any comments/suggestions are highly appreciated...
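
A quick back-of-the-envelope check on the figures quoted above, as a minimal Python sketch. The miss counts and per-miss penalties are the approximate values from the report; the stall-cycle totals are not measured values, they are only what those two numbers imply when multiplied, and the function and variable names are purely illustrative.

```python
# Approximate figures quoted in the report (gcc from SPEC CPU2000).
core2_misses = 76e6      # ~76M L1-I misses on Core 2
core2_penalty = 19.0     # ~19 stall cycles per miss
i7_misses = 292e6        # ~292M L1-I misses on Core i7
i7_penalty = 2.1         # ~2.1 stall cycles per miss


def implied_stall_cycles(misses, penalty):
    """Stall cycles implied by a miss count and a cycles-per-miss estimate."""
    return misses * penalty


core2_stalls = implied_stall_cycles(core2_misses, core2_penalty)
i7_stalls = implied_stall_cycles(i7_misses, i7_penalty)

# How much larger the Core i7 miss count is than the Core 2 one.
miss_ratio = i7_misses / core2_misses

print(f"Core 2 : {core2_stalls / 1e6:.0f}M implied stall cycles")
print(f"Core i7: {i7_stalls / 1e6:.0f}M implied stall cycles")
print(f"Core i7 / Core 2 miss ratio: {miss_ratio:.2f}x")
```

With these inputs the implied decoder stall total on the Core i7 comes out lower than on the Core 2 even though the miss count is several times higher, which at least fits the suspicion in the report that the event counts more than the misses that actually stall the decoder.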
>>>
>>> greetings,
>>>
>>> Kenneth
>>> --
>>>
>>> Kenneth Hoste
>>> Paris research group - ELIS - Ghent University, Belgium
>>> email: kenneth.ho...@elis.ugent.be
>>> website: http://www.elis.ugent.be/~kehoste
>>> blog: http://boegel.kejo.be
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> perfmon2-devel mailing list
>>> perfmon2-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
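
For reference, the raw codes passed to perfex in the quoted commands can be unpacked with a small Python sketch. It assumes the codes follow the IA32_PERFEVTSELx register layout from the Intel SDM (event select in bits 0-7, unit mask in bits 8-15, USR at bit 16, OS at bit 17, EN at bit 22); the helper name decode_evtsel is made up for illustration and is not part of perfex or pfmon.

```python
def decode_evtsel(raw):
    """Split a raw IA32_PERFEVTSELx-style value into its main fields."""
    return {
        "event": raw & 0xFF,           # bits 0-7: event select
        "umask": (raw >> 8) & 0xFF,    # bits 8-15: unit mask
        "usr": bool(raw & (1 << 16)),  # bit 16: count in user mode
        "os": bool(raw & (1 << 17)),   # bit 17: count in kernel mode
        "en": bool(raw & (1 << 22)),   # bit 22: counter enabled
    }


# The two raw codes used in the perfex commands quoted in this thread.
for name, raw in [("Core 2", 0x410081), ("Core i7", 0x410280)]:
    f = decode_evtsel(raw)
    print(f"{name}: event=0x{f['event']:02x} umask=0x{f['umask']:02x} "
          f"usr={f['usr']} os={f['os']} en={f['en']}")
```

Under that assumption the fields match the event codes named in the report (81h with no mask on the Core 2, 80h with mask 02h on the Core i7), and both codes set only the USR bit, so kernel-mode activity would not be included in either count.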