Ken,

On Fri, Jul 3, 2009 at 10:01 PM, Kenneth Hoste<kenneth.ho...@ugent.be> wrote:
>
> On Jul 1, 2009, at 07:03 , stephane eranian wrote:
>
>> Kenneth,
>>
>> Let me check on this with Intel.
>
Unfortunately, no news on this front.

I don't know how you've configured the Core i7, but I think
it could be interesting to see what happens when you:
  - disable threads
  - disable the hardware prefetcher and the adjacent cache line prefetch
    via the IA32_MISC_ENABLE MSR

In other words, try to recreate simpler conditions.
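
As a rough sketch of the second item: the snippet below only computes the
value to write back, it does not touch the hardware. The bit positions
(bit 9 for the hardware prefetcher, bit 19 for the adjacent cache line
prefetch) are assumptions taken from the Core 2 description of
IA32_MISC_ENABLE (MSR 0x1A0), so please check them against the Core i7
documentation first. The function names are made up for illustration.

```python
# Assumed layout of IA32_MISC_ENABLE, per the Core 2 documentation.
# These bit positions may not apply to the Core i7: verify before use.
IA32_MISC_ENABLE = 0x1A0
HW_PREFETCH_DISABLE_BIT = 9               # assumption (Core 2)
ADJACENT_LINE_PREFETCH_DISABLE_BIT = 19   # assumption (Core 2)

PREFETCH_DISABLE_MASK = (
    (1 << HW_PREFETCH_DISABLE_BIT)
    | (1 << ADJACENT_LINE_PREFETCH_DISABLE_BIT)
)


def with_prefetchers_disabled(msr_value):
    """Return msr_value with both (assumed) disable bits set."""
    return msr_value | PREFETCH_DISABLE_MASK


def prefetchers_disabled(msr_value):
    """True if both (assumed) disable bits are already set."""
    return (msr_value & PREFETCH_DISABLE_MASK) == PREFETCH_DISABLE_MASK


if __name__ == "__main__":
    current = 0x0  # placeholder: substitute the value read from the MSR
    print(hex(with_prefetchers_disabled(current)))
```

The current value would have to be read from, and the result written to,
every core, for instance with the rdmsr/wrmsr utilities from msr-tools if
they are available on your system. All other bits are left as they were.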


> Thanks! Any news yet?
>
>> Based on the existing documentation, there does not appear to be a bug
>> in either the Core 2 or the Core i7 event tables.
>
> Just today, we made another strange observation. We were seeing a similar
> pattern for L1 data cache misses as we were seeing for L1 instruction
> cache, i.e. a huge overcount for several SPEC CPU2000 benchmarks.
>
> A small plot illustrating the issue is attached to this mail.
>
>
>
> On the Core2 system, I'm using the MEM_LOAD_RETIRED.L1D_MISS event (blue
> bars), while on the Core i7 system, I was using the L1D_CACHE_LD:I_STATE
> event (red bars), counting the number of cache lines accessed which were
> in the invalid state, which should be the same as cache misses.
> Although the latter event probably also includes speculative data
> accesses, the large overcounts don't make sense because the L1-D cache on
> both systems is identical to our knowledge (both in size and
> associativity).
>
> One noticeable example is swim, for which 10x more misses are counted on
> the Core i7 system. Even with speculative accesses (e.g. prefetches and
> accesses on a mispredicted code path), this doesn't make sense to us.
>
> By combining four different event masks from the MEM_LOAD_RETIRED event
> (see graph), we were able to obtain a count for the L1 data cache misses
> which made sense, i.e. which was equal to or lower than the Core2 count.
> Lower is possible because of the new victim cache and loop buffer on the
> Core i7, besides a possibly improved prefetching algorithm.
>
> I don't know if we're missing something obvious here, or if this is related
> to the L1 instruction cache issue, but
> we felt like we needed to share this...
>
> greetings,
>
> Kenneth
>
>>
>>
>> On Tue, Jun 30, 2009 at 2:45 PM, Kenneth Hoste<kenneth.ho...@ugent.be>
>> wrote:
>>>
>>> Hello,
>>>
>>> Just now, Stijn (in CC), a colleague of mine, and I have been seeing
>>> some weird
>>> counts on a Core i7 machine for the SPEC CPU2000 and CPU2006 workloads,
>>> more specifically for the L1 instruction cache misses.
>>>
>>> Comparing the counts on Core i7 with those obtained on a Core 2, Stijn
>>> noticed
>>> unexpected differences, i.e. large overcounts for the Core i7. This is
>>> strange, because
>>> the L1 instruction caches on both types of processors are equally big
>>> (32k), and the more
>>> recent Core i7 has additional features such as a victim cache and a
>>> stream buffer cache.
>>> So, the counts should be (slightly?) lower instead of higher...
>>>
>>> I'm using the perfex tool that comes with the perfctr kernel patch on
>>> both systems,
>>> and also the pfmon tool on the Core i7 system to validate the counts.
>>> On the Core2, I'm using the L1I_MISSES event (event code 81h), on the
>>> Core i7
>>> I'm using the L1I.MISSES event (event code 80h with mask 02h).
>>> More specifically:
>>>
>>> *) Core 2:
>>>
>>> perfex -e 0x410081 ./gcc 200.i -o 200.s
>>>
>>> *) Core i7:
>>>
>>> perfex -e 0x410280 ./gcc 200.i -o 200.s
>>> and
>>> pfmon -e L1I:MISSES ./gcc 200.i -o 200.s
>>>
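
For reference, the raw values passed to perfex above can be decoded with a
small sketch like the one below. It assumes the standard layout of the
Intel event select register (event code in bits 0-7, unit mask in bits
8-15, USR in bit 16, OS in bit 17, enable in bit 22). The function name is
only illustrative.

```python
def decode_evtsel(raw):
    """Split a raw event-select value into its main fields."""
    return {
        "event": raw & 0xFF,             # bits 0-7: event code
        "umask": (raw >> 8) & 0xFF,      # bits 8-15: unit mask
        "usr": bool(raw & (1 << 16)),    # count in user mode
        "os": bool(raw & (1 << 17)),     # count in kernel mode
        "enable": bool(raw & (1 << 22)),  # counter enabled
    }


if __name__ == "__main__":
    for name, raw in (("Core 2", 0x410081), ("Core i7", 0x410280)):
        f = decode_evtsel(raw)
        print(f"{name}: event=0x{f['event']:02x} umask=0x{f['umask']:02x} "
              f"usr={f['usr']} os={f['os']} enable={f['enable']}")
```

Under that assumption both values count user-mode events only, with event
0x81 and no unit mask on the Core 2, and event 0x80 with unit mask 0x02 on
the Core i7, which matches the event codes quoted above.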
>>> One example is CPU2000's gcc with the 200.i reference input.
>>>
>>> On the Core 2 we counted ~76M (million) L1-I misses. Also counting
>>> the cycles during which the instruction decoder is stalled due to the
>>> misses
>>> leads to an estimation of roughly 19 cycles penalty for each L1-I
>>> miss, which
>>> makes perfect sense, because the latency of the L2 cache is about 19
>>> cycles.
>>>
>>> On the Core i7 system we counted ~292M L1-I misses, thus a lot more
>>> than on the Core 2 system with the same L1-I cache size. Also counting
>>> cycles during which the decoder is stalled yields a penalty of ~2.1
>>> cycles/miss,
>>> a surprisingly low number because the L2 cache latency is
>>> significantly higher.
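
To make the arithmetic explicit, here is a small sketch that works
backwards from the figures quoted above. The stall-cycle totals are not
measured values from this thread: they are derived here as misses times
penalty, only to show what the numbers imply.

```python
def implied_stall_cycles(misses, penalty_per_miss):
    """Decoder stall cycles implied by a miss count and a per-miss penalty."""
    return misses * penalty_per_miss


# Figures quoted in the mail (approximate).
CORE2_MISSES, CORE2_PENALTY = 76e6, 19.0
I7_MISSES, I7_PENALTY = 292e6, 2.1

core2_stalls = implied_stall_cycles(CORE2_MISSES, CORE2_PENALTY)
i7_stalls = implied_stall_cycles(I7_MISSES, I7_PENALTY)
miss_ratio = I7_MISSES / CORE2_MISSES

if __name__ == "__main__":
    print(f"Core 2 : ~{core2_stalls / 1e6:.0f}M stall cycles")
    print(f"Core i7: ~{i7_stalls / 1e6:.0f}M stall cycles")
    print(f"Core i7 reports ~{miss_ratio:.1f}x the misses of the Core 2")
```

So the Core i7 reports close to four times as many misses while the
implied stall time is less than half that of the Core 2, which is hard to
reconcile with every counted miss being a demand miss that goes to the L2.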
>>>
>>> So, our conclusion is that the L1-I misses event on the Core i7 isn't
>>> counting what
>>> is claimed. The documentation says that the L1I.MISSES event also
>>> includes
>>> streaming buffer and victim cache misses, but to our knowledge those
>>> are only
>>> looked at if the request already misses the L1-I cache. And it says
>>> explicitly that
>>> every L1-I miss is only counted once...
>>>
>>> Does anyone have suggestions on what we might be seeing here? Is it
>>> a problem with the event, or are we misinterpreting what the event is
>>> actually counting?
>>>
>>> Any comments/suggestions are highly appreciated...
>>>
>>> greetings,
>>>
>>> Kenneth
>>> --
>>>
>>> Kenneth Hoste
>>> Paris research group - ELIS - Ghent University, Belgium
>>> email: kenneth.ho...@elis.ugent.be
>>> website: http://www.elis.ugent.be/~kehoste
>>> blog: http://boegel.kejo.be
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> perfmon2-devel mailing list
>>> perfmon2-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
>>>
>
> --
>
> Kenneth Hoste
> Paris research group - ELIS - Ghent University, Belgium
> email: kenneth.ho...@elis.ugent.be
> website: http://www.elis.ugent.be/~kehoste
> blog: http://boegel.kejo.be
>
>
>
