Hi all,

Today I followed up my investigations on this subject.

Concerning the first question, about remote cache samples, I am still
not able to understand why I get this kind of event instead of local
RAM accesses. I modified my memory allocation function to use the
numa_alloc functions to make sure the memory is physically allocated
where I want. When the memory is allocated on the same node as the
thread running my program, I still see remote cache accesses and no
RAM accesses. When it is allocated on another node, I see remote RAM
accesses, as expected.
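
For reference, the allocation and pinning are done roughly like this
(a simplified sketch; node and core ids are hard-coded here just for
illustration):

#define _GNU_SOURCE
#include <numa.h>        /* numa_alloc_onnode, link with -lnuma */
#include <sched.h>       /* cpu_set_t, sched_setaffinity */
#include <stddef.h>

#define ARRAY_SIZE (120UL * 1024 * 1024)

/* Allocate the array physically backed on the given NUMA node. */
static char *alloc_on_node(int node)
{
        char *buf = numa_alloc_onnode(ARRAY_SIZE, node);

        /* Touch every page so the pages are actually faulted in
         * before the measurement starts. */
        for (size_t i = 0; i < ARRAY_SIZE; i += 4096)
                buf[i] = 0;
        return buf;
}

/* Pin the calling thread to the given core. */
static void pin_to_core(int core)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);
}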

Does anybody have an explanation of the exact meaning of these remote
cache events that could explain why I get them?
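
In case it matters, this is roughly how I classify each sample's data
source (a sketch; the PERF_MEM_LVL_* bits come from
linux/perf_event.h, and as far as I understand "remote cache (1 hop)"
corresponds to PERF_MEM_LVL_REM_CCE1):

#include <linux/perf_event.h>  /* union perf_mem_data_src, PERF_MEM_LVL_* */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* data_src is the raw PERF_SAMPLE_DATA_SRC value read from the
 * sample in the mmap'ed ring buffer. */
static void classify_sample(uint64_t data_src)
{
        union perf_mem_data_src src = { .val = data_src };

        if (src.mem_lvl & PERF_MEM_LVL_LOC_RAM)
                printf("local RAM\n");
        else if (src.mem_lvl & PERF_MEM_LVL_REM_CCE1)
                printf("remote cache (1 hop)\n");
        else if (src.mem_lvl & (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2))
                printf("remote RAM\n");
        else if (src.mem_lvl & (PERF_MEM_LVL_L1 | PERF_MEM_LVL_L2 |
                                PERF_MEM_LVL_L3 | PERF_MEM_LVL_LFB))
                printf("local cache\n");
        else
                printf("other (mem_lvl = 0x%" PRIx64 ")\n",
                       (uint64_t)src.mem_lvl);
}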

Regarding the number of samples, I first counted the number of memory
loads performed by the sampled function (statically, by looking at the
code generated by gcc). I then compared this number with the core
event MEM_INST_RETIRED.LOADS, and they are very close (measured =
15001804, expected = 15001804). The difference must come from the
loads generated by the start/stop ioctl calls and maybe other things
happening behind the scenes that I am missing; in any case this is
consistent. With a sampling period of 1000 events I would therefore
expect roughly 15,000 samples, but I only get 455. For a period of
5000 I get 91 samples, and for a period of 10000, 45 samples.
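
For completeness, the sampling event is set up roughly as follows.
This is a minimal sketch: the raw encoding assumes the
Nehalem/Westmere MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD event
(event 0x0B, umask 0x10, latency threshold in config1; please
double-check against the SDM for your CPU), and precise_ip is set
because PEBS is needed to get PERF_SAMPLE_DATA_SRC:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Open the load-latency sampling event on the calling process,
 * restricted to the given core. */
static int open_load_latency_event(int core, unsigned long period)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_RAW;
        attr.config         = 0x100b;   /* umask 0x10, event 0x0B */
        attr.config1        = 3;        /* latency threshold in cycles */
        attr.sample_period  = period;
        attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                              PERF_SAMPLE_DATA_SRC;
        attr.precise_ip     = 2;        /* PEBS */
        attr.disabled       = 1;        /* started later with ioctl(ENABLE) */
        attr.exclude_kernel = 1;

        /* pid = 0, cpu = core: count this process only while it runs
         * on that core. */
        return syscall(__NR_perf_event_open, &attr, 0, core, -1, 0);
}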

So my question is: what is the reason for this low number of samples?

1- Do some memory accesses have a latency smaller than the 3-cycle
threshold and are thus not sampled at all?

2- Is it related to the time taken by the PMI handler being too large?
My benchmarking code is only doing loads and no computation at all.

3- Any other idea?

Manu

PS: Since this list is called perf-users, maybe I should ask this kind
of question on another list. If so, please let me know where. Thanks.

2014/1/17 Manuel Selva <selva.man...@gmail.com>:
> Hi all,
>
> I wrote a benchmarking program in order to play with the perf_event_open
> memory sampling capabilities (as discussed earlier on this list).
>
> My benchmark allocates a large array of memory (120 MB by default). I then
> start sampling with perf_event_open and the
> MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD event (with threshold = 3, the
> minimal one according to Intel's doc), along with memory access counting
> (uncore event QMC_NORMAL_READ.ANY), and I access all the allocated memory
> either sequentially or randomly. The benchmark is single threaded, pinned to a
> given core, and the memory is allocated on the node associated with this core
> using the numa_alloc functions. perf_event_open is thus called to monitor only
> this core/NUMA node. The code is compiled without any optimization and I
> tried, as much as possible, to increase the ratio of memory access code vs
> branching and other code.
>
> I can clearly see from the QMC_NORMAL_READ.ANY event count that the random
> access test case generates far more memory accesses than the sequential one
> (I guess the prefetcher and the sequential access pattern are responsible for
> that).
>
> Regarding the sampled event, I successfully mmap the result of
> perf_event_open and I am able to read the samples. My problem is that even
> for the random access test case on the 120 MB, I don't have any samples
> served by the RAM (I am using the PERF_SAMPLE_DATA_SRC field), with a
> sampling period of 1000 events and only ~700 samples in total. Nevertheless,
> in the sequential case 0.01% of my samples are remote cache accesses (1 hop),
> whereas this percentage is ~20% in the random case.
>
> So my questions are:
>
> - What exactly is remote cache (1 hop)? Data found in another core's
> private cache (L1 or L2 in my case) on the same processor?
>
> - How should I interpret my results? I was expecting the number of local
> memory access samples to increase in the random case, not remote cache
> (1 hop) samples. How can I get remote cache accesses with malloc'ed data
> that is not shared and is used by a thread pinned to a given core?
>
> - I didn't look at it in detail yet, but ~700 samples for accessing 120 MB
> while sampling every 1000 events seems small. I am going to check the
> generated assembly code and also count the total number of memory requests
> (a core event) and compare that with the 700 samples * 1000 events.
>
> Thanks in advance for any suggestions you may have in order to help
> understand what's happening there.
>
> --
> Manu