Hi all,

Can anyone help on this subject, about the low number of samples and the
meaning of the remote cache samples?

Thanks,

Manu

2014-01-23 Manuel Selva <selva.man...@gmail.com>:
> Hi all,
>
> Today I followed up my investigations on this subject.
>
> Concerning the first question, about remote cache samples, I am still
> not able to understand why I get this kind of event instead of local
> RAM accesses. I modified my memory allocation function to use the
> numa_alloc functions, to be sure the memory is physically allocated
> where I want. When the memory is allocated on the same node as the
> thread running my program, I still get remote cache accesses and no
> RAM accesses. When it is on another node, I get remote RAM accesses as
> expected.
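>
> For what it is worth, the allocation side of the benchmark is
> essentially the sketch below (simplified, error handling dropped; the
> core and node numbers are only illustrative parameters):
>
> /* Pin the calling thread to a core and allocate the buffer on a chosen
>  * NUMA node with libnuma (link with -lnuma).  numa_alloc_onnode() only
>  * sets the policy; the pages are physically allocated on first touch. */
> #define _GNU_SOURCE
> #include <numa.h>
> #include <sched.h>
> #include <stdlib.h>
>
> #define BUF_SIZE (120UL * 1024 * 1024)      /* 120 MB */
>
> static char *alloc_on_node(int core, int node)
> {
>     cpu_set_t set;
>
>     CPU_ZERO(&set);
>     CPU_SET(core, &set);
>     if (sched_setaffinity(0, sizeof(set), &set))    /* pin to 'core' */
>         return NULL;
>     if (numa_available() < 0)
>         return NULL;
>     return numa_alloc_onnode(BUF_SIZE, node);       /* freed with numa_free() */
> }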
>
> Does anybody have an explanation of the exact meaning of these remote
> cache events that could explain why I get them?
>
> Regarding the number of samples, I first checked the number of memory
> loads generated by the sampled function (I mean here statically,
> looking at the code generated by gcc). I then compared this number
> with the core event MEM_INST_RETIRED.LOADS, and they are very close
> (measured = 15001804, expected = 15001804). The difference must come
> from the loads generated by the start/stop ioctl calls and maybe other
> things happening behind the scenes that I am missing; anyway, this is
> consistent. However, with a sampling period of 1000 events I would
> naively expect on the order of 15000 samples, yet I only get 455. For
> period = 5000 I get 91 samples, and for period = 10000, 45 samples.
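>
> For reference, the MEM_INST_RETIRED.LOADS count comes from a plain
> counting event (no sampling) opened like the sketch below; the raw
> encoding 0x010B (umask 0x01, event 0x0B) is what I believe is right
> for my CPU, so treat it as illustrative:
>
> #include <linux/perf_event.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
> static int open_load_counter(void)
> {
>     struct perf_event_attr attr;
>
>     memset(&attr, 0, sizeof(attr));
>     attr.type = PERF_TYPE_RAW;
>     attr.size = sizeof(attr);
>     attr.config = 0x010B;          /* MEM_INST_RETIRED.LOADS (illustrative) */
>     attr.disabled = 1;
>     attr.exclude_kernel = 1;
>     /* pid = 0, cpu = -1: this thread, on whatever CPU it runs on */
>     return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
> }
>
> /* Around the measured loop:
>  *   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);  ... run the loads ...
>  *   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
>  *   read(fd, &count, sizeof(count));
>  */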
>
> So my question is: what is the reason for this low number of samples?
>
> 1- Some memory accesses have a latency smaller than the 3-cycle
> threshold and are thus not counted at all?
>
> 2- Is it related to the time taken by the PMI handler being too large?
> My benchmarking code is only doing loads and no computation at all.
>
> 3- Any other idea?
>
> Manu
>
> PS: Because this list is called perf-users, maybe I should ask this
> kind of question on another list. If that is the case, please let me
> know where. Thanks.
>
> 2014/1/17 Manuel Selva <selva.man...@gmail.com>:
>> Hi all,
>>
>> I wrote a benchmarking program in order to play with the perf_event_open
>> memory sampling capabilities (as discussed earlier on this list).
>>
>> My benchmark allocates a large array of memory (120 MB by default),
>> then starts sampling with perf_event_open and the
>> MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD event (with threshold = 3, the
>> minimum according to Intel's documentation), along with memory access
>> counting (uncore event QMC_NORMAL_READ.ANY), and accesses all the
>> allocated memory either sequentially or randomly. The benchmark is
>> single threaded, pinned to a given core, and the memory is allocated on
>> the node associated with this core using the numa_alloc functions.
>> perf_event_open is thus called to monitor only this core/NUMA node. The
>> code is compiled without any optimization, and I tried as much as
>> possible to increase the ratio of memory-access code vs. branching and
>> other code.
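>>
>> The sampling setup is roughly the sketch below (simplified; besides
>> PERF_SAMPLE_DATA_SRC, the other sample_type bits, the raw encoding
>> 0x100B, and the use of config1 for the threshold reflect my
>> understanding of the documentation, so correct me if they are wrong):
>>
>> #include <linux/perf_event.h>
>> #include <string.h>
>> #include <sys/syscall.h>
>> #include <unistd.h>
>>
>> static int open_latency_sampler(int cpu)
>> {
>>     struct perf_event_attr attr;
>>
>>     memset(&attr, 0, sizeof(attr));
>>     attr.type = PERF_TYPE_RAW;
>>     attr.size = sizeof(attr);
>>     attr.config = 0x100B;         /* MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD */
>>     attr.config1 = 3;             /* latency threshold, in cycles */
>>     attr.sample_period = 1000;
>>     attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
>>                        PERF_SAMPLE_WEIGHT | PERF_SAMPLE_DATA_SRC;
>>     attr.precise_ip = 2;          /* PEBS, needed for the data source */
>>     attr.disabled = 1;
>>     attr.exclude_kernel = 1;
>>     /* pid = 0, cpu = the core the benchmark is pinned to */
>>     return syscall(__NR_perf_event_open, &attr, 0, cpu, -1, 0);
>> }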
>>
>> I can clearly see with the QMC_NORMAL_READ.ANY event count that my random
>> access test case generates far more memory accesses than the sequential one
>> (I guess the prefetcher and the sequential access pattern explain that).
>>
>> Regarding the sampled event, I successfully mmap the result of
>> perf_event_open and I am able to read the samples. My problem is that even
>> for the random access test case on the 120 MB array, I don't get any samples
>> served by RAM (I am looking at the PERF_SAMPLE_DATA_SRC field; the sampling
>> period is 1000 events and I only get ~700 samples). Nevertheless, in the
>> sequential case 0.01% of my samples are remote cache accesses (1 hop),
>> whereas this percentage is ~20% in the random case.
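>>
>> The way I read the samples back is basically the sketch below
>> (simplified: no memory barriers and no wrap-around handling, since I
>> only parse the buffer after stopping the event and the buffer is large
>> enough for a few hundred samples). The record layout matches the
>> sample_type used above (IP | ADDR | WEIGHT | DATA_SRC):
>>
>> #include <linux/perf_event.h>
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> struct my_sample {                 /* field order follows sample_type */
>>     struct perf_event_header header;
>>     uint64_t ip;
>>     uint64_t addr;
>>     uint64_t weight;               /* load latency in cycles */
>>     uint64_t data_src;
>> };
>>
>> /* 'base' is the return value of mmap(NULL, (1 + n_pages) * page_size,
>>  * PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0). */
>> static void walk_samples(void *base, size_t page_size)
>> {
>>     struct perf_event_mmap_page *meta = base;
>>     char *data = (char *)base + page_size;  /* data follows the meta page */
>>     uint64_t offset = 0;
>>
>>     while (offset < meta->data_head) {
>>         struct perf_event_header *hdr =
>>             (struct perf_event_header *)(data + offset);
>>
>>         if (hdr->type == PERF_RECORD_SAMPLE) {
>>             struct my_sample *s = (struct my_sample *)hdr;
>>             printf("addr=%#llx lat=%llu data_src=%#llx\n",
>>                    (unsigned long long)s->addr,
>>                    (unsigned long long)s->weight,
>>                    (unsigned long long)s->data_src);
>>         }
>>         offset += hdr->size;
>>     }
>> }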
>>
>> So my questions are:
>>
>> - What exactly is remote cache (1 hop)? Data found in another core's
>> private cache (L1 or L2 in my case) on the same processor? (My decoding
>> of the data_src field is sketched after these questions.)
>>
>> - How can I interpret my results? I was expecting the number of local
>> memory access samples to increase in the random case, not the remote
>> cache (1 hop) ones. How can I get remote cache accesses with malloc'ed
>> data that is not shared and is only used by a thread pinned to a given
>> core?
>>
>> - I haven't looked at this in detail yet, but ~700 samples for accessing
>> 120 MB while sampling every 1000 events seems small. I am going to check
>> the generated assembly code, count the total number of memory requests
>> (a core event), and compare that to the 700 samples * 1000 events.
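>>
>> For reference, the classification of each sample comes from the
>> PERF_MEM_LVL_* bits of the data_src value, roughly as in the sketch
>> below; "remote cache (1 hop)" is just my label for PERF_MEM_LVL_REM_CCE1,
>> so correct me if I misread those bits:
>>
>> #include <linux/perf_event.h>
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> static void classify(uint64_t val)
>> {
>>     union perf_mem_data_src dsrc = { .val = val };
>>     uint64_t lvl = dsrc.mem_lvl;            /* memory level bit mask */
>>
>>     if (lvl & PERF_MEM_LVL_LOC_RAM)
>>         puts("local RAM");
>>     else if (lvl & (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2))
>>         puts("remote RAM (1 or 2 hops)");
>>     else if (lvl & (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2))
>>         puts("remote cache (1 or 2 hops)");
>>     else if (lvl & (PERF_MEM_LVL_L1 | PERF_MEM_LVL_LFB |
>>                     PERF_MEM_LVL_L2 | PERF_MEM_LVL_L3))
>>         puts("local cache");
>>     else
>>         puts("other / not available");
>> }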
>>
>> Thanks in advance for any suggestions you may have to help me
>> understand what's happening here.
>>
>> --
>> Manu