Hi all,

Today I followed up on my investigation of this subject.

Concerning the first question about remote cache samples, I am still not able to understand why I get this kind of event instead of local RAM accesses. I modified my memory allocation function to use the numa_alloc functions, to be sure the memory is physically allocated where I want (a sketch of the allocation is below). When the memory is allocated on the same node as the thread running my program, I still get remote cache accesses and no RAM accesses. When it is allocated on another node, I get remote RAM accesses, as expected. Does anybody have an explanation of the exact meaning of these remote cache events that could explain why I get them?
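The allocation boils down to something like this (a minimal sketch: node 0 is hard-coded for the example, and I use numa_alloc_onnode here to stand for the numa_alloc variant in use; the 120 MB size is the benchmark default):

#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <string.h>

#define ARRAY_SIZE (120UL * 1024 * 1024)  /* 120 MB, the benchmark default */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Physically allocate the whole array on node 0, the node the
       benchmark thread is pinned to (node number hard-coded here). */
    char *buf = numa_alloc_onnode(ARRAY_SIZE, 0);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Touch every page so it is actually faulted in on node 0
       before sampling starts. */
    memset(buf, 0, ARRAY_SIZE);

    numa_free(buf, ARRAY_SIZE);
    return 0;
}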
Regarding the number of samples, I first checked the number of memory loads generated by the sampled function (statically, looking at the code generated by gcc). I then compared this number with the core event MEM_INST_RETIRED.LOADS; they are very close (measured = 15001804, expected = 15001804). The difference must come from the loads generated by the start/stop ioctl calls and maybe other things happening behind the scenes that I missed; anyway, this is coherent.

With a sampling period of 1000 events, I only get 455 samples. For period = 5000 I get 91 samples, and for period = 10000, 45 samples. (Note that samples * period stays roughly constant: 455 * 1000, 91 * 5000 and 45 * 10000 are all about 455,000, so the latency event itself seems to fire about 455,000 times, far less often than MEM_INST_RETIRED.LOADS.) So my question is: what is the reason for this low number of samples?

1- Do some memory accesses have a latency smaller than the 3-cycle threshold and are thus not counted at all?
2- Is it related to the time taken by the PMI handler being too large? My benchmarking code is only doing loads, with no computation at all.
3- Any other idea?
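For reference, the event is opened roughly like this (a simplified sketch: LOAD_LATENCY_RAW_CONFIG is a placeholder for the CPU-specific raw encoding of MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD, obtained e.g. through libpfm, and I am assuming the kernel's ldlat format where config1 carries the latency threshold):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* Placeholder: the raw encoding of the load latency event is
   CPU-specific; obtain it e.g. through libpfm. */
#define LOAD_LATENCY_RAW_CONFIG 0x0

static int open_load_latency_event(pid_t pid, int cpu, __u64 period)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = LOAD_LATENCY_RAW_CONFIG;
    attr.config1 = 3;               /* latency threshold in cycles (ldlat),
                                       the minimal one per Intel's doc */
    attr.sample_period = period;    /* 1000, 5000 or 10000 in my tests */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                       PERF_SAMPLE_DATA_SRC;
    attr.precise_ip = 2;            /* PEBS sampling */
    attr.disabled = 1;              /* started later with an ioctl */
    attr.exclude_kernel = 1;

    return (int)syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
}

The fd is then enabled and disabled with PERF_EVENT_IOC_ENABLE / PERF_EVENT_IOC_DISABLE around the access loop, which is where the extra loads mentioned above come from.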
Manu

PS: Because this list is called perf-users, maybe I should be asking this kind of question on another list. If so, please let me know where. Thanks.

2014/1/17 Manuel Selva <selva.man...@gmail.com>:
> Hi all,
>
> I wrote a benchmarking program in order to play with perf_event_open's
> memory sampling capabilities (as discussed earlier on this list).
>
> My benchmark allocates a large array of memory (120 MB by default), then I
> start sampling with perf_event_open and the
> MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD event (with threshold = 3, the
> minimal one according to Intel's doc), along with memory access counting
> (uncore event QMC_NORMAL_READ.ANY), and I access all the allocated memory
> either sequentially or randomly. The benchmark is single-threaded, pinned
> to a given core, and memory is allocated on the node associated with this
> core using the numa_alloc functions. perf_event_open is thus called to
> monitor only this core/NUMA node. The code is compiled without any
> optimization, and I tried as much as possible to increase the ratio of
> memory-access code vs. branching and other code.
>
> I can clearly see from the QMC_NORMAL_READ.ANY event count that my random
> access test case generates far more memory accesses than the sequential
> one (I guess the prefetcher and sequential accesses are responsible for
> that).
>
> Regarding the sampled event, I successfully mmap the result of
> perf_event_open and I am able to read the samples. My problem is that even
> for the random access test case on the 120 MB, I don't get any samples
> served from RAM (I am using the PERF_SAMPLE_DATA_SRC field; the sampling
> period is 1000 events and I only get ~700 samples). Nevertheless, in the
> sequential case 0.01% of my samples are remote cache accesses (1 hop),
> whereas this percentage is ~20% in the random case.
>
> So my questions are:
>
> - What exactly is remote cache (1 hop)? Data found in another core's
> private cache (L1 or L2 in my case) on the same processor?
>
> - How can I interpret my results? I was expecting to see local memory
> access samples increase in the random case, instead of remote cache (1
> hop). How can I get remote cache accesses with malloc'ed data that is not
> shared and is used by a thread pinned to a given core?
>
> - I didn't yet look at this in detail, but ~700 samples for accessing
> 120 MB while sampling every 1000 events seems small. I am going to check
> the generated assembly code and also count the total number of memory
> requests (a core event) and compare that to the 700 samples * 1000 events.
>
> Thanks in advance for any suggestions you may have to help me understand
> what's happening there.
>
> --
> Manu
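PS2: Since the question is partly about what PERF_SAMPLE_DATA_SRC reports, this is how I classify each sample. It's a minimal sketch using the PERF_MEM_LVL_* bits from linux/perf_event.h; as far as I can tell, PERF_MEM_LVL_REM_CCE1 is the bit behind the "remote cache (1 hop)" label:

#include <linux/perf_event.h>

/* Classify one sample from the __u64 data_src word read out of the
   mmap'ed ring buffer. */
static const char *mem_lvl_str(__u64 src)
{
    union perf_mem_data_src ds;

    ds.val = src;
    if (ds.mem_lvl & PERF_MEM_LVL_L1)       return "L1";
    if (ds.mem_lvl & PERF_MEM_LVL_LFB)      return "LFB (fill buffer)";
    if (ds.mem_lvl & PERF_MEM_LVL_L2)       return "L2";
    if (ds.mem_lvl & PERF_MEM_LVL_L3)       return "L3";
    if (ds.mem_lvl & PERF_MEM_LVL_LOC_RAM)  return "local RAM";
    if (ds.mem_lvl & PERF_MEM_LVL_REM_CCE1) return "remote cache (1 hop)";
    if (ds.mem_lvl & PERF_MEM_LVL_REM_CCE2) return "remote cache (2 hops)";
    if (ds.mem_lvl & PERF_MEM_LVL_REM_RAM1) return "remote RAM (1 hop)";
    if (ds.mem_lvl & PERF_MEM_LVL_REM_RAM2) return "remote RAM (2 hops)";
    if (ds.mem_lvl & PERF_MEM_LVL_UNC)      return "uncached";
    return "unknown";
}

If REM_CCE1 can also be set for data found in another core's cache on the same socket, that would match what I asked above, but I would appreciate confirmation of its exact meaning.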