Hi all,

Can anyone help with this subject, i.e. the low number of samples and the meaning of the remote cache samples?
Thanks,
Manu

2014-01-23 Manuel Selva <selva.man...@gmail.com>:
> Hi all,
>
> Today I followed up on my investigation of this subject.
>
> Concerning the first question, about remote cache samples, I am still
> not able to understand why I get this kind of event instead of local
> RAM accesses. I modified my memory allocation function to use the
> numa_alloc functions to be sure the memory is physically allocated
> where I want. When the memory is allocated on the same node as the
> thread running my program, I still have remote cache accesses and no
> RAM accesses. When it's on another node, I have remote RAM accesses as
> expected.
>
> Does anybody have an explanation of the exact meaning of these remote
> cache events that could explain why I get them?
>
> Regarding the number of samples, I first checked the number of
> memory loads generated by the sampled function (statically, by
> looking at the code generated by gcc). I then compared this number
> with the core event MEM_INST_RETIRED.LOADS; they are very close
> (measured = 15001804, expected = 15001804). Any difference must
> come from the loads generated by the start/stop ioctl calls and maybe
> other things happening behind the scenes that I am missing; anyway,
> this is coherent. With a sampling period of 1000 events, I only have
> 455 samples. For period = 5000 I have 91 samples, and for
> period = 10000, 45 samples.
>
> So my question is: what is the reason for this low number of samples?
>
> 1- Do some memory accesses have a latency smaller than the 3-cycle
> threshold, and are thus not counted at all?
>
> 2- Is there any relation to the time taken by the PMI handler being
> too large? My benchmarking code is only doing loads and no computation
> at all.
>
> 3- Any other idea?
>
> Manu
>
> PS: Because this list is called perf-users, I should maybe ask this
> kind of question on another list. If that is the case, please let me
> know where. Thanks.
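A quick arithmetic check on the figures quoted above, as a minimal sketch. Multiplying each sampling period by the number of samples obtained gives nearly the same product for all three periods. One possible reading, which is an assumption and not a confirmed diagnosis, is that the sampled latency event itself only counted about 455,000 times (roughly 3% of the retired loads), rather than samples being lost. The function and field names below are illustrative only.

```python
# Figures quoted in the message above; nothing here is newly measured.
TOTAL_LOADS = 15_001_804                      # MEM_INST_RETIRED.LOADS
OBSERVED = {1000: 455, 5000: 91, 10000: 45}   # sampling period -> samples


def summarize(total_loads, observed):
    """For each period, compare the observed sample count with two references.

    implied_events: events the sampling counter must have counted to emit
                    that many samples (period * samples, ignoring the
                    remainder below one period).
    naive_expected: samples one would get if every retired load counted.
    """
    rows = []
    for period, samples in sorted(observed.items()):
        implied = period * samples
        rows.append({
            "period": period,
            "samples": samples,
            "implied_events": implied,
            "naive_expected": total_loads // period,
            "share_of_loads": implied / total_loads,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(TOTAL_LOADS, OBSERVED):
        print("period={period:>6} samples={samples:>4} "
              "implied_events={implied_events:>7} "
              "naive_expected={naive_expected:>6} "
              "share_of_loads={share_of_loads:.2%}".format(**row))
```

If the product stays constant while the period changes, the sample count is scaling with the period as expected, which would point towards question 1 (few loads qualify for the event) more than question 2 (PMI handler overhead). This is only a consistency check on the reported numbers.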
>
> 2014/1/17 Manuel Selva <selva.man...@gmail.com>:
>> Hi all,
>>
>> I wrote a benchmarking program in order to play with the
>> perf_event_open memory sampling capabilities (as discussed earlier on
>> this list).
>>
>> My benchmark allocates a large array (120 MB by default), then I
>> start sampling with perf_event_open and the
>> MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD event (with threshold = 3,
>> the minimum according to Intel's documentation), along with memory
>> access counting (uncore event QMC_NORMAL_READ.ANY), and I access all
>> the allocated memory either sequentially or randomly. The benchmark is
>> single-threaded, pinned to a given core, and memory is allocated on
>> the node associated with this core using the numa_alloc functions.
>> perf_event_open is thus called to monitor only this core/NUMA node.
>> The code is compiled without any optimization, and I tried as much as
>> possible to increase the ratio of memory access code to branching and
>> other code.
>>
>> I can clearly see from the QMC_NORMAL_READ.ANY event counts that my
>> random access test case generates far more memory accesses than the
>> sequential one (I guess the prefetcher and sequential access are
>> responsible for that).
>>
>> Regarding the sampled event, I successfully mmap the result of
>> perf_event_open and I am able to read the samples. My problem is that
>> even for the random access test case on the 120 MB, I don't have any
>> samples served by RAM (I am using the PERF_SAMPLE_DATA_SRC field; the
>> sampling period is 1000 events and I only get ~700 samples).
>> Nevertheless, in the sequential case 0.01% of my samples are remote
>> cache accesses (1 hop), whereas this percentage is ~20% in the random
>> case.
>>
>> So my questions are:
>>
>> - What exactly is remote cache (1 hop)? Data found in another core's
>> private cache (L1 or L2 in my case) on the same processor?
>>
>> - How can I interpret my results? I was expecting local memory access
>> samples to increase in the random case, instead of remote cache (1
>> hop). How can I have remote cache accesses with malloc'ed data that
>> is not shared and is used by a thread pinned to a given core?
>>
>> - I haven't yet looked at this in detail, but ~700 samples for
>> accessing 120 MB with sampling every 1000 events seems small. I am
>> going to check the generated assembly code and also count the total
>> number of memory requests (a core event) and compare that to the
>> 700 samples * 1000 events.
>>
>> Thanks in advance for any suggestions you may have to help me
>> understand what's happening there.
>>
>> --
>> Manu
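For readers following the PERF_SAMPLE_DATA_SRC discussion above, here is a minimal sketch of how the 64-bit data source value can be split into its opcode, memory level and snoop fields. It assumes the bit layout and flag values of `union perf_mem_data_src` as declared in `<linux/perf_event.h>` (mem_op in the low 5 bits, then 14 bits of mem_lvl, then 5 bits of mem_snoop); later kernels add further fields that this sketch ignores. The function names and the example value are hypothetical; the value is built by hand and is not taken from the poster's run.

```python
# Bit positions of the first three fields of union perf_mem_data_src
# (assumed from <linux/perf_event.h>; check against your kernel headers).
OP_SHIFT, OP_BITS = 0, 5
LVL_SHIFT, LVL_BITS = 5, 14
SNOOP_SHIFT, SNOOP_BITS = 19, 5

MEM_OP = {0x01: "N/A", 0x02: "load", 0x04: "store",
          0x08: "prefetch", 0x10: "exec"}
MEM_LVL = {
    0x0001: "N/A", 0x0002: "hit", 0x0004: "miss",
    0x0008: "L1", 0x0010: "LFB", 0x0020: "L2", 0x0040: "L3",
    0x0080: "Local RAM",
    0x0100: "Remote RAM (1 hop)", 0x0200: "Remote RAM (2 hops)",
    0x0400: "Remote Cache (1 hop)", 0x0800: "Remote Cache (2 hops)",
    0x1000: "I/O", 0x2000: "Uncached",
}
MEM_SNOOP = {0x01: "N/A", 0x02: "none", 0x04: "hit",
             0x08: "miss", 0x10: "HitM"}


def _flags(value, shift, bits, names):
    """Extract one bitfield and return the names of the flags set in it."""
    field = (value >> shift) & ((1 << bits) - 1)
    return [name for bit, name in sorted(names.items()) if field & bit]


def decode_data_src(value):
    """Decode the op, level and snoop fields of a raw data_src value."""
    return {
        "op": _flags(value, OP_SHIFT, OP_BITS, MEM_OP),
        "level": _flags(value, LVL_SHIFT, LVL_BITS, MEM_LVL),
        "snoop": _flags(value, SNOOP_SHIFT, SNOOP_BITS, MEM_SNOOP),
    }


if __name__ == "__main__":
    # Hand-built example: a load that hit in a remote cache (1 hop)
    # with a snoop hit.
    example = 0x02 | ((0x02 | 0x0400) << LVL_SHIFT) | (0x04 << SNOOP_SHIFT)
    print(decode_data_src(example))
```

Printing the snoop field next to the level may help when interpreting "remote cache (1 hop)" samples, since the level alone does not say whether another cache supplied a clean or a modified line.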