Remote Cache Accesses (1 Hop)

Manuel Selva Fri, 17 Jan 2014 05:24:57 -0800

Hi all,

I wrote a benchmarking program in order to play with the perf_event_open memory sampling capabilities (as discussed earlier on this list).

My benchmark allocates a large array of memory (120 MB by default). I then start sampling with perf_event_open and the MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD event (with threshold = 3, the minimal one according to Intel's documentation), along with counting memory accesses (uncore event QMC_NORMAL_READ.ANY), and I access all the allocated memory either sequentially or randomly. The benchmark is single-threaded and pinned to a given core, and the memory is allocated on the NUMA node associated with this core using the numa_alloc functions. perf_event_open is thus called to monitor only this core/NUMA node. The code is compiled without any optimization, and I tried as much as possible to increase the ratio of memory access code to branching and other code.

I can clearly see from the QMC_NORMAL_READ.ANY event counts that my random access test case generates far more memory accesses than the sequential one (I guess the prefetcher and the sequential access pattern are responsible for that).

Regarding the sampled event, I successfully mmap the result of perf_event_open and I am able to read the samples. My problem is that even for the random access test case on the 120 MB, I don't have any samples served by the RAM (I am using the PERF_SAMPLE_DATA_SRC field). The sampling period is 1000 events and I only get ~700 samples. Nevertheless, in the sequential case 0.01% of my samples are remote cache accesses (1 hop), whereas this percentage is ~20% in the random case.
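For reference, here is a minimal sketch of how a PERF_SAMPLE_DATA_SRC value breaks down into memory levels. It is written in Python only for illustration (it is not my benchmark code); the shift, width and bit values mirror the PERF_MEM_LVL_* definitions in <linux/perf_event.h>, while the names decode_mem_lvl and level_histogram are hypothetical helpers.

```python
from collections import Counter

# Layout of the mem_lvl field inside union perf_mem_data_src
# (values mirrored from <linux/perf_event.h>).
PERF_MEM_LVL_SHIFT = 5    # mem_lvl starts after the 5-bit mem_op field
PERF_MEM_LVL_BITS = 14    # width of the mem_lvl field

MEM_LVL_NAMES = {
    0x0001: "N/A",
    0x0002: "hit",
    0x0004: "miss",
    0x0008: "L1",
    0x0010: "LFB",
    0x0020: "L2",
    0x0040: "L3",
    0x0080: "local RAM",
    0x0100: "remote RAM (1 hop)",
    0x0200: "remote RAM (2 hops)",
    0x0400: "remote cache (1 hop)",
    0x0800: "remote cache (2 hops)",
    0x1000: "I/O",
    0x2000: "uncached",
}


def decode_mem_lvl(data_src):
    """Return the names of all mem_lvl bits set in a raw data_src value."""
    lvl = (data_src >> PERF_MEM_LVL_SHIFT) & ((1 << PERF_MEM_LVL_BITS) - 1)
    return [name for bit, name in MEM_LVL_NAMES.items() if lvl & bit]


def level_histogram(data_src_values):
    """Count samples per memory level, ignoring the hit/miss/N/A qualifiers."""
    counts = Counter()
    for value in data_src_values:
        for name in decode_mem_lvl(value):
            if name not in ("hit", "miss", "N/A"):
                counts[name] += 1
    return counts
```

Feeding the data_src field of every sample to level_histogram gives the per-level counts from which percentages like the ones above can be computed.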


So my questions are:

- What exactly is remote cache (1 hop)? Data found in another core's private cache (L1 or L2 in my case) on the same processor?

- How can I interpret my results? I was expecting the number of local memory access samples to increase in the random case, rather than remote cache (1 hop) samples. How can I get remote cache accesses with malloc'ed data that is not shared and is used by a thread pinned to a given core?

- I haven't looked at this in detail yet, but ~700 samples for accessing 120 MB with one sample every 1000 events seems small. I am going to check the generated assembly code and also count the total number of memory requests (a core event) and compare that to the 700 samples * 1000 events.
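To put rough numbers on that last point, here is a back-of-the-envelope sketch. It assumes that "120 MB" means 120 * 1024 * 1024 bytes and that cache lines are 64 bytes; the function names are hypothetical and only serve the arithmetic.

```python
ARRAY_BYTES = 120 * 1024 * 1024   # assumption: 120 MB counted as MiB
CACHE_LINE_BYTES = 64             # assumption: 64-byte cache lines
SAMPLE_PERIOD = 1000              # one sample every 1000 events
OBSERVED_SAMPLES = 700            # approximate number of samples I get


def events_covered(samples, period):
    """Number of events accounted for by a given number of samples."""
    return samples * period


def cache_lines(array_bytes, line_bytes):
    """Number of distinct cache lines spanned by the array."""
    return array_bytes // line_bytes


covered = events_covered(OBSERVED_SAMPLES, SAMPLE_PERIOD)
lines = cache_lines(ARRAY_BYTES, CACHE_LINE_BYTES)
print("events covered by the samples:", covered)
print("cache lines in the array:", lines)
print("samples expected if each line caused one event:", lines // SAMPLE_PERIOD)
```

Under these assumptions the 700 samples account for 700,000 events, while the array spans 1,966,080 cache lines, so one qualifying event per cache line would already give close to 2000 samples.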

Thanks in advance for any suggestions you may have to help me understand what's happening here.


--
Manu