Greetings,

I have been trying to compare some memory simulation results for my research with results from hardware performance counters, using libpfm4. My machine is an Intel Core2Duo (T7200, 4MB L2 cache), which seems to be somewhat similar to the one used in Stéphane Eranian's paper "What can performance counters do for memory subsystem analysis?". Since the paper uses the old version of the library, I wanted to make sure, as a sanity check, that I could get similar results with libpfm4.
One of the things I am interested in is extremely cache-unfriendly workloads, and the mcol example in the paper seemed like a perfect fit, but so far I have been unable to replicate the 99.7% miss rate reported in the paper. Since only the meaty bits of the program were in the paper, I inferred the rest and wrote a (not quite correct) hack for the difftv_usec function. I've attached my C code.

When I compile this program with gcc4 and run it using perf_examples/task, I get the following result:

$ ./task -e LLC_MISSES,LLC_REFERENCES ./mcol
Allocating 16 MiB for 1024x1024 matrix
1677.72MiB/s
2,126,459 LLC_MISSES (4,739,259,550 : 4,739,259,550)
131,373,392 LLC_REFERENCES (4,739,259,550 : 4,739,259,550)

That works out to a miss rate of roughly 1.6%, which is obviously a far cry from the 99.7% that I'm expecting. One explanation I could imagine is that the prefetcher is doing a good job of bringing in cache lines, since everything is stride 1, but I'm wondering why these results would be different from the paper's.

Could someone help me understand what is going on here, or whether I'm doing something wrong?

Thank you,
Paul
#include <sys/time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOOP 500

typedef struct _complex_t {
    uint64_t real, img;
} complex_t;

/* Elapsed time from tv1 to tv2, in microseconds. */
double difftv_usec(struct timeval *tv1, struct timeval *tv2)
{
    return ((double)(tv2->tv_sec - tv1->tv_sec) * 1000000.0)
         + (double)(tv2->tv_usec - tv1->tv_usec);
}

/* Sum the matrix NLOOP times in row order (stride 1) and print throughput. */
void mat_line(complex_t *m, int nr, int nc)
{
    uint64_t real = 0, img = 0, nloop = NLOOP;
    struct timeval tv1, tv2;
    double mat_size, d_usecs;
    int i, j;

    gettimeofday(&tv1, NULL);
    while (nloop--) {
        for (i = 0; i < nr; i++) {
            for (j = 0; j < nc; j++) {
                real += m[j + nc * i].real;
                img += m[j + nc * i].img;
            }
        }
    }
    gettimeofday(&tv2, NULL);

    d_usecs = difftv_usec(&tv1, &tv2);
    /* Total bytes read; computed in double to avoid integer overflow. */
    mat_size = (double)NLOOP * sizeof(complex_t) * nr * nc;
    printf("%.2fMiB/s\n", mat_size / d_usecs);
}

/* Allocate an nr x nc matrix and fill it with pseudo-random values. */
complex_t *gen_mat(int nr, int nc)
{
    size_t bytes = (size_t)nr * nc * sizeof(complex_t);
    int i, j;

    srand(123);
    printf("Allocating %zu MiB for %dx%d matrix\n", bytes / 1024 / 1024, nr, nc);
    complex_t *m = (complex_t *)malloc(bytes);
    if (!m) {
        printf("Failed to allocate\n");
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < nr; i++) {
        for (j = 0; j < nc; j++) {
            m[j + nc * i].real = rand();
            m[j + nc * i].img = rand();
        }
    }
    return m;
}

int main(void)
{
    int nr = 1024, nc = 1024;
    complex_t *m = gen_mat(nr, nc);

    mat_line(m, nr, nc);
    free(m);
    return 0;
}
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel