[perfmon2] Replicating 'What can performance counters do for memory subsystem analysis?' results

DRAM Ninjas Mon, 13 Sep 2010 14:00:15 -0700

Greetings,

I have been trying to compare some memory simulation results for my research
with results from hardware performance counters collected using libpfm4. As a
sanity check, I wanted to validate the counter values I am getting on an Intel
Core2Duo (T7200, 4MB L2 cache), which seems to be somewhat similar to the one
used in Stéphane Eranian's paper "What can performance counters do for memory
subsystem analysis?". Since the paper uses the old version of the library, I
wanted to make sure that I could get similar results with libpfm4.


One of the things I am interested in is extremely cache unfriendly workloads
and the mcol example in the paper seemed like a perfect fit, but so far I
have been unable to replicate the 99.7% miss rate reported in the paper.
Since only the meaty bits of the program were in the paper, I inferred the
rest and wrote a (not quite correct) hack for the difftv_usec function. I've
attached my C code below.

When I compile this program with gcc4 and run it using perf_examples/task, I
get the following result:

$ ./task -e LLC_MISSES,LLC_REFERENCES ./mcol
Allocating 16 MiB for 1024x1024 matrix
1677.72MiB/s

           2,126,459 LLC_MISSES (4,739,259,550 : 4,739,259,550)
         131,373,392 LLC_REFERENCES (4,739,259,550 : 4,739,259,550)

That works out to a miss rate of roughly 1.6%, which is obviously a far cry
from the 99.7% that I'm expecting. One explanation I could imagine is that the
prefetcher is doing a good job of bringing in cache lines since everything is
stride 1, but I'm wondering why these results differ from those in the paper.

Could someone help me understand what is going on here or if I'm doing
something wrong?

Thank you,
Paul

#include <sys/time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOOP 500
typedef struct _complex_t 
{ 
	uint64_t real, img;
} complex_t;
/* Elapsed time from tv1 to tv2, in microseconds. */
double difftv_usec (struct timeval *tv1, struct timeval *tv2)
{
	return ((double)(tv2->tv_sec - tv1->tv_sec)*1000000.0)+(double)((tv2->tv_usec - tv1->tv_usec));
}

void mat_line(complex_t *m, int nr, int nc)
{
	uint64_t real = 0, img = 0, nloop=NLOOP;
	struct timeval tv1, tv2;
	double mat_size, d_usecs;
	int i,j;
	gettimeofday(&tv1, NULL);
	while (nloop--) {
		for(i=0; i < nr; i++) {
			for(j=0; j < nc; j++) {
				real += m[j+nc*i].real;
				img += m[j+nc*i].img;
			}
		}
	}
	gettimeofday(&tv2, NULL);
	d_usecs = difftv_usec(&tv1, &tv2);
	mat_size = NLOOP*sizeof(complex_t)*nr*nc;
	printf("%.2fMiB/s\n", mat_size/d_usecs);
}

complex_t *gen_mat(int nr, int nc)
{
	srand(123); 
	printf("Allocating %zu MiB for %dx%d matrix\n",(size_t)nr*nc*sizeof(complex_t)/1024/1024,nr,nc);
	complex_t *m = (complex_t*)malloc(nc*nr*sizeof(complex_t)); 
	if (!m) {
		printf("Failed to allocate\n");
		exit(-1);
	}
	int i,j;
	for (i=0; i<nr; i++)
	{
		for (j=0; j<nc; j++)
		{
			m[j+nc*i].real = rand(); 
			m[j+nc*i].img = rand(); 
		}
	}
	return m;
}

int main() {
	int nr=1024, nc=1024; 
	complex_t *m = gen_mat(nr,nc); 
	mat_line(m, nr, nc);
}

_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
