[perfmon] Re: PAPI FLOPS on AMD

Stephane Eranian Tue, 23 May 2006 06:04:23 -0700

Phil,

On Tue, May 23, 2006 at 02:59:19PM +0200, Philip Mucci wrote:
> Hi Stephane,
> 
> Ok, first off try it with doubles. That convert operation probably
> happens 'in the FP pipe' therefore it is counted.
> 
That's what I do at the bottom of my message and there I get the right answer.


> Since little or no HPC people use single precision (that we've worked
> with) we haven't received these reports. But it would make sense for
> that convert to be done in the FP pipes...
> 
I suspect that some FP resource is used for the conversions back and forth.

> Are your events the same as ours, as far as register encodings?
> 
Yes, it's just that the current libpfm setup for AMD does not have unit mask
combinations (no combined ADD_MULTIPLY). I think Kevin is working on this.
That would save one counter in this calculation.

> Our test cases produce the expected values...but we don't have a test
> case as below.
> 
For doubles I believe you.


> 
>  On Tue, 2006-05-23 at 05:29 -0700, Stephane Eranian wrote:
> > Phil,
> > 
> > On Tue, May 23, 2006 at 11:43:08AM +0200, Philip Mucci wrote:
> > > Hi Stephane,
> > > 
> > > It sure can...in a number of ways...But I believe the SSE/SSE2 counting
> > > isn't as accurate as one might like...It was the Athlon which AMD blew
> > > it on...no FP counter!
> > > 
> > > However, you have 3 choices.
> > >    PNE_OPT_FP_ADD_PIPE
> > >    PNE_OPT_FP_MULT_PIPE,
> > >    PNE_OPT_FP_MULT_AND_ADD_PIPE,
> > > 
> > > Event 0x100, 0x200 and 0x300
> > 
> > Well, there are things I don't understand here. Let's take
> > this simple program:
> > 
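> > #include <sys/types.h>
> > #include <stdio.h>
> > #include <stdlib.h>   /* strtoul */
> > int main(int argc, char **argv)
> > {
> >     unsigned long i, n;
> >     float f=4;
> > 
> >     n = strtoul(argv[1], NULL, 0);   /* iteration count from the command line */
> >     for(i=0; i < n; i++) {
> >             f+=1.9;   /* 1.9 is a double constant: f is widened, added, narrowed back */
> >     }
> >     printf("f=%g\n", f);
> >     return 0;
> > }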
> > Compiled with: cc float.c -o float -O3 -mtune=opteron -mcpu=opteron
> > 
> > The loop generates the following code:
> >   400530:       cvtss2sd (%rsp),%xmm0
> >   400535:       dec    %rax
> >   400538:       addsd  %xmm1,%xmm0
> >   40053c:       cvtsd2ss %xmm0,%xmm2
> >   400540:       movss  %xmm2,(%rsp)
> >   400545:       jne    400530 <main+0x30>
> > 
> > With pfmon, I do 100,000,000 iterations:
> > $ pfmon --trigger-code-start=main --trigger-code-stop=main --us-c -u \
> >     -e cpu_clk_unhalted,retired_instructions,DISPATCHED_FPU_OPS_ADD,DISPATCHED_FPU_OPS_MULTIPLY \
> >     float 100000000
> > 2,308,816,413 CPU_CLK_UNHALTED
> >   600,006,705 RETIRED_INSTRUCTIONS
> >   150,002,866 DISPATCHED_FPU_OPS_ADD
> >    50,000,979 DISPATCHED_FPU_OPS_MULTIPLY
> > 
> > I don't understand where those MULTIPLY ops come from. There are also
> > 50,000,000 extra additions.
> > 
> > In contrast, if I use double (instead of float) and compile the same way,
> > I get the following code:
> >   400530:       movlpd (%rsp),%xmm1
> >   400535:       dec    %rax
> >   400538:       addsd  %xmm0,%xmm1
> >   40053c:       movsd  %xmm1,(%rsp)
> >   400541:       jne    400530 <main+0x30>
> > 
> > And pfmon yields:
> > 
> > 1,163,126,715 CPU_CLK_UNHALTED
> >   500,005,961 RETIRED_INSTRUCTIONS
> >   100,001,398 DISPATCHED_FPU_OPS_ADD
> >             4 DISPATCHED_FPU_OPS_MULTIPLY
> > 
> > As such, I am inclined to believe that the cvt instructions are the cause
> > of this extra "noise". It may be coming from the way they are actually
> > implemented.
> > 
> > It seems difficult to compute FLOPS on Opteron. I do not quite understand
> > the PIPE versions of those events.
> > 
> > Any clue?
> > 

-- 

-Stephane
_______________________________________________
perfmon mailing list
[email protected]
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/
