Phil,
On Tue, May 23, 2006 at 02:59:19PM +0200, Philip Mucci wrote:
> Hi Stephane,
>
> Ok, first off, try it with doubles. That convert operation probably
> happens 'in the FP pipe', and therefore it is counted.
>
That's what I do at the bottom of my message, and there I get the right answer.
> Since few or none of the HPC people we've worked with use single
> precision, we haven't received these reports. But it would make sense for
> that convert to be done in the FP pipes...
>
I suspect that some FP resource is used for the conversions back and forth.
> Are your events the same as ours, as far as register encodings?
>
Yes, it's just that the current libpfm setup for AMD does not have unit mask
combinations (no combined ADD_MULTIPLY). I think Kevin is working on this.
That would save one counter in this calculation.
> Our test cases produce the expected values...but we don't have a test
> case as below.
>
For doubles I believe you.
>
> On Tue, 2006-05-23 at 05:29 -0700, Stephane Eranian wrote:
> > Phil,
> >
> > On Tue, May 23, 2006 at 11:43:08AM +0200, Philip Mucci wrote:
> > > Hi Stephane,
> > >
> > > It sure can...in a number of ways...But I believe the SSE/SSE2 counting
> > > isn't as accurate as one might like...It was the Athlon which AMD blew
> > > it on...no FP counter!
> > >
> > > However, you have 3 choices.
> > > PNE_OPT_FP_ADD_PIPE
> > > PNE_OPT_FP_MULT_PIPE
> > > PNE_OPT_FP_MULT_AND_ADD_PIPE
> > >
> > > Events 0x100, 0x200 and 0x300, respectively.
> >
> > Well, there are things I don't understand here. Let's take
> > this simple program:
> >
> > #include <sys/types.h>
> > #include <stdio.h>
> > #include <stdlib.h>   /* strtoul */
> > int main(int argc, char **argv)
> > {
> >     unsigned long i, n;
> >     float f = 4;
> >
> >     n = strtoul(argv[1], NULL, 0);
> >     for (i = 0; i < n; i++) {
> >         f += 1.9;   /* 1.9 is a double constant */
> >     }
> >     printf("f=%g\n", f);
> >     return 0;
> > }
> > Compiled with: cc float.c -o float -O3 -mtune=opteron -mcpu=opteron
> >
> > The loop generates the following code:
> > 400530: cvtss2sd (%rsp),%xmm0
> > 400535: dec %rax
> > 400538: addsd %xmm1,%xmm0
> > 40053c: cvtsd2ss %xmm0,%xmm2
> > 400540: movss %xmm2,(%rsp)
> > 400545: jne 400530 <main+0x30>
> >
> > With pfmon, I do 100,000,000 iterations:
> > $ pfmon --trigger-code-start=main --trigger-code-stop=main --us-c -u \
> >     -e cpu_clk_unhalted,retired_instructions,DISPATCHED_FPU_OPS_ADD,DISPATCHED_FPU_OPS_MULTIPLY \
> >     float 100000000
> > 2,308,816,413 CPU_CLK_UNHALTED
> > 600,006,705 RETIRED_INSTRUCTIONS
> > 150,002,866 DISPATCHED_FPU_OPS_ADD
> > 50,000,979 DISPATCHED_FPU_OPS_MULTIPLY
> >
> > I don't understand where those MULTIPLY ops come from. There are also
> > 50,000,000 extra ADD ops.
> >
> > In contrast, if I use double (instead of float) and compile the same way,
> > I get the following code:
> > 400530: movlpd (%rsp),%xmm1
> > 400535: dec %rax
> > 400538: addsd %xmm0,%xmm1
> > 40053c: movsd %xmm1,(%rsp)
> > 400541: jne 400530 <main+0x30>
> >
> > And pfmon yields:
> >
> > 1,163,126,715 CPU_CLK_UNHALTED
> > 500,005,961 RETIRED_INSTRUCTIONS
> > 100,001,398 DISPATCHED_FPU_OPS_ADD
> > 4 DISPATCHED_FPU_OPS_MULTIPLY
> >
> > As such, I am inclined to believe that the cvt instructions are the cause
> > of this extra "noise". It may be coming from the way they are actually
> > implemented.
> >
> > It seems difficult to compute FLOPS on Opteron. I do not quite understand
> > the PIPE versions of those events.
> >
> > Any clue?
> >
--
-Stephane