Re: [perfmon2] FLOPS on Nehalem

stephane eranian Wed, 07 Oct 2009 10:18:54 -0700

Vincent,

Hugh is right!
Be careful: on Core i7, these events count micro-ops, not instructions.


Other users have also reported variations in the number of
micro-ops reported for the same instruction. It depends on
the floating point values passed and whether or not they
reach the limit of their types (e.g., denormals).

As for PFM_NHM_SEL_ANYTHR, it is not mandatory at all.
In fact, you probably don't want to use it. If you run pfmon
on all logical cores (without --cpu-list), then you can compute
the total FLOPS by adding up the per-CPU counts. Alternatively,
you can use the --aggr option to have pfmon do it for you.
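
As a rough sketch of the manual aggregation, assuming the per-CPU counts
have already been read into a vector (the helper names below are
hypothetical and are not part of pfmon or libpfm):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: adds up counts that were read separately on each
// logical CPU over the same measurement interval.
std::uint64_t aggregate_counts(const std::vector<std::uint64_t>& per_cpu_counts) {
    return std::accumulate(per_cpu_counts.begin(), per_cpu_counts.end(),
                           std::uint64_t{0});
}

// Hypothetical helper: converts a count collected over dt_seconds into a
// rate in units of 1e9 events per second.
double giga_events_per_second(std::uint64_t count, double dt_seconds) {
    assert(dt_seconds > 0.0);  // a zero or negative interval is a caller bug
    return static_cast<double>(count) / dt_seconds / 1e9;
}
```

Keep in mind that the resulting rate is in micro-ops per second, so it is
only an approximation of a true FLOPS figure.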


On Wed, Oct 7, 2009 at 4:50 PM, Caffey, Hugh M <[email protected]> wrote:
> Hi -
>
> First, note that these events count micro-operations (not full "macro"
> instructions) that *executed* in the floating-point unit (but did not
> necessarily retire). (The event names used below may not be exactly the
> same as those used by perfmon2.)
>
> At the highest level on Core i7, total micro-operations executed in the FPU =
>  ( FP_COMP_OPS_EXE.X87 + FP_COMP_OPS_EXE.MMX +
>    FP_COMP_OPS_EXE.SSE_FP + FP_COMP_OPS_EXE.SSE2_INTEGER )
> (If you only care about actual f-p operations, omit the .SSE2_INTEGER event.)
>
>
> If you want more detail specifically about SSE f-p operations,
> note the following relationships:
>  FP_COMP_OPS_EXE.SSE_FP = FP_COMP_OPS_EXE.SSE_FP_PACKED ("vector" operations) +
>                          FP_COMP_OPS_EXE.SSE_FP_SCALAR
>    also:
>  FP_COMP_OPS_EXE.SSE_FP = FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION +
>                          FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION
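
The formula and relationships quoted above can be sketched as follows. This
is only an illustration: the FpUopCounts struct and the function names are
hypothetical (not libpfm types), and the counts are assumed to have been
collected already over the same interval. It also shows why simply adding
all eight sub-events overestimates the total: if the two relationships
hold, SSE_FP ends up counted three times.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the FP_COMP_OPS_EXE sub-event counts.
struct FpUopCounts {
    std::uint64_t x87, mmx, sse_fp, sse2_integer;
    std::uint64_t sse_fp_packed, sse_fp_scalar;
    std::uint64_t sse_single_precision, sse_double_precision;
};

// Total micro-ops executed in the FPU, following the quoted formula.
// Pass include_sse2_integer = false to keep only floating-point uops.
std::uint64_t total_fpu_uops(const FpUopCounts& c, bool include_sse2_integer) {
    std::uint64_t total = c.x87 + c.mmx + c.sse_fp;
    if (include_sse2_integer) {
        total += c.sse2_integer;
    }
    return total;
}

// Naive sum of all eight sub-events. Because SSE_FP = PACKED + SCALAR and
// SSE_FP = SINGLE + DOUBLE, this equals X87 + MMX + SSE2_INTEGER + 3 * SSE_FP
// whenever those relationships hold exactly.
std::uint64_t naive_sum_of_all(const FpUopCounts& c) {
    return c.x87 + c.mmx + c.sse_fp + c.sse2_integer +
           c.sse_fp_packed + c.sse_fp_scalar +
           c.sse_single_precision + c.sse_double_precision;
}
```

With real measurements the relationships will only hold approximately,
since the sub-events are usually not all counted at the same time.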
>
>
> Hope this helps.
>
> Hugh Caffey
>
>>-----Original Message-----
>>From: Dr. Vincent Keller
>>[mailto:[email protected]]
>>Sent: Wednesday, October 07, 2009 6:31 AM
>>To: [email protected]
>>Subject: [perfmon2] FLOPS on Nehalem
>>
>>Dear all,
>>
>>First of all, I'm quite a newbie with perfmon; I hope my 2 questions will
>>not be too stupid, and I apologize if they are. Before writing to
>>this mailing list, I googled and searched the list archive, without
>>success.
>>
>>I'm currently integrating a performance monitoring module into a C++
>>project. I need to get, in a system-wide mode, the GFlops rate of a
>>processor, core per core.
>>
>>To compute the GFlops (GigaFlops per second) rate, I count the FLOPs
>>during a time interval dt and integrate. I based my implementation on
>>an example provided with libpfm. To validate the monitored value, I
>>use two application kernels (a full matrix-matrix multiplication and a
>>Poisson solver; both use X87 and SSE floating-point operations) for which
>>I compute the exact number of FLOPs and the time.
>>
>>I tried my monitoring system on an Intel Core 2 Duo and on an Intel
>>Harpertown without any problem:
>>
>>Poisson (app using X87):
>>
>>vkel...@linpriv4:~/> ./mxv
>> read n1 =            5 n2 =         1999 nn =      4000000
>>           0        1999        1999     4000000          25
>>   4.67165751457214
>> Exact result:    35987998.0000000       sum=   35987998.0000000
>>  Mflop/s =   385.302217550264
>>
>>Poisson (monitored):
>>
>>[MM] Perf : 0.384496     [GFLOPS] for core0 at time 1254920517
>>[MM] Perf : 0.009804     [GFLOPS] for core1 at time 1254920517
>>
>>(it means that the app ran on core 0 at the correct rate)
>>
>>Matrix-Matrix multiplication (app using SSE2 instructions):
>>
>>vkel...@linpriv4:~/> ./mxm
>> size =   1000
>>       k               s          t     Mflop/s
>> kji   0    0.1000000000D+10 0.8655E+00 0.2311E+04
>>
>>Matrix-Matrix multiplication (monitored):
>>
>>[MM] Perf : 0.014436     [GFLOPS] for core0 at time 1254920742
>>[MM] Perf : 2.300236     [GFLOPS] for core1 at time 1254920742
>>
>>I used the event FP_COMP_OPS_EXE to measure the number of FLOPs and the
>>gettimeofday function for the timing.
>>
>>But when I turn to Intel Nehalem, things go wrong. First of all,
>>the event FP_COMP_OPS no longer exists. Instead:
>>
>>Umask-00 : 0x02 : [MMX] : MMX Uops
>>Umask-01 : 0x80 : [SSE_DOUBLE_PRECISION] : SSE* FP double precision Uops
>>Umask-02 : 0x04 : [SSE_FP] : SSE and SSE2 FP Uops
>>Umask-03 : 0x10 : [SSE_FP_PACKED] : SSE FP packed Uops
>>Umask-04 : 0x20 : [SSE_FP_SCALAR] : SSE FP scalar Uops
>>Umask-05 : 0x40 : [SSE_SINGLE_PRECISION] : SSE* FP single precision Uops
>>Umask-06 : 0x08 : [SSE2_INTEGER] : SSE2 integer Uops
>>Umask-07 : 0x01 : [X87] : Computational floating-point operations executed
>>
>>(pfmon -i FP_COMP_OPS)
>>
>>As far as I understood, each event fits in one HW counter (3 are
>>available on the nhm). My first idea is to sum all the values counted
>>for the 8 sub-events of FP_COMP_OPS:
>>
>>FLOPS = FP_COMP_OPS:MMX + FP_COMP_OPS:SSE_DOUBLE_PRECISION +
>>FP_COMP_OPS:SSE_FP + FP_COMP_OPS:SSE_FP_PACKED + FP_COMP_OPS:SSE_FP_SCALAR,
>>etc...
>>
>>So I measure each of the 8 events during dt/8, scale by 8, and sum:
>>
>>FLOPS = 0
>>do i = 1,8
>>       FLOPS = FLOPS + 8 * (event(i) counted during dt/8)
>>end do
>>FLOP_per_second = FLOPS/dt
>>
>>But the result is totally wrong :
>>
>>[MM] Perf : 0.000010     [GFLOPS] for core0 at time 1254920956
>>[MM] Perf : 0.000003     [GFLOPS] for core2 at time 1254920956
>>[MM] Perf : 7.164090     [GFLOPS] for core4 at time 1254920956
>>[MM] Perf : 0.000000     [GFLOPS] for core6 at time 1254920956
>>
>>for a "real" performance of
>>       k               s          t     Mflop/s
>> kji   0    0.1000000000D+10 0.4570E+00 0.4377E+04
>>
>>What is wrong? How can I measure the number of FLOPs using the
>>FP_COMP_OPS:MMX, FP_COMP_OPS:SSE_DOUBLE_PRECISION, FP_COMP_OPS:SSE_FP, etc.
>>values?
>>
>>Secondly, I have another problem (of affinity?) with the Nehalem. I
>>understood (thanks to
>>http://perfmon2.sourceforge.net/pfmon_intel_corei7.html) that it was
>>mandatory to specify the ANY_THREAD flag (for that I set the flag
>>PFM_NHM_SEL_ANYTHR in the pfmlib_nhm_counter_t structure) to avoid the
>>problem of HT (the Linux kernel "thinks" it has 2 physical cores instead of
>>one). But the problem still remains: it can happen that the module
>>measures 0 FLOPs when an application is running. My declaration is:
>>
>>memset(&mod_inp_nhm, 0, sizeof(mod_inp_nhm));
>>
>>for (int ctr = 0; ctr<PMU_NHM_NUM_COUNTERS;ctr++){
>>       mod_inp_nhm.pfp_nhm_counters[ctr].flags=PFM_NHM_SEL_ANYTHR;
>>}
>>
>>the mod_inp_nhm structure is then passed to the pfm_dispatch_events
>>function.
>>
>>And I measure the flops only when the core index is even:
>>
>>for (int k = 0 ; k < number_of_cores ; k++){
>>       uint64_t value_flops = 0UL;
>>       double gflops = 0.0;
>>       if (k%2==0){
>>               value_flops = mm->getFlops(k,dt);
>>       }
>>}
>>
>>What am I misunderstanding?
>>
>>Thanks in advance.
>>
>>Best regards
>>Vince
>>
>>--
>>---------------------------------------------------
>>Dr. Vincent KELLER
>>
>>Fraunhofer-Institut für Algorithmen
>>und Wissenschaftliches Rechnen SCAI
>>           http://scai.fraunhofer.de
>>ADDRESS:   Schloss Birlinghoven
>>           D - 53754 Sankt Augustin
>>           Germany
>>PHONE  :   + 49 (0) 2241/14-2280
>>FAX    :   + 49 (0) 2241/14-2258
>>E-MAIL :   [email protected]
>>---------------------------------------------------
>>
>>------------------------------------------------------------------------------
>>Come build with us! The BlackBerry(R) Developer Conference in SF, CA
>>is the only developer event you need to attend this year. Jumpstart your
>>developing skills, take BlackBerry mobile applications to market and stay
>>ahead of the curve. Join us from November 9 - 12, 2009. Register now!
>>http://p.sf.net/sfu/devconference
>>_______________________________________________
>>perfmon2-devel mailing list
>>[email protected]
>>https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
>>
>
