[perfmon2] FLOPS on Nehalem

Dr. Vincent Keller Wed, 07 Oct 2009 06:54:23 -0700

Dear all,

First of all, I'm quite new to perfmon. I hope my two questions are not
too stupid, and I apologize if they are. Before writing to this mailing
list, I searched Google and the list archive, without success.


I'm currently integrating a performance monitoring module into a C++
project. I need to get, in system-wide mode, the GFlops rate of a
processor, core by core.

To compute the GFlops (GigaFlop per second) rate, I count the FLOPs
during a time interval dt and integrate them over that interval. I based
my implementation on an example provided with libpfm. To validate the
monitored value, I use two application kernels (a full matrix-matrix
multiplication and a Poisson solver; both use X87 and SSE floating-point
operations) for which I compute the exact number of FLOPs and the run
time.

I tried my monitoring system on an Intel Core 2 Duo and on an Intel
Harpertown without any problem:

Poisson (app using X87):

vkel...@linpriv4:~/> ./mxv
 read n1 =            5 n2 =         1999 nn =      4000000
           0        1999        1999     4000000          25
   4.67165751457214
 Exact result:    35987998.0000000       sum=   35987998.0000000
  Mflop/s =   385.302217550264

Poisson (monitored):

[MM] Perf : 0.384496     [GFLOPS] for core0 at time 1254920517
[MM] Perf : 0.009804     [GFLOPS] for core1 at time 1254920517

(this means that the app ran on core 0 and was measured at the correct rate)

Matrix-Matrix multiplication (app using SSE2 instructions):

vkel...@linpriv4:~/> ./mxm
 size =   1000
       k               s          t     Mflop/s
 kji   0    0.1000000000D+10 0.8655E+00 0.2311E+04

Matrix-Matrix multiplication (monitored):

[MM] Perf : 0.014436     [GFLOPS] for core0 at time 1254920742
[MM] Perf : 2.300236     [GFLOPS] for core1 at time 1254920742
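
The comparison between the kernel's own figure and the monitored one can
be sketched as follows. This assumes the mxm kernel performs 2*n^3
floating-point operations for size n, which is consistent with the
Mflop/s value it prints; the function names are hypothetical:

```cpp
#include <cmath>

// Reference rate in Mflop/s for an n x n matrix-matrix product,
// assuming the kernel performs 2*n^3 floating-point operations.
double mxm_mflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e6;
}

// Relative deviation of a monitored rate from a reference rate.
// Both rates must be given in the same unit.
double relative_deviation(double monitored, double reference) {
    return std::fabs(monitored - reference) / reference;
}
```

With n = 1000 and t = 0.8655 s this gives about 2311 Mflop/s, and the
monitored 2.300236 GFLOPS on core 1 deviates from it by less than 1%.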

I used the event FP_COMP_OPS_EXE to count the FLOPs and the
gettimeofday function for the timing.

But when I turn to Intel Nehalem, things go wrong. First of all,
the event FP_COMP_OPS no longer exists. Instead:

Umask-00 : 0x02 : [MMX] : MMX Uops
Umask-01 : 0x80 : [SSE_DOUBLE_PRECISION] : SSE* FP double precision Uops
Umask-02 : 0x04 : [SSE_FP] : SSE and SSE2 FP Uops
Umask-03 : 0x10 : [SSE_FP_PACKED] : SSE FP packed Uops
Umask-04 : 0x20 : [SSE_FP_SCALAR] : SSE FP scalar Uops
Umask-05 : 0x40 : [SSE_SINGLE_PRECISION] : SSE* FP single precision Uops
Umask-06 : 0x08 : [SSE2_INTEGER] : SSE2 integer Uops
Umask-07 : 0x01 : [X87] : Computational floating-point operations executed

(pfmon -i FP_COMP_OPS)
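
Each unit-mask value in the listing is a single distinct bit, so the
listing can be written down as a set of flags. The sketch below only
restates those values; the enum and helper names are hypothetical, and
whether the PMU or libpfm accepts several of these masks combined on one
counter for this event is an assumption that is not confirmed here:

```cpp
#include <cstdint>

// Unit-mask bits copied from the pfmon listing above.
enum FpCompOpsUmask : std::uint8_t {
    UMASK_X87                  = 0x01,
    UMASK_MMX                  = 0x02,
    UMASK_SSE_FP               = 0x04,
    UMASK_SSE2_INTEGER         = 0x08,
    UMASK_SSE_FP_PACKED        = 0x10,
    UMASK_SSE_FP_SCALAR        = 0x20,
    UMASK_SSE_SINGLE_PRECISION = 0x40,
    UMASK_SSE_DOUBLE_PRECISION = 0x80
};

// Hypothetical helper: OR two unit masks into one 8-bit value.
constexpr std::uint8_t combine_umasks(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a | b);
}
```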

As far as I understand, each event occupies one HW counter (3 are
available on the nhm). My first idea is to sum the values counted
for all 8 sub-events of FP_COMP_OPS:

FLOPS = FP_COMP_OPS:MMX + FP_COMP_OPS:SSE_DOUBLE_PRECISION +
FP_COMP_OPS:SSE_FP + FP_COMP_OPS:SSE_FP_PACKED +
FP_COMP_OPS:SSE_FP_SCALAR + etc.

So I measure each of the 8 events for dt/8, scale each count by 8, and
sum the results:

FLOPS = 0
do i = 1,8
        FLOPS = FLOPS + 8 * (count of event(i) during dt/8)
end do
FLOP_per_second = FLOPS/dt
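
The same scaling written as a small C++ sketch, assuming every
sub-event gets an equal share of dt. The function name and the vector
of raw counts are hypothetical and only illustrate the arithmetic; this
is not how libpfm itself handles multiplexing:

```cpp
#include <cstdint>
#include <vector>

// Each entry of counts is the raw count of one sub-event, measured during
// its own slice of length dt_seconds / counts.size().
double multiplexed_flop_rate(const std::vector<std::uint64_t>& counts,
                             double dt_seconds) {
    if (counts.empty() || dt_seconds <= 0.0) {
        return 0.0;
    }
    const double scale = static_cast<double>(counts.size());
    double flop = 0.0;
    for (std::uint64_t c : counts) {
        // Extrapolate the count of one slice to the full interval.
        flop += scale * static_cast<double>(c);
    }
    return flop / dt_seconds;  // FLOP per second
}
```

With 8 sub-events that each count 125 during their slice and dt = 1 s,
the function returns 8 * 8 * 125 = 8000 FLOP per second.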

But the result is totally wrong:

[MM] Perf : 0.000010     [GFLOPS] for core0 at time 1254920956
[MM] Perf : 0.000003     [GFLOPS] for core2 at time 1254920956
[MM] Perf : 7.164090     [GFLOPS] for core4 at time 1254920956
[MM] Perf : 0.000000     [GFLOPS] for core6 at time 1254920956

for a "real" performance of
       k               s          t     Mflop/s
 kji   0    0.1000000000D+10 0.4570E+00 0.4377E+04

What is wrong? How can I measure the number of FLOPs using the
FP_COMP_OPS:MMX, FP_COMP_OPS:SSE_DOUBLE_PRECISION, FP_COMP_OPS:SSE_FP,
etc. values?

Secondly, I have another problem (of affinity?) with the Nehalem. I
understood (thanks to
http://perfmon2.sourceforge.net/pfmon_intel_corei7.html) that it is
mandatory to specify the ANY_THREAD flag (for that I set the flag
PFM_NHM_SEL_ANYTHR in the pfmlib_nhm_counter_t structure) to avoid the
problem of HT (the Linux kernel "thinks" it has 2 physical cores instead
of one). But the problem remains: it can happen that the module
measures 0 FLOPs while an application is running. My declaration is:

memset(&mod_inp_nhm, 0, sizeof(mod_inp_nhm));

for (int ctr = 0; ctr<PMU_NHM_NUM_COUNTERS;ctr++){
        mod_inp_nhm.pfp_nhm_counters[ctr].flags=PFM_NHM_SEL_ANYTHR;
}

The mod_inp_nhm structure is then passed to the pfm_dispatch_events
function.

And I read the flops only for even-numbered cores:

for (int k = 0 ; k < number_of_cores ; k++){
        uint64_t value_flops = 0UL;
        double gflops = 0.0;
        if (k%2==0){
                value_flops = mm->getFlops(k,dt);
        }
}
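
The k%2==0 test assumes that the two hardware threads of a core are
numbered k and k+1, which may not hold on every system. On Linux the
actual pairing can be read from
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The sketch
below parses the content of such a file; the function names and the
"lowest sibling" policy are hypothetical:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parses a Linux CPU list such as "0,4" or "0-1", the format used by
// thread_siblings_list, into the individual CPU numbers.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) continue;
        const std::size_t dash = item.find('-');
        if (dash == std::string::npos) {
            cpus.push_back(std::stoi(item));  // single CPU, e.g. "4"
        } else {
            // Inclusive range, e.g. "0-1"
            const int first = std::stoi(item.substr(0, dash));
            const int last = std::stoi(item.substr(dash + 1));
            for (int c = first; c <= last; ++c) cpus.push_back(c);
        }
    }
    return cpus;
}

// Hypothetical policy: sample a logical CPU only if it is the
// lowest-numbered hardware thread of its physical core.
bool is_first_sibling(int cpu, const std::vector<int>& siblings) {
    if (siblings.empty()) return true;
    int lowest = siblings[0];
    for (int s : siblings) {
        if (s < lowest) lowest = s;
    }
    return cpu == lowest;
}
```

If the file for cpu0 contains "0,4", then CPUs 0 and 4 share one
physical core, and only CPU 0 would be sampled under this policy.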

What am I misunderstanding?

Thanks in advance.

Best regards
Vince

-- 
---------------------------------------------------
Dr. Vincent KELLER

Fraunhofer-Institut für Algorithmen
und Wissenschaftliches Rechnen SCAI
           http://scai.fraunhofer.de
ADDRESS:   Schloss Birlinghoven
           D - 53754 Sankt Augustin
           Germany
PHONE  :   + 49 (0) 2241/14-2280
FAX    :   + 49 (0) 2241/14-2258
E-MAIL :   vincent.kel...@scai-extern.fraunhofer.de
---------------------------------------------------
