Dear all,

First of all, I'm quite a newbie with perfmon. I hope my two questions are not too stupid, and I apologize if they are. Before writing to this mailing list, I googled and searched the list archive, without success.

I'm currently integrating a performance monitoring module into a C++ project. I need to get, in system-wide mode, the GFLOPS rate of a processor, core by core. To compute the GFLOPS (giga floating-point operations per second) rate, I count the FLOPs during a time interval dt and divide by dt. I based my implementation on an example provided with libpfm.

To validate the monitored value, I use two application kernels (a full matrix-matrix multiplication and a Poisson solver; both use x87 and SSE floating-point operations) for which I compute the exact number of FLOPs and the run time. I tried my monitoring system on an Intel Core 2 Duo and on an Intel Harpertown without any problem:

Poisson (app using x87):

    vkel...@linpriv4:~/> ./mxv
    read n1 = 5 n2 = 1999 nn = 4000000 0 1999 1999 4000000 25 4.67165751457214
    Exact result: 35987998.0000000
    sum= 35987998.0000000
    Mflop/s = 385.302217550264

Poisson (monitored):

    [MM] Perf : 0.384496 [GFLOPS] for core0 at time 1254920517
    [MM] Perf : 0.009804 [GFLOPS] for core1 at time 1254920517

(This means that the app ran on core 0 and was measured at the correct rate.)

Matrix-matrix multiplication (app using SSE2 instructions):

    vkel...@linpriv4:~/> ./mxm
    size = 1000
    k s t Mflop/s
    kji 0 0.1000000000D+10 0.8655E+00 0.2311E+04

Matrix-matrix multiplication (monitored):

    [MM] Perf : 0.014436 [GFLOPS] for core0 at time 1254920742
    [MM] Perf : 2.300236 [GFLOPS] for core1 at time 1254920742

I used the event FP_COMP_OPS_EXE to count the FLOPs and the gettimeofday function for the timing.

But when I turn to Intel Nehalem, things go wrong. First of all, the event FP_COMP_OPS_EXE no longer exists.
Instead, pfmon -i FP_COMP_OPS lists these unit masks:

    Umask-00 : 0x02 : [MMX]                  : MMX Uops
    Umask-01 : 0x80 : [SSE_DOUBLE_PRECISION] : SSE* FP double precision Uops
    Umask-02 : 0x04 : [SSE_FP]               : SSE and SSE2 FP Uops
    Umask-03 : 0x10 : [SSE_FP_PACKED]        : SSE FP packed Uops
    Umask-04 : 0x20 : [SSE_FP_SCALAR]        : SSE FP scalar Uops
    Umask-05 : 0x40 : [SSE_SINGLE_PRECISION] : SSE* FP single precision Uops
    Umask-06 : 0x08 : [SSE2_INTEGER]         : SSE2 integer Uops
    Umask-07 : 0x01 : [X87]                  : Computational floating-point operations executed

As far as I understand, each event occupies one hardware counter (3 are available on the Nehalem). My first idea is to sum the values counted for all 8 sub-events of FP_COMP_OPS:

    FLOPS = FP_COMP_OPS:MMX + FP_COMP_OPS:SSE_DOUBLE_PRECISION + FP_COMP_OPS:SSE_FP
          + FP_COMP_OPS:SSE_FP_PACKED + FP_COMP_OPS:SSE_FP_SCALAR + etc.

So I measure each of the 8 events in turn for dt/8, scale each count by 8, and sum:

    FLOPS = 0
    do i = 1,8
       FLOPS = FLOPS + 8 * event(i)    (event(i) counted during dt/8)
    end do
    FLOP_per_second = FLOPS/dt

But the result is totally wrong:

    [MM] Perf : 0.000010 [GFLOPS] for core0 at time 1254920956
    [MM] Perf : 0.000003 [GFLOPS] for core2 at time 1254920956
    [MM] Perf : 7.164090 [GFLOPS] for core4 at time 1254920956
    [MM] Perf : 0.000000 [GFLOPS] for core6 at time 1254920956

for a "real" performance of

    k s t Mflop/s
    kji 0 0.1000000000D+10 0.4570E+00 0.4377E+04

What is wrong? How should I compute the number of FLOPs from the FP_COMP_OPS:MMX, FP_COMP_OPS:SSE_DOUBLE_PRECISION, FP_COMP_OPS:SSE_FP, etc. values?

Secondly, I have another problem (affinity-related?) with the Nehalem. I understood (thanks to http://perfmon2.sourceforge.net/pfmon_intel_corei7.html) that it is mandatory to set the ANY_THREAD flag (for that I set the PFM_NHM_SEL_ANYTHR flag in the pfmlib_nhm_counter_t structure) to avoid the hyper-threading problem (the Linux kernel "thinks" it has 2 physical cores instead of one). But the problem remains: it can happen that the module measures 0 FLOPs while an application is running.

My declaration is:

    memset(&mod_inp_nhm, 0, sizeof(mod_inp_nhm));
    // Request ANY_THREAD on every counter.
    for (int ctr = 0; ctr < PMU_NHM_NUM_COUNTERS; ctr++) {
        mod_inp_nhm.pfp_nhm_counters[ctr].flags = PFM_NHM_SEL_ANYTHR;
    }

The mod_inp_nhm structure is then passed to the pfm_dispatch_events function. I then read the FLOPs only for the even-numbered cores:

    for (int k = 0; k < number_of_cores; k++) {
        uint64_t value_flops = 0UL;
        double gflops = 0.0;
        // Only every second logical core is sampled.
        if (k % 2 == 0) {
            value_flops = mm->getFlops(k, dt);
        }
    }

What am I doing wrong, or what have I misunderstood?

Thanks in advance.

Best regards,
Vince

--
Dr. Vincent KELLER
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI
http://scai.fraunhofer.de
ADDRESS: Schloss Birlinghoven, D - 53754 Sankt Augustin, Germany
PHONE: + 49 (0) 2241/14-2280
FAX: + 49 (0) 2241/14-2258
E-MAIL: vincent.kel...@scai-extern.fraunhofer.de

_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
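
For readers who want to check the arithmetic of the first question independently of libpfm, here is a minimal sketch of the multiplexed estimate described there. It assumes each event is counted for only part of the interval dt, extrapolates every count to the full interval, sums the results and divides by dt. The names Slice, scaled_count and estimated_giga_rate are hypothetical and are not part of libpfm or pfmon. The sketch treats every counted uop as one operation, as the formula in the email does, and it does not check whether the unit masks overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One time slice of a round-robin (time-multiplexed) measurement.
// Hypothetical helper type, not a libpfm structure.
struct Slice {
    std::uint64_t count;  // raw counter value read at the end of the slice
    double seconds;       // how long this event was actually being counted
};

// Extrapolate one event's count from its slice to the full interval dt.
// With n equal slices of dt/n this is the "n * event(i)" scaling.
inline double scaled_count(const Slice& s, double dt) {
    assert(s.seconds > 0.0 && dt > 0.0);
    return static_cast<double>(s.count) * (dt / s.seconds);
}

// Sum the extrapolated counts of all multiplexed events and convert the
// total to a rate in units of 1e9 operations per second.
inline double estimated_giga_rate(const std::vector<Slice>& slices, double dt) {
    double total = 0.0;
    for (const Slice& s : slices) {
        total += scaled_count(s, dt);
    }
    return total / dt / 1e9;
}
```

With eight events each counted for dt/8, scaled_count reduces to 8 * event(i). For example, eight slices of 125000000 counts each, with dt = 1 s, give an estimate of 8.0.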