Hi Marc and Vince,

On-demand (lazy) dynamic linking is one possible reason. If a symbol is first 
resolved after counting starts, the dynamic linker code that resolves it 
contributes to the counts.

Please try linking your executable statically, run it again, and compare the 
result. Alternatively, prepend "env LD_BIND_NOW=on" to the command line (this 
form also works under csh).

This comes from our past experience, where some extra counts came from the 
dynamic runtime linker (ld.so). A minimal test program written in C using the 
PAPI API, which did no floating-point calculation, consistently reported 
non-zero values for PAPI_FP_INS, and the values varied from run to run. We 
traced it down to the dynamic loader ld.so and built a new, statically linked 
executable, which consistently reported 0 for the event. At the time, Rick 
Kufrin (at NCSA) and I reproduced the 0 count using any of the following 3 
methods:
1) static linking,
2) setting LD_BIND_NOW to "on",
3) warming up the dynamic linking in the code, by calling the functions once 
(and thus resolving their symbols) before calling PAPI_start().
So it's confirmed that the on-demand dynamic linking was the root cause for 
that scenario. This was on an x86-64 (Intel Xeon "Clovertown") Linux machine 
(NCSA's Abe cluster), to answer a TeraGrid user question on PAPI.
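
For method 3, a minimal sketch in C might look like the following. It is not 
our original test program: start_counting() and stop_counting() are 
hypothetical placeholders standing in for PAPI_start() and PAPI_stop() (so the 
sketch builds without PAPI installed), and region_of_interest(), warm_up() and 
measure() are illustrative names. It also assumes the executable is dynamically 
linked with lazy binding enabled; otherwise the warm-up is simply harmless.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical placeholders for PAPI_start()/PAPI_stop(). In a real program
 * these would be the PAPI calls on an event set; here they only toggle a
 * flag so that the sketch compiles without PAPI. */
static int counting_active = 0;
static void start_counting(void) { counting_active = 1; }
static void stop_counting(void)  { counting_active = 0; }

/* Region of interest: calls two shared-library functions (strlen, strtol). */
static long region_of_interest(const char *text)
{
    return (long)strlen(text) + strtol(text, NULL, 10);
}

/* Run the region once before counting starts so that, under lazy binding,
 * ld.so resolves strlen and strtol here rather than in the measured region. */
static void warm_up(void)
{
    volatile long sink = region_of_interest("0");
    (void)sink;
}

/* Warm up first, then count only the second execution of the region. */
static long measure(const char *text)
{
    long result;

    warm_up();
    start_counting();
    result = region_of_interest(text);
    stop_counting();
    return result;
}
```

Calling measure("42") runs the region twice, but only the second run falls 
between the start and stop calls, so the symbol resolution done by ld.so during 
the first run stays outside the counted interval.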

This is just one possible reason from our experience, not to exclude other 
possibilities. Good luck with your exploration! :-)

Thanks,
Rui

Vince Weaver wrote:
> On Mon, 22 Mar 2010, Marc Brünink wrote:
>> short version: Is it in theory possible to get 100% accurate result for 
>> BR_INST_RETIRED:ANY?
> 
> which CPU type are you running this on?
> 
>> So: Is it possible to get a 100% accurate performance counter result (in 
>> theory) (for BR_INST_RETIRED:ANY)? Which are the factors distorting the 
>> results? Is it possible to get rid of them completely?
> 
> I think it's interrupts again, as with the retired_instruction case.
> 
> I ran some tests on a core2 machine using perf on a 2.6.32 kernel.
> The test was my "ten_billion" micro-benchmark I was talking about on 
> Friday.
> 
>  Performance counter stats for './ten_billion':
> 
>     10000000514  instructions             #      0.000 IPC  
>      4999990511  branches                
>               1  page-faults             
> 
>     1.675311657  seconds time elapsed
> 
> 
> The expected number of branches for this code is 4,999,989,996.  So
> the overcount is 513, which is almost exactly the same as the extra count 
> on retired instructions, which makes it look like the retired branch count 
> is also being incremented once each time an interrupt happens.
> 
> Vince
> vweav...@eecs.utk.edu


_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
