Hi Marc and Vince,

On-demand (lazy) dynamic linking is one possible reason. If the runtime linker is still resolving symbols after counting has started, the code it executes to do so contributes to the counts.

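To make the ordering concrete: in a dynamically linked program, the first call to a shared-library function can go through the lazy resolver, so it matters whether that first call happens before or after counting starts. Below is a minimal sketch of calling the function once ahead of time. Note that counters_start() and counters_stop() are hypothetical placeholders that I made up so the sketch builds without PAPI; in a real program they would be PAPI_start() and PAPI_stop() on an event set. The other function names are also just for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical placeholders standing in for PAPI_start()/PAPI_stop().
   They only record whether "counting" is active. */
static int counting = 0;
static void counters_start(void) { counting = 1; }
static void counters_stop(void)  { counting = 0; }

/* Call every library function used by the measured region once, so that
   any lazy symbol resolution happens before counting begins. */
static void warm_up(void)
{
    assert(!counting);   /* warming up inside the counted region defeats the purpose */
    (void)rand();
}

/* The region to be measured. rand() lives in the shared C library, so in a
   dynamically linked build its first call may trigger symbol resolution. */
static long measured_region(int iterations)
{
    long sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += rand() % 10;
    return sum;
}

/* Warm up first, then count only the region itself. */
long run_measurement(int iterations)
{
    warm_up();
    srand(1);            /* reseed so the warm-up call does not change the result */
    counters_start();
    long sum = measured_region(iterations);
    counters_stop();
    return sum;
}
```

A caller would simply invoke run_measurement(n). The only point of the sketch is the order of the calls: warm_up() first, then counters_start().
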
Please try linking your executable statically, run it again, and compare the result. Alternatively, prepend "env LD_BIND_NOW=on" (csh syntax) to the command line.

This comes from our past experience, where some extra counts came from the dynamic runtime linker (ld.so). A minimal test program written in C using the PAPI API, which did no floating-point calculation, consistently reported non-zero values for PAPI_FP_INS, and the values varied from run to run. We traced this to the dynamic loader ld.so and built a new, statically linked executable, which consistently reported 0 for the event.

At the time, Rick Kufrin (at NCSA) and I reproduced the 0 count using any of the following three methods:

1) static linking,
2) setting LD_BIND_NOW to "on",
3) warming up the dynamic linking in the code, by calling the functions once, and thus resolving their symbols, before calling PAPI_start().

So on-demand dynamic linking was confirmed as the root cause in that scenario. This was on an x86-64 (Intel Xeon "Clovertown") Linux machine (NCSA's Abe cluster), in answer to a TeraGrid user's question about PAPI.

This is just one possible reason from our experience; it does not exclude other possibilities. Good luck with your exploration! :-)

Thanks,
Rui

Vince Weaver wrote:
> On Mon, 22 Mar 2010, Marc Brünink wrote:
>> short version: Is it in theory possible to get 100% accurate result for
>> BR_INST_RETIRED:ANY?
>
> which CPU type are you running this on?
>
>> So: Is it possible to get a 100% accurate performance counter result (in
>> theory) (for BR_INST_RETIRED:ANY)? Which are the factors distorting the
>> results? Is it possible to get rid of them completely?
>
> I think it's interrupts again, as with the retired_instruction case.
>
> I ran some tests on a core2 machine using perf on a 2.6.32 core2 machine.
> The test was my "ten_billion" micro-benchmark I was talking about on
> Friday.
>
>  Performance counter stats for './ten_billion':
>
>     10000000514  instructions             #      0.000 IPC
>      4999990511  branches
>               1  page-faults
>
>     1.675311657  seconds time elapsed
>
>
> The expected number of branches for this code is 4,999,989,996.  So
> the overcount is 513, which is almost exactly the same as the extra count
> on retired instructions, which makes it look like the retired branch count
> is also being incremented once each time an interrupt happens.
>
> Vince
> vweav...@eecs.utk.edu

_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel