Phil,
I should have provided you with a little more information. The data files
created by hpcrun have the following names:
bash.CPU_CYCLES.molson.5636.0x1604
bash.CPU_CYCLES.molson.5637.0x1605
bash.CPU_CYCLES.molson.5638.0x1606
bash.CPU_CYCLES.molson.5813.0x16b5
code.exe.CPU_CYCLES.molson.5634.0x1602
gzip.CPU_CYCLES.molson.5637.0x1605
rm.CPU_CYCLES.molson.5638.0x1606
rm.CPU_CYCLES.molson.5813.0x16b5
As the data files show, this executable runs four bash shell scripts, two
rm commands, and one gzip command, in addition to the program itself
(code.exe). No thread files were created, and when I ran the program under
strace I could not find any kernel calls to clone (which I am told would
show up if a pthread_create had been done).
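As a cross-check on the strace result, the kernel's view of a process's threads can also be read directly from /proc on Linux. A minimal sketch (Linux-specific; it inspects the current Python process as a stand-in for code.exe):

```python
import os

def thread_count(pid):
    """Number of kernel tasks (threads) currently in a Linux process."""
    return len(os.listdir(f"/proc/{pid}/task"))

# A process that has never called pthread_create has exactly one task.
print(thread_count(os.getpid()))
```

Running this against code.exe's PID during the test would confirm directly whether it ever has more than one task.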
My understanding is that each of the things run by this application runs as
a separate process. The data files show different PIDs (except where a bash
shell is executing one of the commands). I would think that the problems
handling signals in a multi-threaded process should not apply here. Am I
wrong about this?
The /proc/perfmon file shows:
perfmon version : 2.0
model : Montecito
fastctxsw : No
-- and lots more --
The Papi version installed is:
Name : papi Relocations: /usr
Version : 3.5.0 Vendor: Bull
Release : 1.b.3.Bull Build Date: Thu 05 Apr 2007
05:05:28 AM MST
Install Date: Thu 13 Mar 2008 08:32:31 AM MST Build Host: apollon
Group : Application Source RPM:
papi-3.5.0-1.b.3.Bull.src.rpm
-- and some more --
This was built from PAPI 3.5.0 but packaged by Bull into an RPM.
I reran the test using a sample period of 16,000,000 and got the following
results:
real 1m38.636s
user 1m38.240s
sys 0m0.440s
Output from hpcprof:
Columns correspond to the following events [event:period (events/sample)]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (82 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (47 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL:16000000 - CPU Operating Cycles -- All CPU cycles counted (0 samples) [not shown]
CPU_OP_CYCLES_ALL (min):16000000 - CPU Operating Cycles -- All CPU cycles counted (The minimum for events of this type.) (0 samples)
CPU_OP_CYCLES_ALL (max):16000000 - CPU Operating Cycles -- All CPU cycles counted (The maximum for events of this type.) (129 samples)
CPU_OP_CYCLES_ALL (sum):16000000 - CPU Operating Cycles -- All CPU cycles counted (Summed over all events of this type.) (129 samples)
Calculating with these numbers (129 samples times 16,000,000 cycles per
sample gives 2,064,000,000 cycles; dividing by the 1600 MHz clock) shows
about 1.29 seconds of accounted CPU time. This is still a very small
fraction of the CPU time reported by "time". We have actually experimented
with many different sample rates (very large and very small), and all the
ones we have tried seem to produce about the same results.
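For reference, that arithmetic spelled out (the 1600 MHz clock is taken from the earlier discussion):

```python
samples = 129             # total samples reported by hpcprof
period = 16_000_000       # cycles per sample (the sampling period)
clock_hz = 1600 * 10**6   # 1600 MHz Montecito clock

cycles = samples * period     # 2,064,000,000 cycles actually accounted for
seconds = cycles / clock_hz
print(seconds)                # ~1.29 s, versus ~98 s of CPU time from "time"
```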
When I run this test, the number of perfmon interrupts that occur (as shown
in /proc/interrupts) is very close to the total number of samples shown in
the hpcprof output. That suggests to me that we are not ignoring or losing
interrupts. Instead, it looks to me like perfmon has masked monitoring (as
part of handling an overflow) and then the monitored process somehow got
back into execution without monitoring being restarted. This likely
happened during interrupt processing, since that is where the code
responsible for managing all this seems to be. What I do not know is
whether this happened many times or just once. If it happened only once,
I would expect the perfmon kernel debug data to show that the last
interrupt processed masked monitoring but never restarted it. In fact, the
kernel debug data does not show that. This suggests to me that we are
somehow allowing the monitored process to execute with monitoring masked,
so the hardware is not counting the events.
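One way to automate that comparison is to sum the per-CPU counts on the perfmon line of /proc/interrupts before and after a run. A sketch (the "PFM" label is an assumption; the name of the perfmon vector varies by kernel):

```python
def irq_total(interrupts_text, label):
    """Sum per-CPU counts for /proc/interrupts lines matching label."""
    total = 0
    for line in interrupts_text.splitlines():
        if label in line:
            fields = line.split()
            # fields[0] is the vector number/name; the purely numeric
            # fields that follow are the per-CPU interrupt counts.
            total += sum(int(f) for f in fields[1:] if f.isdigit())
    return total

# Hypothetical /proc/interrupts excerpt, for illustration only.
sample = """\
           CPU0       CPU1
 239:      1021       1187   PFM  perfmon
LOC:    5312210    5310098   Local timer interrupts
"""
print(irq_total(sample, "PFM"))   # 2208
```

Snapshotting the real file before and after the test and subtracting the two totals gives the number of PMU interrupts taken during the run, which can then be compared directly against hpcprof's sample counts.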
First, I would like to know whether you think the above analysis is correct
(after all, I am just a rookie and do not yet understand all this stuff).
Second, does anyone have suggestions that could help me detect that the
monitored process is being executed while monitoring is masked? For
instance, is there a place in the code that we always reach when we are
done processing an overflow, where I could add a check for masked
monitoring and emit some additional debug info if that happens?
I am happy to do the work if you can point me in the right direction.
Gary
From: Philip Mucci <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Date: 04/16/2008 04:20 PM
Subject: Re: [perfmon2] Pfmon and Hpcrun give inconsistent results.
Hi Gary,
>> Phil,
>>
>> On Wed, Apr 16, 2008 at 11:22 AM, Philip Mucci <[EMAIL PROTECTED]>
>> wrote:
>>> Folks,
>>>
>>> hpcrun does its sampling inside the target process using first-person
>>> access, not third-person ptrace() like pfmon, so the process is
>>> implicitly blocked when processing samples, i.e. there are no dropped
>>> samples unless something else has gone wrong.
>>>
>> Thanks for clarifying this. It makes more sense given how PAPI works.
>
> I think this means that I just lost the best explanation I have had
> so far
> as to why I see these inconsistencies.
>
Yes, you are correct. The blocking is not an issue in hpcrun.
>>
>>> Another thing: you cannot rely on the sample count of hpcrun to
>>> compute cycles. Why? Because those are only the samples that have not
>>> been dropped. If samples occur outside of the sample space (as can
>>> happen when one has floating-point exceptions), the address will be in
>>> kernel space and it will be dropped. pfmon has no concept of filtering
>>> out addresses, so even if you ask for user-space samples, you'll still
>>> get samples in the output with kernel addresses. I'm not sure what the
>>> default is for your version of pfmon.
>>>
>> pfmon does not do filtering of samples. It relies on the hardware via
>> the priv levels. By default, pfmon only measures at the user level.
>> That does not mean you won't get kernel-level samples, because there
>> are boundary effects when sampling.
>>
>>> Which value is correct, according to /bin/time? 2 billion or 154
>>> billion?
>>>
>> This is a valid point. Which value makes the most sense relative to time?
>
>
> Let me provide all the steps I use when running these tests. Maybe I
> am just doing something wrong and you can correct my misunderstanding.
>
> When I use "time" with my hpcrun test, it provides this information:
>
> time hpcrun -e CPU_CYCLES:32767 -o hpcrun.data -- ./code.exe >hpcrun
> 2>hpcrun.debug
> real 1m44.921s
> user 1m39.490s
> sys 0m2.420s
>
> The output from the hpcprof run on all of the data files produced by
> this
> test shows the following summary information:
>
> Columns correspond to the following events [event:period (events/sample)]
> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (9 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (29 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (41602 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (24490 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (7 samples) [not shown]
> CPU_CYCLES:32767 - CPU Cycles (5 samples) [not shown]
> CPU_CYCLES (min):32767 - CPU Cycles (The minimum for events of this type.) (1 samples)
> CPU_CYCLES (max):32767 - CPU Cycles (The maximum for events of this type.) (66127 samples)
> CPU_CYCLES (sum):32767 - CPU Cycles (Summed over all events of this type.) (66200 samples)
>
Gary, this executable must be multi-threaded? Does each of the threads do
the same amount of work? If so, the above is your clue. Most of the
threads are experiencing the perfmon2 race where the signal comes in but
gets dropped, and thus monitoring is not restarted. PAPI from CVS has
fixes for this on perfmon2 platforms. Is this OS using perfmon2 or the old
'monolithic' perfmon interface? If this is perfmon1, then we may have an
issue here. But PAPI-CVS handles this properly for perfmon2 by using a
real-time signal.
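The real-time signal matters because classic signals do not queue: a second overflow notification arriving while the first is still pending is silently merged, whereas real-time signals queue every instance. A small Linux-only demonstration of that difference, using SIGUSR1 and SIGRTMIN as stand-ins for the perfmon notification (not the actual perfmon2 code path):

```python
import os
import signal

sigs = {signal.SIGUSR1, signal.SIGRTMIN}
signal.pthread_sigmask(signal.SIG_BLOCK, sigs)

# Deliver each signal twice while both are blocked.
for _ in range(2):
    os.kill(os.getpid(), signal.SIGUSR1)
    os.kill(os.getpid(), signal.SIGRTMIN)

# Drain everything pending synchronously, without running handlers.
received = []
while (info := signal.sigtimedwait(sigs, 0)) is not None:
    received.append(info.si_signo)

signal.pthread_sigmask(signal.SIG_UNBLOCK, sigs)

# The classic signal coalesced into one pending instance;
# the real-time signal queued both deliveries.
print(received.count(signal.SIGUSR1), received.count(signal.SIGRTMIN))
```

With a classic signal, the merged delivery is exactly the "dropped signal, monitoring never restarted" failure mode described above.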
Judging from the other numbers you have below, I'd guess that if you set
the sample rate to something much lower (which is certainly reasonable;
32768 is awfully small for a 1600 MHz processor), then you'd get more
reasonable results. Experience (from Mark and the Rice folks) has shown
that this signal dropping is much less likely to happen when the interrupt
load is low.
So, I reckon if you set the sample period to 16,000,000 (approximately
100 samples/second), you'll get answers that match up.
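The rate arithmetic behind that suggestion, assuming the 1600 MHz clock discussed earlier:

```python
clock_hz = 1600 * 10**6   # 1600 MHz clock
period = 16_000_000       # cycles between counter overflows

print(clock_hz / period)  # 100.0 overflow interrupts per second

# At the original period of 32767, the same clock generates roughly
# 48,830 interrupts per second -- a far heavier interrupt load.
print(round(clock_hz / 32767))
```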
Phil
P.S. Please get back to me on which version of PAPI and perfmon kernel
support you have.
_______________________________________________
perfmon2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel