Hi Stephane, Vince,
When I run these tests, I am running on a RHEL6 system which uses a kernel
based on 2.6.32 but which has many features backported by Red Hat (such as
support for uncore and Haswell). The tests are being run on an SNBEP system
(family 6 model 45).
I have run the perf command uncore tests again and discovered that even when I
run the same test repeatedly, sometimes it will work (both events return
counts) and sometimes it will fail (one or both events return errors). And I
think that I have now seen two system crashes while running these tests. The
attached file shows some of the test output if you are interested.
I am beginning to believe that the kernel I am running could be responsible for
this behavior. So if either of you guys could rerun these tests on a newer
kernel and tell me that all works fine, it would help to confirm my suspicions.
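For comparing results across machines, something like the following generic Linux sketch (not part of the original tests) could be used to record the kernel and CPU being tested:

```shell
# Record the running kernel version and CPU family/model so results from
# different machines can be compared (SNBEP is family 6, model 45).
uname -r
# These /proc/cpuinfo fields are Linux-specific; fall through quietly elsewhere.
grep -m1 '^cpu family' /proc/cpuinfo 2>/dev/null || true
grep -m1 -E '^model[[:space:]]' /proc/cpuinfo 2>/dev/null || true
```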
Of course, if either of you know of any kernel commits that may have corrected
a problem like this, please share them; I am very interested.
Thanks
Gary
From: Stephane Eranian [mailto:eran...@googlemail.com]
Sent: Tuesday, September 09, 2014 1:37 AM
To: Vince Weaver
Cc: Gary Mohr; Michel Brown; perfmon2-devel
Subject: Re: [perfmon2] Error reporting when using invalid combination of
umasks.
On Mon, Sep 8, 2014 at 8:23 PM, Vince Weaver
<vincent.wea...@maine.edu> wrote:
On Tue, 9 Sep 2014, Gary Mohr wrote:
> --- ls output removed ---
>
> Performance counter stats for '/bin/ls':
>
> 5,625 uncore_cbox_0/event=0x35,umask=0xa/
> [26.27%]
> <not supported> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
>
> 0.002038929 seconds time elapsed
>
>
> So this behaved similarly to PAPI/libpfm4. The first event returned a count
> and the second event got an error.
> Just for fun, I used the same events in the opposite order:
>
>
> perf stat -a -e
> \{"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"\}
> /bin/ls
>
> --- ls output removed ---
>
> Performance counter stats for '/bin/ls':
>
> <not counted> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
> <not supported> uncore_cbox_0/event=0x35,umask=0xa/
>
> 0.002003219 seconds time elapsed
>
>
> This caused both events to report an error. This seems to me like a kernel
> problem. I also tried using each event by itself and they both returned
> counts. With PAPI/libpfm4 I believe that this test will return a count for
> the first event and an error on the second.
> You implied that the { }'s may influence if or how events are grouped. So I
> tried the command again in the original order without the { } characters and
> got this:
>
>
> perf stat -a -e
> "uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
> /bin/ls
>
> --- ls output removed ---
>
> Performance counter stats for '/bin/ls':
>
> 57,288 uncore_cbox_0/event=0x35,umask=0xa/
> [18.05%]
> 158,292 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
> [ 3.07%]
>
> 0.001963151 seconds time elapsed
>
>
> Both events give a count. I have never seen this result with PAPI/libpfm4
> but I have never tried them with grouping enabled when calling the kernel.
>
> In PAPI we turned grouping off so that the kernel would allow us to use
> events from different uncore pmu's at the same time. I can try turning it
> back on and running these two events to see what happens. If they work,
> maybe a better solution is to try a hybrid form of grouping. We could create
> a different group for each uncore pmu and put all the events associated with
> a given pmu into that pmu's group. We would then call the kernel once for
> each group rather than once for each event as we are doing now.
>
> Any idea if the kernel will let us play the game this way?
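The per-PMU grouping idea described above could be sketched with perf stat as one { } group per uncore PMU. The uncore_cbox_1 event below is an invented placeholder to show the shape, not a verified encoding:

```shell
# Hypothetical sketch: one {} group per uncore PMU, so events from
# different PMUs are never forced into the same scheduling group.
# The uncore_cbox_1 event is an invented placeholder for illustration.
CBOX0_GROUP='{uncore_cbox_0/event=0x35,umask=0xa/,uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/}'
CBOX1_GROUP='{uncore_cbox_1/event=0x35,umask=0xa/}'
# Print the command rather than running it, since running it needs perf
# and SNBEP uncore hardware:
echo perf stat -a -e "$CBOX0_GROUP" -e "$CBOX1_GROUP" /bin/ls
```

In perf_event_open terms, this corresponds to passing the fd of the first event of each PMU as group_fd when opening the remaining events of that same PMU, so the kernel is called once per group instead of once per event.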
Interesting, I'll have to run some more tests on my Sandybridge-EP
machine.
What kernel are you running again? I'm testing on a machine running 3.14
so possibly there were scheduling bugs with older kernels that were fixed
at some point.
When running both with and without {} I get something like:
Performance counter stats for 'system wide':
606 uncore_cbox_0/event=0x35,umask=0xa/
[99.61%]
247 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
0.000851895 seconds time elapsed
Which makes it look like it's multiplexing the events in some sort of way
I'm not really following, maybe to avoid a scheduling issue.
It should not need to multiplex in this case. I believe the events are
compatible.
I'd like to see what is going on with the latest upstream kernel.
There were some issues processing the filter in the uncore code, but not for
SNBEP AFAIR.
Vince
------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce.
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
I reran the tests using the perf command with these events. Like last time, in
some cases the results seemed inconsistent. But when looking at it closer, it
appears like there is something in the kernel which is left busy (or locked)
after a test is run. This causes a rerun of the same test to fail immediately
after it had just worked. The failure may be reported as one of the events
returning an error or both of them returning an error. If I continue to rerun
the test, eventually both events will report errors. But after getting into
this state, I have seen the events start working again. Unfortunately I have
also seen a couple of system crashes when rerunning the test after getting
into this state.
When running perf with { } around the events, the first event seems to work and
the second event reports a <not supported> error. This seems to be pretty
consistent.
perf stat -a -e
\{"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"\}
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
4,978 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[25.99%]
<not supported> uncore_cbox_0/event=0x35,umask=0xa/
0.002133296 seconds time elapsed
perf stat -a -e
\{"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"\}
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
5,171 uncore_cbox_0/event=0x35,umask=0xa/
[26.32%]
<not supported> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
0.001936927 seconds time elapsed
When running perf without { } around the events, to start with both events seem
to work. But after several runs, you can see that one of the events started
reporting an error, then started working again. A short time after that,
both events started reporting errors.
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
3,410 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[17.68%]
3,640 uncore_cbox_0/event=0x35,umask=0xa/
[ 3.19%]
0.002091525 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
6,366 uncore_cbox_0/event=0x35,umask=0xa/
[18.10%]
1,749 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[ 3.49%]
0.001990845 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
8,824 uncore_cbox_0/event=0x35,umask=0xa/
[12.64%]
1,943 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[10.14%]
0.001942538 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0xa/
3,148 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[ 7.66%]
0.002072645 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0xa/
3,205 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[ 8.64%]
0.001990900 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
2,556 uncore_cbox_0/event=0x35,umask=0xa/
[ 1.45%]
2,525 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[ 7.76%]
0.002104193 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
30,216 uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
[16.70%]
109,673 uncore_cbox_0/event=0x35,umask=0xa/
[ 0.44%]
0.001983598 seconds time elapsed
Once both events were reporting an error, using the perf command with { }
around the events made the first event report a <not counted> error and the
second event a <not supported> error. When omitting the { } from the command
line, both events reported a <not counted> error. As stated above, I have seen
whatever causes these errors go away and the events start reporting counts
again. But I have also seen a couple of system crashes when trying to rerun
the tests while in this condition.
perf stat -a -e
\{"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"\}
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0xa/
<not supported> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
0.001981917 seconds time elapsed
perf stat -a -e
\{"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"\}
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
<not supported> uncore_cbox_0/event=0x35,umask=0xa/
0.001957310 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0xa/","uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0xa/
<not counted> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
0.001998865 seconds time elapsed
perf stat -a -e
"uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/","uncore_cbox_0/event=0x35,umask=0xa/"
/bin/ls
bhpcrun bwlcudaout
Performance counter stats for '/bin/ls':
<not counted> uncore_cbox_0/event=0x35,umask=0xa/
<not counted> uncore_cbox_0/event=0x35,umask=0x4a,filter_nid=0x1/
0.001978653 seconds time elapsed