[hpx-users] using papi on intel processors Fwd: [Ptools-perfapi] Notes on important errata on Intel processors...

Patricia Grubel Fri, 02 Sep 2016 12:36:01 -0700

Here is a chain of emails with important information about using PAPI events on 
Intel processors. I was asked to post it here so we would have a record for the 
HPX users.



Regards,
Pat Grubel


On Aug 31, 2016, at 6:42 PM, Stephane Eranian <[email protected]> wrote:

Hi Phil,

On Tue, Aug 30, 2016 at 3:31 AM, Philip Mucci <[email protected]> wrote:
Hi folks,

In some of my work, I frequently run into folks having problems with native 
Intel events. As most of you know, I like to harp on people to basically 
ignore preset events these days because they foster misunderstanding, as 
hardware just isn't the same as when PAPI was first written (and through 
their general abstraction). However, native events aren't a panacea... One 
still often needs to RT(f)M in order to fully understand what one is seeing. To 
save you that hassle, I'm providing this bit of info... as I find reading 
Intel docs right up there with having to read that Ayn Rand or Joel Osteen 
novel that crazy friends give you.

This is a message to a client who has had issues on HSX aka Haswell-EP and JKT 
aka Sandy Bridge EP processors, most of which is in common with much of the E5 
processor line. I suppose this should be turned into a FAQ entry on the PAPI 
page, but that depends on your comments, which are most welcome.

Note that the below native events are in Intel 'parlance' with the '.' 
qualifier. libpfm now accepts these fully, thanks to the work of my good friend 
and perf-dude extraordinaire, Mr. Stephane Eranian of Google.

Regards,

Phil

Below is my list of events that I suspect are not mapped correctly.  These 
events remain consistently screwy for all of the applications that I've 
looked at so far, so it's not real application behavior.

SandyBridge (SNBEP aka JKT):
mem_load_uops_llc_miss_retired.remote_dram
mem_load_uops_retired.l1_hit
mem_load_uops_retired.l2_hit
mem_load_uops_retired.llc_hit
mem_load_uops_retired.llc_miss
mem_load_uops_llc_hit_retired.xsnp_hit
mem_load_uops_llc_hit_retired.xsnp_hitm
mem_load_uops_llc_hit_retired.xsnp_miss

All of the above have errata on the E5 processor. The errata are BT241 
(undercounts) and BT243 (unreliable/corruption). The former is a hardware bug; 
the latter is a bug that is a byproduct of hyperthreading.

See pages 82/83 of:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf

There is a workaround for BT241, but it increases L3 and main memory latencies. 
It also requires some permissions that most regular users don't have, i.e. 
writing bits to /dev/cpu_dma_latency and /sys/pci, as one has to tweak some 
bits in MSRs and the PCI bus space.

Yes, this is the late GO (Global Observability) bug. The kernel does not do 
anything on this one simply because the tradeoff is severe given the 
performance loss of the workaround. It is left to each user to decide if they 
can tolerate the slowdown while measuring.

Workarounds exist in the pmu-tools latego.py script.
https://github.com/andikleen/pmu-tools

Make sure you disable them after you count, otherwise you are hosing your 
machine's performance!

$ latego.py enable mem_load_uops_retired.llc_miss
do papi stuff
$ latego.py disable mem_load_uops_retired.llc_miss

For hyperthreading, one can reduce the problem by making sure the per-thread 
mask only contains one of the two threads on the same core. Use numactl or 
taskset ahead of time and make sure you understand the mappings. HT siblings 
are usually high-order processor numbers. But it's still there... the only 
foolproof way is to disable HT in BIOS...
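
To check the mappings rather than guess them, something like the following 
Python sketch could build a CPU list with one hardware thread per core. It 
assumes the usual Linux sysfs layout 
(/sys/devices/system/cpu/cpuN/topology/thread_siblings_list); the function 
names are made up for illustration. It only keeps your own process off sibling 
threads, so it reduces the problem and does not remove it.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,16" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_thread_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of HT siblings."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists})


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every logical CPU (Linux sysfs assumed)."""
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    cpus = one_thread_per_core(read_sibling_lists())
    # Print a CPU list that can be handed to taskset -c.
    print("taskset -c " + ",".join(str(c) for c in cpus))
```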

--- BT241 snip ---
Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed 
below may undercount.
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_RETIRED.LLC_MISS*
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM*
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM*
MEM_TRANS_RETIRED.LOAD_LATENCY*

The undercount of these events can be partially resolved (but not eliminated) 
by setting MSR_PEBS_NUM_ALT. PEBS Accuracy Enable (MSR 39CH; bit 0) to 1. When 
using the events marked with an asterisk, set the Direct-to-core disable field 
(Bus 1; Device 14; Function 0; Offset 84; bit 1) to 1 for Local memory reads 
and (Bus 1; Device 8; Function 0; Offset 80; bit 1) to 1 and (Bus 1; Device 9; 
Function 0; Offset 80; bit 1) to 1 for Remote memory reads. The improved 
accuracy comes at the cost of a reduction in performance; this workaround 
generally should not be used during normal operation.
--- snip off ---
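
For the MSR half of that workaround, a rough Python sketch of the 
read-modify-write follows. It assumes the Linux msr driver is loaded and 
exposes /dev/cpu/N/msr, and that the caller is root; the helper names are 
hypothetical. The PCI configuration writes for the asterisked events are not 
covered here, and nothing below runs unless you call it yourself.

```python
import os
import struct

MSR_PEBS_NUM_ALT = 0x39C        # MSR address quoted in the erratum text
PEBS_ACCURACY_ENABLE_BIT = 0    # bit 0, per the erratum text


def with_bit_set(value, bit):
    """Return `value` with `bit` set to 1."""
    return value | (1 << bit)


def with_bit_cleared(value, bit):
    """Return `value` with `bit` set to 0 (to undo the workaround)."""
    return value & ~(1 << bit)


def read_msr(cpu, msr):
    """Read a 64-bit MSR on one logical CPU (assumes /dev/cpu/N/msr, needs root)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def write_msr(cpu, msr, value):
    """Write a 64-bit MSR on one logical CPU (assumes /dev/cpu/N/msr, needs root)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), msr)
    finally:
        os.close(fd)


def set_pebs_accuracy(cpu, enable):
    """Read-modify-write the PEBS Accuracy Enable bit, leaving other bits alone."""
    old = read_msr(cpu, MSR_PEBS_NUM_ALT)
    change = with_bit_set if enable else with_bit_cleared
    write_msr(cpu, MSR_PEBS_NUM_ALT, change(old, PEBS_ACCURACY_ENABLE_BIT))
```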

--- BT243 snip ---
When operating with SMT enabled, a memory at-retirement performance monitoring 
event (from the list below) may be dropped or may increment an enabled event on 
the corresponding counter with the same number on the physical core's other 
thread rather than the thread experiencing the event. Processors with SMT 
disabled in BIOS are not affected by this erratum.

The list of affected memory at-retirement events is as follows:
MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS
--- snip off ---
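
Since the two Sandy Bridge EP lists overlap only partly, a small lookup can 
tell which erratum applies to an event before trusting its counts. The sketch 
below just transcribes the BT241 and BT243 lists quoted above into Python sets; 
errata_for is a hypothetical helper name, not part of PAPI or libpfm.

```python
# Event names transcribed from the BT241 and BT243 excerpts in this thread.
BT241 = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT",
    "MEM_LOAD_UOPS_RETIRED.LLC_MISS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM",
    "MEM_TRANS_RETIRED.LOAD_LATENCY",
}

BT243 = {
    "MEM_UOP_RETIRED.LOADS",
    "MEM_UOP_RETIRED.STORES",
    "MEM_UOP_RETIRED.LOCK",
    "MEM_UOP_RETIRED.SPLIT",
    "MEM_UOP_RETIRED.STLB_MISS",
    "MEM_LOAD_UOPS_RETIRED.HIT_LFB",
    "MEM_LOAD_UOPS_RETIRED.L1_HIT",
    "MEM_LOAD_UOPS_RETIRED.L2_HIT",
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE",
    "MEM_LOAD_UOPS_RETIRED.LLC_MISS",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM",
    "MEM_LOAD_UOPS_RETIRED.L2_MISS",
}


def errata_for(event):
    """Return the SNB-EP errata (from the lists above) that affect `event`."""
    name = event.strip().upper()  # native event names are often given in lower case
    hits = []
    if name in BT241:
        hits.append("BT241")
    if name in BT243:
        hits.append("BT243")
    return hits


if __name__ == "__main__":
    for ev in ("mem_load_uops_retired.l1_hit", "mem_load_uops_retired.llc_miss"):
        print(ev, errata_for(ev))
```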

Yes, this is the infamous HT bug causing cross HT counter corruption. If any of 
these events is measured on counterX in one HT, then counterX on the sibling HT 
may get corrupted.
For this problem, we have developed a kernel workaround which has been accepted 
into the Linux 4.1 kernel. There will be a presentation on this work at SC16.
The workaround avoids the corruption on the sibling counter. But it does not 
correct the leak from the corrupting counter.
For all I know, this workaround may have been backported by Red Hat and other 
distros to older kernels.
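
A quick way to see whether a machine is new enough to carry that workaround is 
to compare the running kernel version against 4.1. The Python sketch below does 
only that; the function names are hypothetical. Because of possible distro 
backports, a False result on an older vendor kernel is inconclusive, not proof 
that the workaround is absent.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '3.10.0-327.el7.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(m.group(1)), int(m.group(2))


def has_ht_workaround(release):
    """True if the upstream version is at least 4.1 (backports are not detected)."""
    return kernel_version(release) >= (4, 1)


if __name__ == "__main__":
    if platform.system() == "Linux":
        rel = platform.release()
        print(rel, "upstream >= 4.1:", has_ht_workaround(rel))
```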

fp_comp_ops_exe.sse_scalar_single
fp_comp_ops_exe.sse_packed_single


As far as these go, there are no known issues with them from Intel AFAICT. If 
Mr. Bandwidth aka the famous John McCalpin is lurking here, he might have 
something to add. Some dated microbenchmarks seem to validate their counting. 
https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops

I believe the FLOPS events were fixed in Broadwell and clearly documented in 
their event files here:
https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V16.json

Haswell (HSX):
cycle_activity.cycles_l1d_pending
cycle_activity.stalls_l1d_pending

For these events, there is likely a bug in the released kernel that schedules 
them on the wrong counter. See https://github.com/andikleen/pmu-tools/issues/18

Yes, and it was fixed in Linux 4.0.

mem_load_uops_l3_hit_retired.xsnp_hit
mem_load_uops_l3_hit_retired.xsnp_hitm
mem_load_uops_l3_hit_retired.xsnp_miss
mem_load_uops_l3_miss_retired.remote_dram
mem_load_uops_l3_miss_retired.remote_fwd
mem_load_uops_l3_miss_retired.remote_hitm

Here again, there are two errata, HSM26 (this time no workaround) and HSM30 
(hyperthreading).
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-mobile-specification-update.pdf

Reproduced here below:

--- snip HSM26 ---
Certain Local Memory Read / Load Retired PerfMon Events May Undercount

Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed 
below may undercount.
MEM_LOAD_UOPS_RETIRED.L3_HIT (Event D1H Umask 04H)
MEM_LOAD_UOPS_RETIRED.L3_MISS (Event D1H Umask 20H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS (Event D2H Umask 01H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT (Event D2H Umask 02H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM (Event D2H Umask 04H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE (Event D2H Umask 08H)
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM (Event D3H Umask 01H)
MEM_TRANS_RETIRED.LOAD_LATENCY (Event CDH Umask 01H)
PAGE_WALKER_LOADS.DTLB_L3 (Event BCH Umask 14H)
PAGE_WALKER_LOADS.ITLB_L3 (Event BCH Umask 24H)
PAGE_WALKER_LOADS.DTLB_Memory (Event BCH Umask 18H)
PAGE_WALKER_LOADS.ITLB_Memory (Event BCH Umask 28H)

The affected events may undercount, resulting in inaccurate memory profiles. 
Intel has observed undercounts by as much as 40%.
--- snip ---
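
The event/umask pairs in that excerpt can be turned into raw event codes when a 
tool does not know the symbolic name. The Python sketch below packs them the 
way the x86 event-select register lays them out (event select in bits 0-7, unit 
mask in bits 8-15) and prints the rXXXX spelling that perf accepts for raw 
events. The helper names are hypothetical, and other fields such as cmask, edge 
or the load-latency threshold are ignored.

```python
def raw_event_code(event_select, umask):
    """Pack an event select and unit mask into a raw config value (umask << 8 | event)."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in 8 bits")
    return (umask << 8) | event_select


def perf_raw_name(event_select, umask):
    """Format the raw code in the rXXXX form used for raw events by perf."""
    return "r%04x" % raw_event_code(event_select, umask)


# A few (event, umask) pairs copied from the HSM26 excerpt above.
HSM26_EVENTS = {
    "MEM_LOAD_UOPS_RETIRED.L3_HIT": (0xD1, 0x04),
    "MEM_LOAD_UOPS_RETIRED.L3_MISS": (0xD1, 0x20),
    "MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM": (0xD2, 0x04),
    "MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM": (0xD3, 0x01),
    "PAGE_WALKER_LOADS.DTLB_L3": (0xBC, 0x14),
}

if __name__ == "__main__":
    for name, (event, umask) in sorted(HSM26_EVENTS.items()):
        print("%-45s %s" % (name, perf_raw_name(event, umask)))
```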

--- snip HSM30 ---
Performance Monitor Counters May Produce Incorrect Results

When operating with SMT enabled, a memory at-retirement performance monitoring 
event (from the list below) may be dropped or may increment an enabled event on 
the corresponding counter with the same number on the physical core's other 
thread rather than the thread experiencing the event. Processors with SMT 
disabled in BIOS are not affected by this erratum.

The list of affected memory at-retirement events is as follows:

MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.L3_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS

Due to this erratum, certain performance monitoring events will produce 
unreliable results during hyper-threaded operation.
--- snip ---

Fixed by kernel workaround in 4.1



uops_issued_single_mul

This event is missing a period; it's called uops_issued.single_mul. The 
problem with this event is very likely a kernel scheduling bug, although I 
don't know if anyone has ever tested this event, and the Intel documentation 
does not clarify what packed means here or whether it applies to x87, SSE or 
AVX. So its usefulness is TBD.

This event is not marked with any constraints in the official event table. Are 
you saying it always counts to 0?

Hope this helps.
Not sure I can post on the PAPI mailing list. If not, please forward to this 
list.
Thanks.


_______________________________________________
Ptools-perfapi mailing list
[email protected]
http://lists.eecs.utk.edu/mailman/listinfo/ptools-perfapi

_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
