> 4/ Intel PEBS
>
> Since Netburst-based processors, Intel PMUs support a hardware
> sampling buffer mechanism called PEBS.
>
> PEBS really became useful with Nehalem.
>
> Not all events support PEBS. Up until Nehalem, only one counter
> supported PEBS (PMC0). The format of the hardware buffer has
> changed between Core and Nehalem. It is not yet architected, thus
> it can still evolve with future PMU models.
>
> On Nehalem, there is a new PEBS-based feature called Load Latency
> Filtering which captures where data cache misses occur (similar to
> Itanium D-EAR). Activating this feature requires setting a latency
> threshold hosted in a separate PMU MSR.
>
> On Nehalem, given that all 4 generic counters support PEBS, the
> sampling buffer may contain samples generated by any of the 4
> counters. The buffer includes a bitmask of registers to determine
> the source of the samples. Multiple bits may be set in the
> bitmask.
>
> How PEBS will be supported for this new API?

Note, the relevance of PEBS (or IBS) should not be over-stated: for
example, it fundamentally cannot do precise call-chain recording (it
only records the RIP, not any of the return frames), which limits its
utility. Another limitation is that only a few basic hardware event
types are supported by PEBS.

Having said that, PEBS is a hardware sampling feature that is
definitely saner than AMD's IBS. There are two immediate incremental
uses of it in perfcounters:

 - it makes flat sampling lower overhead by avoiding an NMI for
   every sample point.

 - it makes flat sampled data more precise. (I.e. it can avoid the
   1-2 instructions of 'skid' in a sample position, for a handful
   of PEBS-capable events.)

As such, its primary support form would be 'transparent enablement':
i.e. on those (relatively few) events that are PEBS-capable it would
be enabled automatically, and would result in more precise (and
possibly cheaper) samples. No separate APIs are really needed - the
kernel can abstract it away and give the user what the user wants:
good and fast samples.

Regarding demultiplexing on Nehalem: PEBS goes into the DS (Debug
Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
records in the same stream of data. One possible model to support
them is to set the PEBS threshold to one, and hence generate an
interrupt for each PEBS record. At offset 0x90 of the PEBS record we
have a snapshot of the global status register:

    0x90 IA32_PERF_GLOBAL_STATUS

which tells us which counter overflowed relative to the previous PEBS
record in the DS. If this were not reliable, we could still poll all
active counters for overflows and get an occasionally imprecise but
still statistically meaningful demultiplexing.

As to enabling PEBS with the (CPU-)global latency recording filters,
we can do this transparently for every PEBS-capable event, or we can
mandate PEBS scheduling when a PEBS-only feature like load latency is
requested.

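To make the demultiplexing idea concrete, here is a minimal user-space
sketch, not kernel code. The only fact it takes from the above is the
0x90 offset of the IA32_PERF_GLOBAL_STATUS snapshot. Everything else is
an assumption made for illustration: the helper names
(pebs_record_status, pebs_demux) are hypothetical, the record size is
passed in by the caller because the format is not architected, bits 0-3
of the snapshot are assumed to map to the 4 generic counters, and each
snapshot is taken at face value rather than diffed against the previous
record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the IA32_PERF_GLOBAL_STATUS snapshot, as given above. */
#define PEBS_OFF_GLOBAL_STATUS 0x90
/* Assumption: bits 0-3 of the snapshot map to the 4 generic counters. */
#define PEBS_NUM_COUNTERS 4

/* Read the status snapshot of one raw record (host byte order,
 * memcpy avoids unaligned access). Hypothetical helper. */
static uint64_t pebs_record_status(const unsigned char *rec)
{
    uint64_t status;

    memcpy(&status, rec + PEBS_OFF_GLOBAL_STATUS, sizeof(status));
    return status;
}

/*
 * Walk a buffer of fixed-size records and credit every counter whose
 * bit is set in a record's snapshot. Several bits may be set, in which
 * case the record is credited to each of them.
 * Returns the number of complete records visited.
 */
static size_t pebs_demux(const unsigned char *buf, size_t len,
                         size_t rec_size,
                         uint64_t counts[PEBS_NUM_COUNTERS])
{
    size_t n = 0;

    /* The record must be large enough to contain the snapshot. */
    assert(rec_size >= PEBS_OFF_GLOBAL_STATUS + sizeof(uint64_t));

    for (size_t off = 0; off + rec_size <= len; off += rec_size, n++) {
        uint64_t status = pebs_record_status(buf + off);

        for (int i = 0; i < PEBS_NUM_COUNTERS; i++)
            if (status & (UINT64_C(1) << i))
                counts[i]++;
    }
    return n;
}
```

With the PEBS threshold set to one, len would cover a single record
per interrupt, so the loop body runs once and the set bits name the
event(s) that sample belongs to.
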
This means that for most purposes PEBS will be transparent.

_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel