Hi -- I'm sorry to be joining this discussion so late. A few of my colleagues pointed me toward the current thread on IBS and I've tried to catch up by reading the archives.

A short self-introduction: I'm a member of the AMD CodeAnalyst team. Ravi Bhargava and I wrote Appendix G (concerning IBS) of the AMD Software Optimization Guide for AMD Family 10h Processors, and at one point in my life I worked on DCPI (using ProfileMe).

First off, Stephane and Rob have done a good job representing IBS and also ProfileMe. Thanks, guys! Rather than grossly disturb the current discussion, I'd like to offer a few points of clarification and maybe a little useful history.

Peter's observation that IBS is a "mismatch with the traditional one value per counter thing" is quite apt. IBS has similarities to ProfileMe. Stephane's citation of the Itanium Data-EAR and Instruction-EAR is also very relevant; both are examples of profile data that do not fit with the "one value per counter thing."

IBS Fetch. IBS fetch sampling does not exactly sample x86 instructions. The current fetch counter counts fetch operations, where a fetch operation may be a 32-byte fetch block (on AMD Family 10h) or a fetch operation initiated by a redirection such as a branch. A fetch block is 32 bytes of instruction information which is sent to the instruction decoder. The fetch address that is reported may either be the start of a valid x86 instruction or the start of a fetch block. In the second case, the address may be in the middle of an x86 instruction.

IBS fetch sampling produces a number of event flags (e.g., instruction cache miss), but it also produces the latency (in cycles) of the fetch operation. The latencies can be accumulated in descriptive statistics or, better, in a histogram, since descriptive statistics don't really show where an access is hitting in the memory hierarchy. BTW, even though an IBS fetch sample may be reported, the decoder may not use the instruction bytes due to a late arriving redirection.

IBS Op. IBS op sampling does not sample x86 instructions. It samples the ops which are issued from x86 instructions. Some x86 instructions issue more than one op. Microcoded instructions are particularly thorny, as a single REP MOV may issue many ops, thereby affecting the number of samples that fall on them (i.e., disproportionate to the execution frequency of the surrounding basic block).
The number of ops issued is data dependent and is unpredictable. Appendix C of the Software Optimization Guide lists the number of ops issued from x86 instructions (one, two or many).

Beginning with AMD Family 10h RevC, there are two op selection (counting) modes for IBS: cycles-counting and dispatched op counting. Cycles-counting is _not_ equivalent to CPU_CLK_UNHALTED -- it is not a precise version of the performance monitoring counter (PMC) event (event select 0x076).

In cycles-mode, when the current count reaches the max count, the next available dispatch group of ops is selected and a secondary mechanism selects an op within the dispatch group. The dispatch group may contain one, two or three ops. If you smell a rat, you're right. The secondary scheme negatively affects the desired pseudo-random selection scheme. Also, if a dispatch group is not available, the sample is skipped and the counting process is reset. Further, cycles-mode selection is affected by pipeline stalls. This affects the distribution of IBS op samples taken in cycles-mode. With cycles-mode, one instruction may show more data cache miss events than another, but the underlying sampling basis is so skewed that the comparison is not meaningful. IBS op samples are generated only for ops that retire; tagged ops on a "wrong path" are flushed without producing a sample.

Overall, I cannot personally say that IBS cycles-mode produces a precise equivalent to CPU_CLK_UNHALTED. I cannot endorse or recommend its use in this way.

Given these issues, dispatched op counting was added in RevC. This mode is the _preferred_ mode. Ops are counted as they are dispatched and the op that triggers the max count threshold is selected and tagged. Dispatched op mode produces a distribution of op samples that reflects the execution frequency of instructions/basic blocks. DirectPath Double and VectorPath (microcoded) x86 instructions which issue more than one op will still be oversampled, however.
The distribution is important because it allows meaningful comparison of event counts between instructions. Even though the distribution of samples in dispatched op mode reflects execution frequency, it is not a substitute for RETIRED_INSTRUCTIONS (event select 0x0c0). The number of IBS op samples in some workloads, especially those with certain kinds of stack access and microcoded instructions, diverges greatly from RETIRED_INSTRUCTIONS. IBS is what it is.

IBS derived events

Since ProfileMe and Data EAR didn't exactly take the world by storm (oh, yeah, I worked with HP Caliper on Itanium for a while, too ;-), profiling infrastructures like OProfile and CodeAnalyst are largely based on the PMC sampling model. In order to get IBS into practice as quickly as possible, we defined IBS derived events. This allowed us to implement basic support for IBS in both OProfile and CodeAnalyst without major changes in infrastructure.

I should note that translation from raw IBS bits to derived events is, and was always intended to be, performed by user space tools. I personally believe that translation should not be performed in the kernel -- kernel support should be simple and lightweight.

An IBS op sample is a small "packet" of profile data:

* A bunch of event flags (data cache miss, etc.)
* Tag-to-retire time (cycles)
* Completion-to-retire time (cycles)
* DC miss latency (cycles)
* DC miss addresses (64-bit virtual and physical addresses)

These entities can be used to compute latency distributions, memory access maps, etc. IBS enables new kinds of analysis such as data-centric profiling that identifies hot data regions (that could be used to tune data layout in a NUMA environment).

Quite frankly, at this juncture, I find the derived event model to be too limiting. DCPI had a much different way of organizing ProfileMe data that allowed flexible formulation of queries during post-processing -- something that cannot be done with the derived event approach.
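
To make the "packet" idea concrete, here is a minimal Python sketch of how a user space tool might keep whole IBS op samples and answer questions during post-processing, rather than reducing each sample to derived event counts up front. Everything named here is hypothetical: the `IbsOpSample` fields, the bucket edges and the helper functions are chosen for illustration only and do not reflect the IBS register layout or the API of OProfile, CodeAnalyst or DCPI.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IbsOpSample:
    """One IBS op sample. Field names are illustrative, not register names."""
    rip: int              # address of the instruction that issued the op
    dc_miss: bool         # event flag: data cache miss
    tag_to_retire: int    # cycles
    comp_to_retire: int   # cycles
    dc_miss_latency: int  # cycles, meaningful only when dc_miss is set
    dc_virt_addr: int     # 64-bit virtual address of the data access
    dc_phys_addr: int     # 64-bit physical address of the data access


def latency_histogram(samples, bucket_edges=(16, 64, 256)):
    """Histogram of DC miss latencies.

    Bucket 0 counts latencies below bucket_edges[0]; the last bucket counts
    latencies at or above the final edge. The edges are arbitrary assumptions.
    """
    hist = Counter()
    for s in samples:
        if s.dc_miss:
            # Number of edges the latency has reached gives the bucket index.
            hist[sum(s.dc_miss_latency >= edge for edge in bucket_edges)] += 1
    return [hist[i] for i in range(len(bucket_edges) + 1)]


def query(samples, predicate):
    """Post-processing query: keep the samples that satisfy any predicate."""
    return [s for s in samples if predicate(s)]


# Made-up sample data, for illustration only.
samples = [
    IbsOpSample(0x400100, True, 120, 4, 10, 0x7F0000001000, 0x1000),
    IbsOpSample(0x400100, True, 300, 6, 200, 0x7F0000002000, 0x2000),
    IbsOpSample(0x400104, False, 12, 2, 0, 0, 0),
    IbsOpSample(0x400108, True, 900, 8, 700, 0x7F0000003040, 0x3040),
]
print(latency_histogram(samples))
slow = query(samples, lambda s: s.dc_miss and s.dc_miss_latency >= 64)
print([hex(s.rip) for s in slow])
```

Because the full sample is retained, a new question (say, grouping `dc_virt_addr` by page to look for hot data regions) is just another predicate or grouping key, with no change to how the data is collected.
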
Further, the organization and use of DC miss addresses is open for investigation. I would _love_ to encourage someone (anyone? anyone?) to take up this investigation. There may also be unforeseen uses -- perhaps driving compile-time optimizations. The existing derived events do not adequately support new applications of IBS data. Thus, I would encourage kernel-level support that passes IBS data along without modification.

Filtering. After our initial experience with IBS, we see the need for filtering. One approach is to collect and report only those IBS register values that are needed to support a certain kind of analysis. For example, if the DC miss addresses are not needed, why collect them? Suravee and Robert Richter (both terrific colleagues) have been investigating this, so I will defer to their analysis and comments.

Software randomization. We've found that software randomization of the sampling period and/or current count is needed to avoid certain situations where the pipeline and the sampling process get into a periodic hard-loop that affects the distribution of IBS op samples. BTW, forcing those low order four bits to zero occasionally has a negative effect on op distribution.

IBS future extensions

Of course, I can't discuss specific new features. However, here are some possible variations:

* The current count and max count values may become longer.
* New event flags may be added.
* Existing event flags may be left out (i.e., not implemented in a family or model).
* New ancillary data (like DC miss latency or DC miss address) may be added. It may be necessary to collect new 64-bit values that do not contain event flags, for example.

Thanks for enduring this long-winded message. I hope that I've communicated some information and requirements, and I'll be more than happy to answer questions about IBS (or get the answers).

-- pj

Dr. Paul Drongowski
AMD CodeAnalyst team
Boston Design Center

-------------------------
The information presented in this reply is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

------------------------------------------------------------------------------
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel