Paul,

Thanks for taking the time to describe in such detail what IBS actually does, how it should be used, and how it is going to evolve.
I found the information quite interesting; there are certain aspects
I did not know. I will forward your message to the LKML thread. The
people I was having this conversation with are not on the perfmon2
mailing list; I had just cc'ed the list.

Thanks.

On Wed, Jun 24, 2009 at 8:20 PM, Drongowski, Paul<paul.drongow...@amd.com> wrote:
>
> Hi --
>
> I'm sorry to be joining this discussion so late. A few of my
> colleagues pointed me toward the current thread on IBS and I've tried
> to catch up by reading the archives. A short self-introduction: I'm a
> member of the AMD CodeAnalyst team, Ravi Bhargava and I wrote Appendix G
> (concerning IBS) of the AMD Software Optimization Guide for AMD
> Family 10h Processors and at one point in my life, I worked on DCPI
> (using ProfileMe).
>
> First off, Stephane and Rob have done a good job representing IBS and
> also ProfileMe. Thanks, guys!
>
> Rather than grossly disturb the current discussion, I'd like to offer
> a few points of clarification and maybe a little useful history.
>
> Peter's observation that IBS is a "mismatch with the traditional one
> value per counter thing" is quite apt. IBS has similarities to
> ProfileMe. Stephane's citation of the Itanium Data-EAR and
> Instruction-EAR are also very relevant as examples of profile data
> that do not fit with the "one value per counter thing."
>
> IBS Fetch.
>
> IBS fetch sampling does not exactly sample x86 instructions. The
> current fetch counter counts fetch operations where a fetch operation
> may be a 32-byte fetch block (on AMD Family 10h) or it may be a
> fetch operation initiated by a redirection such as a branch.
> A fetch block is 32 bytes of instruction information which is
> sent to the instruction decoder. The fetch address that is reported
> may either be the start of a valid x86 instruction or the start of
> a fetch block. In the second case, the address may be in the middle of
> an x86 instruction.
>
> IBS fetch sampling produces a number of event flags (e.g., instruction
> cache miss), but it also produces the latency (in cycles) of the
> fetch operation. The latencies can be accumulated in either
> descriptive statistics, or better, in a histogram since descriptive
> statistics don't really show where an access is hitting in the
> memory hierarchy. BTW, even though an IBS fetch sample may be reported,
> the decoder may not use the instruction bytes due to a late arriving
> redirection.
>
> IBS Op.
>
> IBS op sampling does not sample x86 instructions. It samples the
> ops which are issued from x86 instructions. Some x86 instructions
> issue more than one op. Microcoded instructions are particularly
> thorny as a single REP MOV may issue many ops, thereby affecting
> the number of samples that fall on them (i.e., disproportionate to the
> execution frequency of the surrounding basic block.) The number of
> ops issued is data dependent and is unpredictable. Appendix C
> of the Software Optimization Guide lists the number of ops issued
> from x86 instructions (one, two or many).
>
> Beginning with AMD Family 10h RevC, there are two op selection
> (counting) modes for IBS: cycles-counting and dispatched op counting.
>
> Cycles-counting is _not_ equivalent to CPU_CLK_UNHALTED -- it is
> not a precise version of the performance monitoring counter (PMC)
> event (event select 0x076). In cycles-mode, when the current count
> reaches the max count, the next available dispatch group of ops is
> selected and a secondary mechanism selects an op within the dispatch
> group. The dispatch group may contain one, two or three ops. If you
> smell a rat, you're right. The secondary scheme negatively affects
> the desired pseudo-random selection scheme. Also, if a dispatch
> group is not available, the sample is skipped and the counting
> process is reset.
>
> Further, cycles-mode selection is affected by pipeline stalls.
> This affects the distribution of IBS op samples taken in cycles-mode.
> With cycles-mode, one instruction may have more data cache miss events,
> but the underlying sampling basis is so skewed that the comparison is
> not meaningful. IBS op samples are generated only for ops that retire;
> tagged ops on a "wrong path" are flushed without producing a sample.
> Overall, I cannot personally say that IBS cycles-mode produces a precise
> equivalent to CPU_CLK_UNHALTED. I cannot endorse or recommend
> its use in this way.
>
> Given these issues, dispatched op counting was added in RevC. This mode
> is the _preferred_ mode. Ops are counted as they are dispatched and the
> op that triggers the max count threshold is selected and tagged.
> Dispatched op mode produces a distribution of op samples that reflects
> the execution frequency of instructions/basic blocks. DirectPath
> Double and VectorPath (microcoded) x86 instructions which issue more than
> one op will still be oversampled, however. The distribution is important
> because it allows meaningful comparison of event counts between
> instructions.
>
> Even though the distribution of samples in dispatched op mode reflects
> execution frequency, it is not a substitute for RETIRED_INSTRUCTIONS
> (event select 0x0c0). The number of IBS op samples in some workloads,
> especially those with certain kinds of stack access and microcoded
> instructions, diverges greatly from RETIRED_INSTRUCTIONS.
>
> IBS is what it is.
>
> IBS derived events
>
> Since ProfileMe and Data EAR didn't exactly take the world by storm,
> (oh, yeah, I worked with HP Caliper on Itanium for a while, too ;-),
> profiling infrastructures like OProfile and CodeAnalyst are largely
> based on the PMC sampling model.
>
> In order to get IBS into practice as quickly as possible, we defined
> IBS derived events.
> This allowed us to implement basic support for
> IBS in both OProfile and CodeAnalyst without major changes in
> infrastructure. I should note that translation from raw IBS bits to
> derived events is and was always intended to be performed by user
> space tools. I personally believe that translation should not be
> performed in the kernel -- kernel support should be simple and
> lightweight.
>
> An IBS op sample is a small "packet" of profile data:
>
>     A bunch of event flags (data cache miss, etc.)
>     Tag-to-retire time (cycles)
>     Completion-to-retire (cycles)
>     DC miss latency (cycles)
>     DC miss addresses (64-bit virtual and physical addresses)
>
> These entities can be used to compute latency distributions,
> memory access maps, etc. IBS enables new kinds of analysis such
> as data-centric profiling that identifies hot data regions (that
> could be used to tune data layout in NUMA environment).
>
> Quite frankly, at this juncture, I find the derived event model to be
> too limiting. DCPI had a much different way of organizing ProfileMe
> data that allowed flexible formulation of queries during
> post-processing -- something that cannot be done with the derived
> event approach.
>
> Further, the organization and use of DC miss addresses is open for
> investigation. I would _love_ to encourage someone (anyone? anyone?)
> to take up this investigation. There may also be unforeseen uses --
> perhaps driving compile-time optimizations. The existing derived events
> do not adequately support new applications of IBS data. Thus, I would
> encourage kernel-level support that passes IBS data along without
> modification.
>
> Filtering.
>
> After our initial experience with IBS, we see the need for filtering.
> One approach is to collect and report only those IBS register values
> that are needed to support a certain kind of analysis. For example,
> if the DC miss addresses are not needed, why collect them?
> Suravee and Robert Richter (both terrific colleagues) have been
> investigating this, so I will defer to their analysis and comments.
>
> Software randomization.
>
> We've found that software randomization of the sampling period and/or
> current count is needed to avoid certain situations where the pipeline
> and the sampling process get into a periodic hard-loop that affects
> the distribution of IBS op samples. BTW, forcing those low order four
> bits to zero occasionally has a negative effect on op distribution.
>
> IBS future extensions
>
> Of course, I can't discuss specific new features. However, here are
> some possible variations:
>
> * The current count and max count values may become longer.
> * New event flags may be added.
> * Existing event flags may be left out (i.e., not implemented
>   in a family or model)
> * New ancillary data (like DC miss latency or DC miss address)
>   may be added.
>
> It may be necessary to collect new 64-bit values that do not contain
> event flags, for example.
>
> Thanks for enduring this long-winded message. I hope that I've
> communicated some information and requirements, and I'll be more than
> happy to answer questions about IBS (or get the answers).
>
> -- pj
>
> Dr. Paul Drongowski
> AMD CodeAnalyst team
> Boston Design Center
>
> -------------------------
> The information presented in this reply is for informational purposes
> only and may contain technical inaccuracies, omissions and
> typographical errors. Links to third party sites are for convenience
> only, and no endorsement is implied.
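
As a small illustration of the point above about accumulating fetch
latencies in a histogram rather than in descriptive statistics, here is
a minimal Python sketch. The sample values, the power-of-two bucketing
and the function name latency_histogram are assumptions made up for
this example; none of them come from CodeAnalyst, OProfile or the IBS
hardware documentation.

```python
from collections import Counter
from statistics import mean


def latency_histogram(latencies):
    """Bucket cycle latencies by power of two: bucket k holds [2**k, 2**(k+1))."""
    hist = Counter()
    for cycles in latencies:
        # bit_length() - 1 is floor(log2(cycles)) for cycles >= 1;
        # a zero-cycle latency is folded into bucket 0.
        hist[max(cycles, 1).bit_length() - 1] += 1
    return dict(sorted(hist.items()))


# Made-up sample (an assumption): most fetches are fast, a few are very slow.
samples = [3] * 90 + [200] * 10

print("mean latency:", mean(samples))
for bucket, count in latency_histogram(samples).items():
    low, high = 2 ** bucket, 2 ** (bucket + 1) - 1
    print(f"{low:>4}-{high:<4} cycles: {count}")
```

With this made-up data the mean is 22.7 cycles, a latency that no
individual sample has, while the histogram keeps the two groups apart
(the 2-3 and 128-255 cycle buckets). Those separate groups are what one
would then try to relate to levels of the memory hierarchy.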

------------------------------------------------------------------------------
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel