Hi --
I'm sorry to be joining this discussion so late. A few of my
colleagues pointed me toward the current thread on IBS and I've tried
to catch up by reading the archives. A short self-introduction: I'm a
member of the AMD CodeAnalyst team. Ravi Bhargava and I wrote
Appendix G (concerning IBS) of the AMD Software Optimization Guide
for AMD Family 10h Processors, and at one point in my life I worked
on DCPI (using ProfileMe).
First off, Stephane and Rob have done a good job representing IBS and
also ProfileMe. Thanks, guys!
Rather than grossly disturb the current discussion, I'd like to offer
a few points of clarification and maybe a little useful history.
Peter's observation that IBS is a "mismatch with the traditional one
value per counter thing" is quite apt. IBS has similarities to
ProfileMe. Stephane's citations of the Itanium Data-EAR and
Instruction-EAR are also very relevant as examples of profile data
that do not fit with the "one value per counter thing."
IBS Fetch.
IBS fetch sampling does not exactly sample x86 instructions. The
current fetch counter counts fetch operations, where a fetch
operation may be a 32-byte fetch block (on AMD Family 10h) or it may
be a fetch operation initiated by a redirection such as a branch.
A fetch block is 32 bytes of instruction information which is
sent to the instruction decoder. The fetch address that is reported
may either be the start of a valid x86 instruction or the start of
a fetch block. In the second case, the address may be in the middle
of an x86 instruction.
IBS fetch sampling produces a number of event flags (e.g.,
instruction cache miss), but it also produces the latency (in cycles)
of the fetch operation. The latencies can be accumulated either in
descriptive statistics or, better, in a histogram, since descriptive
statistics don't really show where an access is hitting in the
memory hierarchy. BTW, even though an IBS fetch sample may be
reported, the decoder may not use the instruction bytes due to a
late-arriving redirection.
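
To illustrate the histogram point, here is a minimal Python sketch. It
is not taken from CodeAnalyst or any other tool; the function name, the
power-of-two binning and the latency values are all made up for the
example. The mean of the sample set lands between the two clusters and
describes neither of them, while the histogram shows both.

```python
from collections import Counter
from statistics import mean

def latency_histogram(latencies):
    """Bucket fetch latencies (cycles) into power-of-two bins.

    Returns a dict mapping the lower bound of each bin to a sample count.
    Bin 0 holds latency 0; bin 2**k holds latencies in [2**k, 2**(k+1)).
    """
    bins = Counter()
    for cycles in latencies:
        lower = 0 if cycles == 0 else 1 << (cycles.bit_length() - 1)
        bins[lower] += 1
    return dict(sorted(bins.items()))

# Made-up latencies: mostly fast fetches plus a couple of very slow ones.
samples = [3, 4, 4, 5, 3, 4, 210, 198, 4, 5]
print("mean latency:", mean(samples))
for lower, count in latency_histogram(samples).items():
    print(f"{lower:>4}+ cycles: {'#' * count}")
```

Power-of-two bins are just one convenient choice here; bin edges
matched to the expected latencies of each level of the memory hierarchy
would work as well.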
IBS Op.
IBS op sampling does not sample x86 instructions. It samples the
ops which are issued from x86 instructions. Some x86 instructions
issue more than one op. Microcoded instructions are particularly
thorny as a single REP MOV may issue many ops, thereby affecting
the number of samples that fall on them (i.e., disproportionate to
the execution frequency of the surrounding basic block). The number
of ops issued is data dependent and is unpredictable. Appendix C
of the Software Optimization Guide lists the number of ops issued
from x86 instructions (one, two or many).
Beginning with AMD Family 10h RevC, there are two op selection
(counting) modes for IBS: cycles-counting and dispatched op
counting.
Cycles-counting is _not_ equivalent to CPU_CLK_UNHALTED -- it is
not a precise version of the performance monitoring counter (PMC)
event (event select 0x076). In cycles-mode, when the current count
reaches the max count, the next available dispatch group of ops is
selected and a secondary mechanism selects an op within the dispatch
group. The dispatch group may contain one, two or three ops. If you
smell a rat, you're right. The secondary scheme negatively affects
the desired pseudo-random selection scheme. Also, if a dispatch
group is not available, the sample is skipped and the counting
process is reset.
Further, cycles-mode selection is affected by pipeline stalls. This
affects the distribution of IBS op samples taken in cycles-mode.
With cycles-mode, one instruction may show more data cache miss
events than another, but the underlying sampling basis is so skewed
that the comparison is not meaningful. IBS op samples are generated
only for ops that retire; tagged ops on a "wrong path" are flushed
without producing a sample. Overall, I cannot personally say that
IBS cycles-mode produces a precise equivalent to CPU_CLK_UNHALTED.
I cannot endorse or recommend its use in this way.
Given these issues, dispatched op counting was added in RevC. This
mode is the _preferred_ mode. Ops are counted as they are dispatched
and the op that triggers the max count threshold is selected and
tagged. Dispatched op mode produces a distribution of op samples
that reflects the execution frequency of instructions/basic blocks.
DirectPath Double and VectorPath (microcoded) x86 instructions which
issue more than one op will still be oversampled, however. The
distribution is important because it allows meaningful comparison of
event counts between instructions.
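
The oversampling of multi-op instructions can be seen in a toy model.
The Python sketch below is only an illustration under simplifying
assumptions: the instruction names and op counts are hypothetical, the
loop is an idealized op stream with no stalls or flushes, and a fixed
period is used so the result is exactly reproducible (real sampling
wants a randomized period).

```python
from collections import Counter
from itertools import cycle, islice

def sample_every_nth_op(loop_body, period, total_ops):
    """Tag every `period`-th dispatched op and credit its instruction.

    loop_body: list of (instruction_name, ops_issued) pairs that execute
    repeatedly, in order, as the body of a loop.
    """
    # Expand one loop iteration into the sequence of ops it dispatches.
    ops_per_iteration = [name for name, n_ops in loop_body for _ in range(n_ops)]
    stream = islice(cycle(ops_per_iteration), total_ops)
    hits = Counter()
    for count, name in enumerate(stream, start=1):
        if count % period == 0:  # the op that reaches the max count
            hits[name] += 1
    return hits

# Hypothetical loop: every instruction executes once per iteration,
# but insn_b issues two ops (think of a DirectPath Double instruction).
loop = [("insn_a", 1), ("insn_b", 2), ("insn_c", 1), ("insn_d", 1)]
hits = sample_every_nth_op(loop, period=7, total_ops=35_000)
for name, n in sorted(hits.items()):
    print(name, n)
```

All four instructions execute equally often, yet insn_b collects twice
as many samples as each of the others because it contributes two ops
per iteration.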
Even though the distribution of samples in dispatched op mode
reflects execution frequency, it is not a substitute for
RETIRED_INSTRUCTIONS (event select 0x0c0). The number of IBS op
samples in some workloads, especially those with certain kinds of
stack access and microcoded instructions, diverges greatly from
RETIRED_INSTRUCTIONS.
IBS is what it is.
IBS derived events
Since ProfileMe and Data EAR didn't exactly take the world by storm,
(oh, yeah, I worked with HP Caliper on Itanium for a while, too ;-),
profiling infrastructures like OProfile and CodeAnalyst are largely
based on the PMC sampling model.
In order to get IBS into practice as quickly as possible, we defined
IBS derived events. This allowed us to implement basic support for
IBS in both OProfile and CodeAnalyst without major changes in
infrastructure. I should note that translation from raw IBS bits to
derived events is and was always intended to be performed by user
space tools. I personally believe that translation should not be
performed in the kernel -- kernel support should be simple and
lightweight.
An IBS op sample is a small "packet" of profile data:
* A bunch of event flags (data cache miss, etc.)
* Tag-to-retire time (cycles)
* Completion-to-retire (cycles)
* DC miss latency (cycles)
* DC miss addresses (64-bit virtual and physical addresses)
These entities can be used to compute latency distributions,
memory access maps, etc. IBS enables new kinds of analysis such
as data-centric profiling that identifies hot data regions (that
could be used to tune data layout in a NUMA environment).
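
As a rough sketch of the data-centric idea, the Python below models the
"packet" as a small record and ranks virtual pages by sampled data
cache misses. The class and field names are assumptions made for this
example (they are not IBS register names), the single dc_miss flag
stands in for the full set of event flags, 4 KiB pages are assumed, and
the addresses are invented.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class IbsOpSample:
    """Illustrative container; field names are assumptions, not register names."""
    dc_miss: bool          # one of the event flags
    tag_to_retire: int     # cycles
    comp_to_retire: int    # cycles
    dc_miss_latency: int   # cycles
    dc_linear_addr: int    # 64-bit virtual address
    dc_phys_addr: int      # 64-bit physical address

PAGE_SHIFT = 12  # assumes 4 KiB pages

def hot_pages(samples, top=3):
    """Rank virtual pages by the number of sampled data cache misses."""
    misses = Counter(
        s.dc_linear_addr >> PAGE_SHIFT for s in samples if s.dc_miss
    )
    return misses.most_common(top)

# Made-up samples, in field order: miss flag, tag-to-retire,
# completion-to-retire, miss latency, virtual address, physical address.
samples = [
    IbsOpSample(True, 40, 5, 120, 0x7F001010, 0x1A001010),
    IbsOpSample(True, 38, 4, 110, 0x7F001FC0, 0x1A001FC0),
    IbsOpSample(False, 6, 1, 0, 0x7F002000, 0x1A002000),
    IbsOpSample(True, 55, 9, 200, 0x7F005040, 0x1B005040),
]
for page, count in hot_pages(samples):
    print(f"page {page:#x}: {count} sampled misses")
```

Grouping by physical address instead (for example, by the memory node
that owns the page) would be the natural variation for NUMA tuning, but
that mapping is platform-specific and is left out here.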
Quite frankly, at this juncture, I find the derived event model to
be too limiting. DCPI had a much different way of organizing
ProfileMe data that allowed flexible formulation of queries during
post-processing -- something that cannot be done with the derived
event approach. Further, the organization and use of DC miss
addresses is open for investigation. I would _love_ to encourage
someone (anyone? anyone?) to take up this investigation. There may
also be unforeseen uses -- perhaps driving compile-time
optimizations. The existing derived events do not adequately support
new applications of IBS data. Thus, I would encourage kernel-level
support that passes IBS data along without modification.
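
To make the contrast with derived events concrete: if raw samples
survive until post-processing, a question that nobody anticipated at
collection time can still be asked. The Python sketch below is not how
DCPI organized its data; the record layout, the field names and the
query helper are hypothetical, and the values are invented.

```python
def query(samples, where, value):
    """Return value(s) for every raw sample s that satisfies where(s)."""
    return [value(s) for s in samples if where(s)]

# Made-up raw samples kept as plain records (field names are assumptions).
raw = [
    {"dc_miss": True, "dc_miss_latency": 180, "tag_to_retire": 220},
    {"dc_miss": True, "dc_miss_latency": 12, "tag_to_retire": 30},
    {"dc_miss": False, "dc_miss_latency": 0, "tag_to_retire": 150},
    {"dc_miss": True, "dc_miss_latency": 95, "tag_to_retire": 140},
]

# A question formulated after the fact: miss latencies of only those
# ops that also took more than 100 cycles from tag to retire.
slow_miss_latencies = query(
    raw,
    where=lambda s: s["dc_miss"] and s["tag_to_retire"] > 100,
    value=lambda s: s["dc_miss_latency"],
)
print(slow_miss_latencies)
```

Once samples have been folded into per-event counts, the correlation
between fields of the same sample is gone and a query like this one
cannot be reconstructed.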
Filtering.
After our initial experience with IBS, we see the need for
filtering. One approach is to collect and report only those IBS
register values that are needed to support a certain kind of
analysis. For example, if the DC miss addresses are not needed, why
collect them? Suravee and Robert Richter (both terrific colleagues)
have been investigating this, so I will defer to their analysis and
comments.
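
One way to picture this kind of filtering is a per-analysis list of
needed fields. The Python below is only a sketch of the idea, not a
description of what Suravee and Robert are investigating; the analysis
names, field names and values are all hypothetical.

```python
# Hypothetical field groups; the names are illustrative, not register names.
FIELDS_BY_ANALYSIS = {
    "latency": {"event_flags", "tag_to_retire", "comp_to_retire",
                "dc_miss_latency"},
    "data_centric": {"event_flags", "dc_linear_addr", "dc_phys_addr"},
}

def filter_sample(sample, analysis):
    """Keep only the fields that the named analysis needs."""
    wanted = FIELDS_BY_ANALYSIS[analysis]
    return {name: value for name, value in sample.items() if name in wanted}

# A made-up full sample.
full = {
    "event_flags": 0x1,
    "tag_to_retire": 140,
    "comp_to_retire": 9,
    "dc_miss_latency": 95,
    "dc_linear_addr": 0x7F001010,
    "dc_phys_addr": 0x1A001010,
}
slim = filter_sample(full, "latency")
print(sorted(slim))
```

In a real collector the saving would come from not reading or copying
the unneeded registers in the first place, rather than from dropping
them afterwards as this sketch does.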
Software randomization.
We've found that software randomization of the sampling period
and/or current count is needed to avoid certain situations where the
pipeline and the sampling process get into a periodic hard-loop that
affects the distribution of IBS op samples. BTW, forcing those
low-order four bits to zero occasionally has a negative effect on op
distribution.
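
A minimal sketch of period randomization in Python follows. The
function name, the base period and the jitter width are assumptions
chosen for illustration; how the value is written to the hardware, and
which low-order bits the hardware honors, is outside the scope of the
sketch.

```python
import random

def randomized_max_count(base, jitter, rng=random):
    """Pick a sampling period uniformly from [base - jitter, base + jitter].

    base and jitter are in ops (or cycles, depending on the counting mode).
    rng may be the random module or a random.Random instance.
    """
    if not 0 <= jitter < base:
        raise ValueError("jitter must be non-negative and smaller than base")
    return base + rng.randint(-jitter, jitter)

# Choose a fresh period each time the counter is re-armed.
rng = random.Random(2009)  # seeded only to make the demo repeatable
periods = [randomized_max_count(100_000, 4_096, rng) for _ in range(5)]
print(periods)
```

The point is simply that consecutive periods differ, so the sampling
process cannot stay in lockstep with a loop whose op count divides the
nominal period.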
IBS future extensions
Of course, I can't discuss specific new features. However, here are
some possible variations:
* The current count and max count values may become longer.
* New event flags may be added.
* Existing event flags may be left out (i.e., not implemented
  in a family or model).
* New ancillary data (like DC miss latency or DC miss address)
  may be added.
It may be necessary to collect new 64-bit values that do not contain
event flags, for example.
Thanks for enduring this long-winded message. I hope that I've
communicated some information and requirements, and I'll be more than
happy to answer questions about IBS (or get the answers).
-- pj
Dr. Paul Drongowski
AMD CodeAnalyst team
Boston Design Center
-------------------------
The information presented in this reply is for informational purposes
only and may contain technical inaccuracies, omissions and
typographical errors. Links to third party sites are for convenience
only, and no endorsement is implied.