Summary: Please add support for 'Lightweight Profiling', which
                    adds a set of user-controlled counters to the AMD64
                    architecture
           Product: D
           Version: future
          Platform: x86_64
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD

--- Comment #0 from nick barbalich <> 2010-01-25 16:44:15 PST ---
Late in 2007, AMD announced Lightweight Profiling as a proposed extension to
the AMD64 architecture that would allow an application to gather performance
statistics about itself with low overhead. We [AMD] posted the preliminary
specification and asked for feedback from the developer community. Much to our
delight, many of you responded with comments, criticisms, and suggestions on
the proposal. We've read all of your feedback, and last week we posted the
current version of the LWP specification. The announcement and the link to the
spec are here. Thanks to all of you who helped us out.

What came before...

It's important to be able to measure the details of a program's performance in
order to find ways to speed it up. Until now, there have been just two ways to
do this. The first is via instrumentation, i.e., adding code to the program to
watch the clock, or the cycle counter, or just to count the number of times an
instruction or loop is executed. Instrumentation can be added by the programmer
or by a compiler. Unfortunately, it seriously perturbs the application, and the
instrumented code usually doesn't have the same characteristics as the original
code, especially when dealing with the data and instruction caches. Also,
instrumentation can't observe the hardware caches, so it can't gather data
about cache behavior.

The second traditional method of monitoring performance is to use the hardware
performance counters. These count hardware events and generate an interrupt
after a programmed number of events have happened. The counters can report on
events that are too hard to instrument (like counting each x86 instruction) or
are not visible to software (like cache misses). These counters are used by the
AMD CodeAnalyst Performance Analyzer and provide deep insight into application
and system performance. However, each time a data sample is gathered, the
processor must take an interrupt to a kernel-mode driver, and that takes
hundreds or thousands of cycles. The driver, by simply executing, changes the
contents of the data cache and the instruction cache and may perturb the
application's performance. The counters can only be configured, started, and
stopped from kernel mode, so an application must call a driver or the operating
system to control them. Finally, some systems do not context-switch the
performance counters when changing threads or processes, and on those systems,
performance monitoring can only be done globally by a single user at a time.

Introducing LWP

After reading about current technology, you might think that an ideal
performance monitor should:

    * Operate entirely in user mode
    * Cause little or no perturbation of the application
    * Be controlled separately for each thread
    * Have low overhead to allow for higher sampling rates

And that describes LWP!

Lightweight Profiling adds a set of user-controlled counters to the AMD64
architecture. They can monitor multiple events simultaneously. An application
thread starts profiling by providing the address of an LWP control block
(LWPCB) as the operand to the new LLWPCB instruction. The contents of the LWPCB
specify which events to count and how often to count them. It also points to a
ring buffer in the application's memory into which the hardware will store
event records. That's it.
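
Conceptually, the control block bundles the event selection, the sampling
intervals, and the ring-buffer location. The sketch below is purely
illustrative; the field names and layout are assumptions for exposition, not
the LWPCB format defined in AMD's specification:

```python
from dataclasses import dataclass

# Illustrative model of what an LWP control block carries. The real LWPCB
# is a packed in-memory structure whose exact layout is defined by the
# AMD LWP specification, not by this sketch.
@dataclass
class LwpControlBlock:
    events: dict        # event name -> sampling interval (events per record)
    buffer_base: int    # start address of the event ring buffer (user memory)
    buffer_size: int    # total size of the ring buffer, in bytes
    head: int = 0       # hardware-written offset of the next free slot
    tail: int = 0       # software-written offset of the oldest unread record

# A thread would fill in such a block and pass its address to LLWPCB.
cb = LwpControlBlock(events={"instructions_retired": 100_000},
                     buffer_base=0x7F00_0000_0000,
                     buffer_size=64 * 1024)
```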

Once started, LWP counts the specified events. When an event counter
underflows, it stores an event record at the head of the ring buffer and resets
the counter. (If requested, LWP randomizes the bottom bits of the new counter
value to prevent "beating" against constant length loops.) LWP stores the
record without interrupting the flow of the program, so the only perturbation
to the program's performance is writing the record (usually affecting only a
single data cache line) and a few cycles to perform the write. The record
contains the event type, the address of the instruction that caused the
underflow, and other information about the event. All event types share one
ring buffer and can be sorted out by the event type field in the record.
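
A toy software model of that counting loop (purely illustrative; real LWP does
this in hardware with no per-event instruction overhead, and the record format
here is invented for the sketch):

```python
import random

def run_counter(interval, events, randomize_bits=4, seed=0):
    """Count down once per event; on underflow, emit a record and reset.

    Randomizing the low bits of the reloaded counter (as LWP can do on
    request) staggers the sample points so they don't "beat" against
    loops whose trip count divides the interval exactly.
    """
    rng = random.Random(seed)
    records = []
    counter = interval
    for addr in events:              # each event comes from some instruction
        counter -= 1
        if counter < 0:              # underflow: store an event record
            records.append({"event": "sample", "addr": addr})
            counter = interval - rng.randrange(1 << randomize_bits)
    return records

# Sampling every ~1000 events over 10,000 events yields roughly 10 records.
recs = run_counter(interval=1000, events=range(10_000))
```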

Of course, eventually the buffer will fill up. What then? Well, a program has
two options for emptying the ring buffer. First, it can simply poll the buffer
and remove event records from the tail of the ring. When software rewrites the
tail pointer, the LWP hardware knows it can reuse the newly emptied region of
the ring buffer. Since the buffer is in user memory, the program can even share
the memory with another process, and that second process can be responsible for
draining the buffer. Second, the application can specify that it wants LWP to
generate an interrupt when the ring buffer is filled past a certain threshold.
For instance, it can configure a buffer to hold 10,000 event records and tell
LWP to interrupt whenever there are more than 9,000 records in the buffer. The
interrupt does indeed perturb the program, but it does so 1/9000th as often as
the traditional performance counters would. Better still, since the buffer is
in user memory, the application can catch the interrupt and do whatever it
wants with the data. It can store it to disk for later analysis, or it can
process it immediately and even try to fix performance problems as they occur.
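
Both draining schemes can be sketched against a shared ring. This is a
software model only: the class, its `threshold` callback, and the record
format are assumptions standing in for the hardware mechanism and the
interrupt it can raise:

```python
from collections import deque

class EventRing:
    """Toy model of the LWP ring buffer: 'hardware' appends at the head,
    software consumes from the tail and writes the tail pointer back."""

    def __init__(self, capacity, threshold, on_threshold=None):
        self.buf = deque()
        self.capacity = capacity
        self.threshold = threshold
        self.on_threshold = on_threshold   # models the optional interrupt

    def hw_store(self, record):
        if len(self.buf) < self.capacity:
            self.buf.append(record)
        if self.on_threshold and len(self.buf) > self.threshold:
            # Fires once per thousands of records, not once per record.
            self.on_threshold(self)

    def drain(self):
        """Polling consumer: pop every record from the tail of the ring."""
        out = []
        while self.buf:
            out.append(self.buf.popleft())
        return out

# 10,000-record buffer, "interrupt" when more than 9,000 records are queued.
drained = []
ring = EventRing(capacity=10_000, threshold=9_000,
                 on_threshold=lambda r: drained.extend(r.drain()))
for i in range(20_000):
    ring.hw_store({"seq": i})
```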

In addition, LWP is a per-thread feature. Each thread on the system can be
monitoring different events at different rates without interference. If a
thread is not using LWP, there is no impact on its performance even if other
threads have LWP active.

Some LWP Details

The LWP events are a small subset of the events available in the traditional
performance counters. They include Instructions Retired, Branches Retired, and
DCache Misses. The Branches Retired event can be filtered by whether the branch
is direct or indirect, conditional or unconditional, or other criteria. It
captures the target address of the branch, a useful value when looking at
indirect branches. The DCache miss event can be filtered by cache level to
capture only "expensive" cache misses.

One exciting feature of LWP is the ability to insert events into the ring
buffer under program control. There are two new instructions to do this:

    * LWPINS inserts a record into the ring buffer containing data taken from
the arguments to the instruction. A program can use LWPINS to insert a marker
to indicate an important event, such as loading or unloading a shared library,
that influences the way addresses should be interpreted in subsequent event
records.
    * LWPVAL uses an event counter and decrements the counter each time it is
executed, much the way the hardware event counters work. When the counter
underflows, it inserts a record into the ring buffer containing data from its
arguments. A program uses LWPVAL to implement a technique called value
profiling. For instance, it can profile the divisor of a commonly executed DIV
instruction and if the data show that the divisor is frequently the same
number, it can rewrite the instruction to test for that value and execute an
optimized code sequence. Similarly, it can profile the target of a hot indirect
branch and generate better code if one way of the branch is dominant.
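
The value-profiling idea behind LWPVAL can be approximated in software like
this (a sketch of the analysis a profiler would run over the collected
records, not of the instruction itself; the function and its `dominance`
parameter are illustrative):

```python
from collections import Counter

def profile_values(samples, dominance=0.9):
    """Given sampled operand values (e.g. divisors captured via LWPVAL
    records), return a value worth specializing for if one dominates."""
    counts = Counter(samples)
    value, hits = counts.most_common(1)[0]
    return value if hits / len(samples) >= dominance else None

# If 95% of sampled divisors are 8, a dynamic optimizer could rewrite
# `x / d` as `x >> 3 if d == 8 else x / d`.
hot = profile_values([8] * 95 + [3] * 5)
```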

Who will use LWP?

LWP can be used in many different application environments. These include:

    * Managed Runtime Environment: Managed Runtimes (MRTEs) are programming
environments such as Java and the Microsoft® .NET Framework. These environments
have the ability to generate AMD x86 or x64 code for routines coded in a high
level managed language (such as Java or C#), and they can do that on the fly as
a program is running. The MRTE can enable LWP and periodically look for
performance problems. If (when!) it finds them, it can generate better code for
the hot spots and improve the program's overall performance. LWP is lightweight
enough that it can run continuously.
    * Dynamic Optimizer: A Dynamic Optimizer is a program that monitors an
application and attempts to improve its performance by modifying it as it runs.
In this case, the target application is compiled to native code from a
traditional language like C or C++. The Dynamic Optimizer can gather
performance data without affecting the flow of control in the application.
    * Compiler Feedback: Most modern compilers have an option to build an
instrumented program which the developer runs to gather information on the
program's performance. Unfortunately, the added instrumentation (and the fact
that optimization levels are often cranked down in a feedback compilation)
perturbs the program so much that what's being measured is substantially
different from the "real" program. With LWP, the compiler can gather statistics
on the program execution without changes, and it can insert LWPVAL instructions
to profile interesting areas without adding a large block of instrumentation
code and without clobbering any registers. If the application runs without
turning on LWP, the LWPVAL instructions act as NOPs and only take a few cycles.

Note the above has been taken from:

The latest revision of the Lightweight Profiling specification document (v3.03)
contains updates that are a direct result of AMD community feedback, and can be
found here:
