[perfmon2] Fwd: [patch] Performance Counters for Linux, v2

stephane eranian Tue, 09 Dec 2008 11:07:20 -0800

---------- Forwarded message ----------
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Tue, Dec 9, 2008 at 2:46 PM
Subject: Re: [patch] Performance Counters for Linux, v2
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], Thomas Gleixner
<[EMAIL PROTECTED]>, [EMAIL PROTECTED], Andrew Morton
<[EMAIL PROTECTED]>, Eric Dumazet <[EMAIL PROTECTED]>,
Robert Richter <[EMAIL PROTECTED]>, Arjan van de Veen
<[EMAIL PROTECTED]>, Peter Anvin <[EMAIL PROTECTED]>, Peter Zijlstra
<[EMAIL PROTECTED]>, Steven Rostedt <[EMAIL PROTECTED]>, David
Miller <[EMAIL PROTECTED]>, Paul Mackerras <[EMAIL PROTECTED]>, Paolo
Ciarrocchi <[EMAIL PROTECTED]>

* stephane eranian <[EMAIL PROTECTED]> wrote:

> > There's a new "counter group record" facility that is a
> > straightforward extension of the existing "irq record" notification
> > type. This record type can be set on a 'master' counter, and if the
> > master counter triggers an IRQ or an NMI, all the 'secondary'
> > counters are read out atomically and are put into the counter-group
> > record. The result can then be read() out by userspace via a single
> > system call. (Based on extensive feedback from Paul Mackerras and
> > David Miller, thanks guys!)
>
> That is unfortunately not generic enough. You need a bit more
> flexibility than master/secondaries, I am afraid.  What tools want is
> to be able to express:
>
>    - when event X overflows, record values of events  J, K
>    - when event Y overflows, record values of events  Z, J

hm, the new group code in perfcounters-v2 can already do this. Have you
tried to use it and it didn't work? If so then that's a bug. Nothing in
the design prevents that kind of group readout.

[ We could (and probably will) enhance the grouping relationship some
 more, but group readouts are a fundamentally inferior mode of
 profiling. (see below for the explanation) ]

> I am not making this up. I know tools that do just that, i.e., that
> collect two distinct profiles in a single run. This is how, for
> instance, you can collect a flat profile and the call graph in one run,
> very much like gprof.

yeah, but it's still the fundamentally wrong thing to do.

Being able to extract high-quality performance information from the
system is the cornerstone of our design, and choosing the right sampling
model permeates the whole issue of single-counter versus group-readout.

I don't think the finer design aspects of kernel support for performance
counters can be argued without being on the same page about this, so
please let me outline our view on these things, in (boringly) verbose
detail - spiked with examples and code as well.

Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
wrong mindset - and cross-sampling counters is a similarly wrong mindset.

When there are two (or more) hw metrics to profile, the best (i.e. the
statistically most stable and most relevant) sampling for the two
statistical variables (say l2_misses versus l2_accesses) is to sample
them independently, each via its own metric. Not via a static 1khz
rate - and not via picking one of the variables to generate samples
for both.

[ Sidenote: this assumes the hw supports such independent sampling. Not
 all CPUs are capable of that, but most modern CPUs are - so let's assume
 it for the sake of argument. ]

Static frequency [time] sampling has a number of disadvantages that
drastically reduce its precision and reduce its utility, and 'group'
sampling where one counter controls the events has similar problems:

- It under-samples rare events such as cachemisses.

 An example: say we have a workload that executes 1 billion instructions
 a second, of which 5000 generate a cachemiss. Only one in 200,000
 instructions generates a cachemiss. At 1000 samples a second, the
 chance for a static sampling IRQ to hit exactly an instruction that
 causes a cachemiss is 1:200 (0.5%) in every second. That is a very low
 probability, and the profile would not be very helpful - even though it
 samples at a seemingly adequate frequency of 1000 samples per second!

 With the per event counters and per event sampling that KernelTop
 uses, we get an event next to the instruction that causes a cachemiss
 with 100% certainty, all the time. The profile and its per instruction
 aspects suddenly become a whole lot more accurate and a whole lot more
 interesting.

- Static frequency and group sampling also run the risk of systematic
 error/skew of sampling if any workload component has any correlation
 with the "1msec" global sampling period.

 For example: say we profile a workload that runs a timer every 20
 msecs. In such a case the profile could be skewed asymmetrically
 against [or in favor of] the timer activity that it does every 20
 milliseconds.

 Good sampling wants the samples to be generated in proportion to the
 variable itself, not proportional to absolute time.

- Static sampling also over-samples when the workload activity goes
 down (when it goes more idle).

 For example: we profile a fluctuating workload that is sometimes only
 0.2% busy, i.e. running only for 2 milliseconds every second. Still we
 keep interrupting it at 1 khz - that can be a very brutal systematic
 skew if the sampling overhead is 2 microseconds, totalling to 2 msecs
 overhead every second - so 50% of what runs on the CPU will be sampling
 code - impacting/skewing the sampled code.

 Good sampling wants to 'follow' the ebb and flow of the actual hw
 events that the CPU has.
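
The numbers in the three examples above can be checked with a few lines
of Python. This is only a sketch: the function names are made up for
illustration, and the 0.5 msec timer handler duration and the two phase
offsets are assumptions that the text does not state.

```python
from fractions import Fraction

def expected_time_sample_hits(instr_per_sec, events_per_sec, samples_per_sec):
    """Expected number of fixed-rate samples per second that land exactly
    on an instruction that generates the event."""
    per_sample = Fraction(events_per_sec, instr_per_sec)
    return samples_per_sec * per_sample

def sampling_share(busy_usec_per_sec, samples_per_sec, overhead_usec):
    """Fraction of the non-idle CPU time that is spent in sampling code."""
    overhead = samples_per_sec * overhead_usec
    return Fraction(overhead, overhead + busy_usec_per_sec)

def timer_hit_fraction(phase_usec, sample_period_usec=1000,
                       timer_period_usec=20000, handler_usec=500,
                       window_usec=1_000_000):
    """Fraction of fixed-period samples that land inside a periodic timer
    handler, for a given phase offset of the sampling clock.
    handler_usec is an assumed value, not taken from the text."""
    hits = total = 0
    for t in range(phase_usec, window_usec, sample_period_usec):
        total += 1
        if t % timer_period_usec < handler_usec:
            hits += 1
    return Fraction(hits, total)

# Rare events: 1 billion instructions/sec, 5000 cachemisses/sec, 1 khz sampling.
print(expected_time_sample_hits(10**9, 5000, 1000))    # 1/200

# Idle-ish workload: 2 msecs busy per second, 1000 samples of 2 usecs each.
print(sampling_share(2000, 1000, 2))                   # 1/2

# 20 msec timer versus 1 msec sampling, at two different phase offsets.
print(timer_hit_fraction(0), timer_hit_fraction(600))  # 1/20 0
```

Under these assumptions the timer handler occupies 2.5% of the time, but
the fixed-rate profile charges it with either 5% or 0% of the samples,
depending purely on how the two clocks happen to line up.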

The best way to sample two metrics such as "cache accesses" and "cache
misses" (or say "cache misses" versus "TLB misses") is to sample the two
variables _independently_, and to build independent histograms out of
them.

The combination (or 'grouping') of the measured variables is thus done at
the output stage _after_ data acquisition, to provide a weighted
histogram (or a split-view double histogram).

For example, in a "l2 misses" versus "l2 accesses" case, the highest
quality of sampling is to use two independent sampling IRQs with such
sampling parameters:

 - one notification every     200 L2 cache misses
 - one notification every  10,000 L2 cache accesses

[ this is a ballpark figure - the sample rate is a function of the
 averages of the workload and the characteristics of the CPU. ]

And at the output stage display a combination of:

 l2_accesses[pc]
 l2_misses[pc]
 l2_misses[pc] / l2_accesses[pc]

Note that if we had a third variable as well - say icache_misses[], we
could combine the three metrics:

 l2_misses[pc] / l2_accesses[pc] / icache_misses[pc]

 ( such a view expresses the miss/access ratio in a branch-weighted
   fashion: it weighs down instructions that also show signs of icache
   pressure and goes for the functions with a high dcache rate but low
   icache pressure - i.e. commonly executed functions with a high data
   miss rate. )
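
A minimal sketch of such an output-stage combination, assuming each
counter has produced its own per-PC histogram of sample counts. The
function names and the sample counts are hypothetical, and this is not
claimed to be the weight formula that kerneltop.c uses: it only scales
each histogram by its sampling period and divides.

```python
def estimated_events(samples, period):
    """Scale per-PC sample counts by the sampling period to estimate how
    many hardware events each PC generated."""
    return {pc: n * period for pc, n in samples.items()}

def miss_ratio(miss_samples, miss_period, access_samples, access_period):
    """Per-PC misses/accesses ratio, built from two histograms that were
    sampled independently. PCs seen in only one histogram are skipped,
    because no ratio can be formed for them."""
    misses = estimated_events(miss_samples, miss_period)
    accesses = estimated_events(access_samples, access_period)
    return {pc: misses[pc] / accesses[pc]
            for pc in misses.keys() & accesses.keys()}

# Hypothetical sample counts, with the periods of 200 misses and
# 10,000 accesses per notification used in the text.
miss_samples = {0x1000: 50, 0x2000: 10}
access_samples = {0x1000: 100, 0x2000: 2}

ratios = miss_ratio(miss_samples, 200, access_samples, 10000)
for pc, r in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{r:8.3f} - {pc:016x}")
```

In this made-up data 0x2000 has five times fewer miss samples than
0x1000 but ends up on top, because almost none of its accesses hit.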

Sampling at a static frequency is acceptable as well in some cases, and
will lead to an output that is usable for some things. It's just not the
best sampling model, and it's not usable at all for certain important
things such as highly derived views, good instruction level profiles or
rare hw events.

I've uploaded a new version of kerneltop.c that has such a multi-counter
sampling model that follows this statistical model:

   http://redhat.com/~mingo/perfcounters/kerneltop.c

Example of usage:

I've started a tbench 64 localhost workload on a 16way x86 box. I want to
check the miss/refs ratio. I first sampled one of the metrics,
cache-references:

$ ./kerneltop -e 2 -c 10000 -C 2

------------------------------------------------------------------------------
 KernelTop:    1311 irqs/sec  [NMI, 10000 cache-refs],  (all, cpu: 2)
------------------------------------------------------------------------------

            events         RIP          kernel function
            ______   ________________   _______________

           5717.00 - ffffffff803666c0 : copy_user_generic_string!
            355.00 - ffffffff80507646 : tcp_sendmsg
            315.00 - ffffffff8050abcb : tcp_ack
            222.00 - ffffffff804fbb20 : ip_rcv_finish
            215.00 - ffffffff8020a75b : __switch_to
            194.00 - ffffffff804d0b76 : skb_copy_datagram_iovec
            187.00 - ffffffff80502b5d : __inet_lookup_established
            183.00 - ffffffff8051083d : tcp_transmit_skb
            160.00 - ffffffff804e4fc9 : eth_type_trans
            156.00 - ffffffff8026ae31 : audit_syscall_exit

Then i checked the characteristics of the other metric [cache-misses]:

$ ./kerneltop -e 3 -c 200 -C 2

------------------------------------------------------------------------------
 KernelTop:    1362 irqs/sec  [NMI, 200 cache-misses],  (all, cpu: 2)
------------------------------------------------------------------------------

            events         RIP          kernel function
            ______   ________________   _______________

           1419.00 - ffffffff803666c0 : copy_user_generic_string!
           1075.00 - ffffffff804e4fc9 : eth_type_trans
           1059.00 - ffffffff804d8baa : dst_release
            949.00 - ffffffff80510004 : tcp_established_options
            841.00 - ffffffff804fbb20 : ip_rcv_finish
            569.00 - ffffffff804ce808 : skb_push
            454.00 - ffffffff80502b5d : __inet_lookup_established
            453.00 - ffffffff805001a3 : ip_queue_xmit
            298.00 - ffffffff804cf5d8 : skb_release_head_state
            247.00 - ffffffff804ce74b : skb_copy_and_csum_dev
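
As a rough cross-check, the irqs/sec figures in the two headers above
can be turned back into approximate event rates by multiplying with the
sampling period. This is a back-of-the-envelope sketch (the function
name is illustrative), and since the two runs were taken one after the
other, the resulting ratio is only an estimate.

```python
def events_per_sec(irqs_per_sec, period):
    """One IRQ is raised per `period` events, so the event rate is the
    IRQ rate multiplied by the period."""
    return irqs_per_sec * period

refs = events_per_sec(1311, 10000)   # cache-refs run
misses = events_per_sec(1362, 200)   # cache-misses run

print(refs, misses)                  # 13110000 272400
print(f"{misses / refs:.1%}")        # 2.1%
```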

then, to get the "combination" view of the two counters, i appended the
two command lines:

 $ ./kerneltop -e 3 -c 200 -e 2 -c 10000 -C 2

------------------------------------------------------------------------------
 KernelTop:    2669 irqs/sec  [NMI, cache-misses/cache-refs],  (all, cpu: 2)
------------------------------------------------------------------------------

            weight         RIP          kernel function
            ______   ________________   _______________

             35.20 - ffffffff804ce74b : skb_copy_and_csum_dev
             33.00 - ffffffff804cb740 : sock_alloc_send_skb
             31.26 - ffffffff804ce808 : skb_push
             22.43 - ffffffff80510004 : tcp_established_options
             19.00 - ffffffff8027d250 : find_get_page
             15.76 - ffffffff804e4fc9 : eth_type_trans
             15.20 - ffffffff804d8baa : dst_release
             14.86 - ffffffff804cf5d8 : skb_release_head_state
             14.00 - ffffffff802217d5 : read_hpet
             12.00 - ffffffff804ffb7f : __ip_local_out
             11.97 - ffffffff804fc0c8 : ip_local_deliver_finish
              8.54 - ffffffff805001a3 : ip_queue_xmit

[ It's interesting to see that a seemingly common function,
 copy_user_generic_string(), got eliminated from the top spots - because
 there are other functions whose relative cachemiss rate is far more
 serious. ]

The above "derived" profile output is relatively stable under kerneltop
with the use of ~2600 sample irqs/sec and the 2 seconds default refresh.
I'd encourage you to try to achieve the same quality of output with
static 2600 hz sampling - it won't work with the kind of event rates i've
worked with above, no matter whether you read out a single counter or a
group of counters, atomically or not. (because we just don't get
notification PCs at the relevant hw events - we get PCs with a time
sample)
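
The difference between "a PC with a time sample" and "a PC at the hw
event" can be illustrated with a small deterministic toy model. Every
detail of it is an assumption made up for this sketch (the two function
names, the 1000-instruction repeating block, the miss positions and the
sampling periods); it does not model real hardware or kerneltop.

```python
from collections import Counter

BLOCK = 1000       # toy workload repeats every 1000 instructions
N = 1_000_000      # total instructions simulated

def function_at(i):
    """90% of the instructions run in 'copy', 10% in 'lookup'."""
    return "lookup" if i % BLOCK >= 900 else "copy"

def is_miss(i):
    """All cachemisses happen inside 'lookup', two per block."""
    return i % BLOCK in (900, 950)

def time_sampled_group_readout(period):
    """Every `period` instructions, read the miss counter and credit the
    misses counted since the previous sample to whatever function is
    running at sample time."""
    credit = Counter()
    misses_since_last = 0
    for i in range(N):
        if is_miss(i):
            misses_since_last += 1
        if i % period == period - 1:
            credit[function_at(i)] += misses_since_last
            misses_since_last = 0
    return credit

def event_sampled(period):
    """Every `period`-th miss, record the function that caused it."""
    hist = Counter()
    seen = 0
    for i in range(N):
        if is_miss(i):
            seen += 1
            if seen % period == 0:
                hist[function_at(i)] += 1
    return hist

by_time = time_sampled_group_readout(777)
by_event = event_sampled(2)
print("time-sampled group readout:", dict(by_time))
print("event-sampled:             ", dict(by_event))
```

In this toy model every miss is caused by 'lookup', and the event-sampled
histogram says exactly that. The time-sampled group readout counts the
same misses but credits most of them to 'copy', simply because that is
where the sampling IRQ usually lands.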

And that is just one 'rare' event type (cachemisses) - if we had two such
sources (say l2 cachemisses and TLB misses) then this type of combined
view would only be possible if we got independent samples from both
hardware events.

And note that once you accept that the highest quality approach is to
sample the hw events independently, all the "group readout" approaches
become a second-tier mechanism. KernelTop uses that model and works just
fine without any group readout and it is making razor sharp profiles,
down to the instruction level.

[ Note that there are special cases where group-sampling can limp along
 with acceptable results: if one of the two counters has so many events
 that sampling by time or sampling by the rare event type gives relevant
 context info. But the moment both event sources are rare, the group
 model breaks down completely and produces meaningless results. It's
 just a fundamentally wrong kind of abstraction to mix together
 unrelated statistical variables. And that's one of the fundamental
 design problems i see with perfmon-v3. ]

       Ingo
