[linuxkernelnewbies] Performance Counters for Linux, v8 [LWN.net]

Peter Teoh Mon, 06 Jul 2009 16:36:17 -0700

http://lwn.net/Articles/336542/

We are pleased to announce version 8 of the performance counters 
subsystem for Linux.


This new subsystem adds a new system call (sys_perf_counter_open()) 
and it provides the new 'perf' tool that makes use of these new 
kernel capabilities.

This subsystem and this tool are new in that they try a new approach 
at integrating all things performance analysis under one roof.

There have been many changes since -v7 - see the shortlog below for 
details.

There are a lot of new contributors to this code. Many thanks go to: 
Peter Zijlstra, Paul Mackerras, Robert Richter, Arnaldo Carvalho de 
Melo, Mike Galbraith, Thomas Gleixner, Wu Fengguang, Jaswinder Singh 
Rajput, Yong Wang, Frederic Weisbecker, Yinghai Lu, Luis Henriques, 
Eric Paris, Arjan van de Ven, Tim Blechmann, Steven Whitehouse, 
Jaswinder Singh, H. Peter Anvin, Hidetoshi Seto, Erdem Aktas and 
Andrew Morton.

The biggest change in -v8 is a re-focusing of our effort towards 
building tools to help various user-space development workflows. The 
latest code and perfcounter-tools deal with all sorts of user-space 
profiling usage models. They are very fast, they are able to look up 
DSO symbols regardless of where they are loaded, and they try to be 
easy to use and easy to configure.

Per-application and system-wide profiling modes are supported, plus 
a number of intermediate modes via the use of inherited counters 
that traverse into child-task hierarchies automatically and 
transparently.
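
As a rough illustration of these modes, here is a minimal sketch of 
how a counter could be described and opened from user-space. It 
assumes the later mainline interface, where the system call was 
renamed to perf_event_open() and the header to <linux/perf_event.h>; 
the -v8 names differ. The helper names cycles_attr_init() and 
counter_open() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a CPU-cycles counter. With inherit
 * set, child tasks created after the open are counted as well. */
void cycles_attr_init(struct perf_event_attr *attr, int inherit)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_HARDWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->inherit = inherit ? 1 : 0;
}

/* Hypothetical thin wrapper around the raw system call (glibc ships
 * no wrapper for it):
 *   pid == 0,  cpu == -1 : the calling task, on any CPU
 *   pid == -1, cpu >= 0  : every task on that CPU (system-wide,
 *                          usually needs extra privilege)
 * Returns a file descriptor, or -1 with errno set. */
int counter_open(struct perf_event_attr *attr, pid_t pid, int cpu)
{
	/* group_fd = -1: no counter group; flags = 0 */
	return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

Per-application profiling with inheritance would then be 
cycles_attr_init(&attr, 1) followed by counter_open(&attr, 0, -1) 
before the workload forks its children. Whether the open succeeds 
depends on the kernel configuration and on the caller's privileges.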

With perfcounters there is no daemon needed: if a perfcounters 
kernel is booted on a supported CPU (all AMD models and Core2 / 
Corei7 / Atom Intel CPUs - both 64-bit and 32-bit user-space are 
supported) then profiling can be done straight away.

Profiling sessions are recorded into local files, which can then be 
analyzed. There are also two high-level overview tools, 'perf stat' 
and 'perf top', which help one get a quick impression of what to 
profile and in what way.

New in -v8 is the 'perf' utility, which has merged all the 
perfcounters utilities and which exposes all the functionality of 
the kernel subsystem in one unified way:

 mercury:~/tip/tools/perf> perf

 usage: perf [--version] [--help] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate   Read perf.data (created by perf record) and display annotated code
   list       List all symbolic event types
   record     Run a command and record its profile into perf.data
   report     Read perf.data (created by perf record) and display the profile
   stat       Run a command and gather performance counter statistics
   top        Run a command and profile it

 See 'perf help COMMAND' for more information on a specific command.

There's also a new, separated "record + report" profiling workflow: 
use "perf record ./my-app" to record a profile, then use "perf 
report" and all its --sort options to get various high-level and 
low-level details. Oprofile users will find this workflow familiar.

On the lowest level, 'perf annotate' will annotate the source code 
alongside profiling information and assembly code:

 $ perf annotate decode_tree_entry

------------------------------------------------
 Percent |	Source code & Disassembly of /home/mingo/git/git
------------------------------------------------
         :
         :	/home/mingo/git/git:     file format elf64-x86-64
         :
         :
         :	Disassembly of section .text:
         :
         :	00000000004a0da0 <decode_tree_entry>:
         :		*modep = mode;
         :		return str;
         :	}
         :
         :	static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
         :	{
    3.82 :	  4a0da0:	41 54                	push   %r12
         :		const char *path;
         :		unsigned int mode, len;
         :
         :		if (size < 24 || buf[size - 21])
    0.17 :	  4a0da2:	48 83 fa 17          	cmp    $0x17,%rdx
         :		*modep = mode;
         :		return str;
         :	}
         :
         :	static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
         :	{
    0.00 :	  4a0da6:	49 89 fc             	mov    %rdi,%r12
    0.00 :	  4a0da9:	55                   	push   %rbp
    3.37 :	  4a0daa:	53                   	push   %rbx
         :		const char *path;
         :		unsigned int mode, len;
         :
         :		if (size < 24 || buf[size - 21])
    0.08 :	  4a0dab:	76 73                	jbe    4a0e20 <decode_tree_entry+0x80>
    0.00 :	  4a0dad:	80 7c 16 eb 00       	cmpb   $0x0,-0x15(%rsi,%rdx,1)
    3.48 :	  4a0db2:	75 6c                	jne    4a0e20 <decode_tree_entry+0x80>
         :	static const char *get_mode(const char *str, unsigned int *modep)
         :	{
         :		unsigned char c;
         :		unsigned int mode = 0;
         :
         :		if (*str == ' ')
    1.94 :	  4a0db4:	0f b6 06             	movzbl (%rsi),%eax
    0.39 :	  4a0db7:	3c 20                	cmp    $0x20,%al
    0.00 :	  4a0db9:	74 65                	je     4a0e20 <decode_tree_entry+0x80>
         :			return NULL;
         :
         :		while ((c = *str++) != ' ') {
    0.06 :	  4a0dbb:	89 c2                	mov    %eax,%edx
         :			if (c < '0' || c > '7')
    1.99 :	  4a0dbd:	31 ed                	xor    %ebp,%ebp
         :		unsigned int mode = 0;
         :
         :		if (*str == ' ')
         :			return NULL;
         :
         :		while ((c = *str++) != ' ') {
    1.74 :	  4a0dbf:	48 8d 5e 01          	lea    0x1(%rsi),%rbx
         :			if (c < '0' || c > '7')
    0.00 :	  4a0dc3:	8d 42 d0             	lea    -0x30(%rdx),%eax
    0.17 :	  4a0dc6:	3c 07                	cmp    $0x7,%al
    0.00 :	  4a0dc8:	76 0d                	jbe    4a0dd7 <decode_tree_entry+0x37>
    0.00 :	  4a0dca:	eb 54                	jmp    4a0e20 <decode_tree_entry+0x80>
    0.00 :	  4a0dcc:	0f 1f 40 00          	nopl   0x0(%rax)
   16.57 :	  4a0dd0:	8d 42 d0             	lea    -0x30(%rdx),%eax
    0.14 :	  4a0dd3:	3c 07                	cmp    $0x7,%al
    0.00 :	  4a0dd5:	77 49                	ja     4a0e20 <decode_tree_entry+0x80>
         :				return NULL;
         :			mode = (mode << 3) + (c - '0');
    3.12 :	  4a0dd7:	0f b6 c2             	movzbl %dl,%eax
         :		unsigned int mode = 0;
         :
         :		if (*str == ' ')
         :			return NULL;
         :
         :		while ((c = *str++) != ' ') {
    0.00 :	  4a0dda:	0f b6 13             	movzbl (%rbx),%edx
   16.74 :	  4a0ddd:	48 83 c3 01          	add    $0x1,%rbx
         :			if (c < '0' || c > '7')
         :				return NULL;
         :			mode = (mode << 3) + (c - '0');
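
The source lines interleaved with the disassembly above belong to a 
small helper that got inlined into decode_tree_entry(). Pieced 
together from those fragments alone, it reads roughly as follows. 
The two hottest entries in the listing (16.57% and 16.74%) appear 
to fall on the range check and the pointer increment of its loop. 
The listing marks the function static; that is dropped here only so 
the sketch can be called from another file.

```c
#include <stddef.h>

/* Reassembled from the annotated listing: parse an octal mode that
 * is terminated by a space. Returns a pointer just past the space,
 * or NULL if the input is malformed. */
const char *get_mode(const char *str, unsigned int *modep)
{
	unsigned char c;
	unsigned int mode = 0;

	if (*str == ' ')
		return NULL;

	while ((c = *str++) != ' ') {
		if (c < '0' || c > '7')
			return NULL;
		mode = (mode << 3) + (c - '0');	/* append one octal digit */
	}
	*modep = mode;
	return str;
}
```

For an input such as "100644 README" this stores octal 100644 in 
*modep and returns a pointer to "README".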


Those who already use Git will (hopefully) find 'perf' intuitive, as 
we've picked up a number of internal libraries from Git to build 
this tool, so the look-and-feel will be familiar. It's very 
extensible: new subcommands can be added easily, while there's just 
a single new binary in the system.

'perf report' supports multi-key histograms and a rich set of views 
of the same performance data - per task or per DSO, or a fine-grained 
per symbol view (and all permutations of these keys).

Most of the user-visible action in -v8 was in the tooling, but the 
kernel side code has been revamped all around as well:

   - Sampling support for inherited counters

   - Performance optimizations to lazy-switch PMU contexts

   - Enhanced PowerPC and x86 support.

   - Generic tracepoints can be used via perfcounters too

   - Fixed-frequency, auto-sampling counters. (they can be used via 
     the '-F' option in perf record and perf top.)

   - Generic "hardware cache" event enumeration method - for those 
     who want more than just a handful of essential hardware
     counters.

   - Automatic "fool-proof" event-throttling code to protect against
     accidentally too short sampling periods.

   - The 'raw events' configuration space has been extended -
     every event type that oprofile is able to handle can be 
     specified via raw perfcounter events as well.

   - ... and lots of other changes.
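
The fixed-frequency mode in the list above asks for a number of 
samples per second rather than one sample per N events, which 
implies that the sampling period has to follow the observed event 
rate. A minimal sketch of that arithmetic follows. The helper name 
period_for_freq() and its parameters are hypothetical, and this is 
not the kernel's adjustment code, which is not described in this 
announcement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given that `events` events were counted over
 * the last `interval_ns` nanoseconds, return the period (events per
 * sample) that yields roughly `target_hz` samples per second. */
uint64_t period_for_freq(uint64_t events, uint64_t interval_ns,
			 uint64_t target_hz)
{
	assert(interval_ns > 0 && target_hz > 0);

	/* Observed event rate in events per second. */
	double rate = (double)events * 1e9 / (double)interval_ns;

	/* One sample every `period` events gives ~target_hz samples/s. */
	uint64_t period = (uint64_t)(rate / (double)target_hz);

	/* A period of 0 is meaningless; sample every event at most. */
	return period ? period : 1;
}
```

For example, 3,000,000 events observed in one millisecond with a 
target of 1000 samples per second works out to a period of 
3,000,000 events. A real implementation would presumably also 
smooth the adjustment over several intervals.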

To try/test/check this code, the latest perfcounters tree can be 
pulled/cloned from:

   git pull \
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \
		perfcounters/core

Or the following patch can be applied to the latest 
(v2.6.30-rc8-git3) upstream -git Linux kernel:

   http://redhat.com/~mingo/perfcounters/perfcounters-v8-v2....

The 'perf' utility can be built by pulling that tree and by doing:

  cd tools/perf/
  make
  make install

( The combo patch is too large to be posted to lkml - and all the 
  v7->v8 patches have been posted to lkml already. )

As usual, test feedback, patches, comments and suggestions are 
welcome!

	Ingo
