http://lwn.net/Articles/336542/

We are pleased to announce version 8 of the performance counters subsystem for Linux.

This new subsystem adds a new system call (sys_perf_counter_open()) and provides the new 'perf' tool that makes use of these new kernel capabilities. The subsystem and the tool are new in that they try a new approach at integrating all things performance analysis under one roof.

There have been many changes since -v7 - see the shortlog below for details. There are a lot of new contributors to this code. Many thanks go to:

  Peter Zijlstra, Paul Mackerras, Robert Richter, Arnaldo Carvalho de Melo, Mike Galbraith, Thomas Gleixner, Wu Fengguang, Jaswinder Singh Rajput, Yong Wang, Frederic Weisbecker, Yinghai Lu, Luis Henriques, Eric Paris, Arjan van de Ven, Tim Blechmann, Steven Whitehouse, Jaswinder Singh, H. Peter Anvin, Hidetoshi Seto, Erdem Aktas and Andrew Morton.

The biggest change in -v8 is a refocusing of our effort towards building tools to help various user-space development workflows.

The latest code and perfcounter-tools deal with all sorts of user-space profiling usage models. They are very fast, they are able to look up DSO symbols regardless of where they are loaded, and they try to be easy to use and easy to configure.

Per-application and system-wide profiling modes are supported. A number of intermediate modes are supported as well, via inherited counters that traverse into child-task hierarchies automatically and transparently.

With perfcounters there is no daemon needed: if a perfcounters kernel is booted on a supported CPU (all AMD models and Core2 / Corei7 / Atom Intel CPUs - both 64-bit and 32-bit user-space are supported) then profiling can be done straight away.

Profiling sessions are recorded into local files, which can then be analyzed. There are also two high-level overview tools, 'perf stat' and 'perf top', which help one get a quick impression about what to profile and in what way.
New in -v8 is the 'perf' utility, which has merged all the perfcounters utilities and which exposes all the functionality of the kernel subsystem in one unified way:

  mercury:~/tip/tools/perf> perf

   usage: perf [--version] [--help] COMMAND [ARGS]

   The most commonly used perf commands are:
     annotate   Read perf.data (created by perf record) and display annotated code
     list       List all symbolic event types
     record     Run a command and record its profile into perf.data
     report     Read perf.data (created by perf record) and display the profile
     stat       Run a command and gather performance counter statistics
     top        Run a command and profile it

   See 'perf help COMMAND' for more information on a specific command.

There's also a new, separated "record + report" profiling workflow: use "perf record ./my-app" to record its profile, then use "perf report" and all its --sort options to get various high-level and low-level details. Oprofile users will find this workflow familiar.

On the lowest level, 'perf annotate' will annotate the source code alongside profiling information and assembly code:

  $ perf annotate decode_tree_entry

  ------------------------------------------------
   Percent |  Source code & Disassembly of /home/mingo/git/git
  ------------------------------------------------
           :
           :  /home/mingo/git/git:     file format elf64-x86-64
           :
           :
           :  Disassembly of section .text:
           :
           :  00000000004a0da0 <decode_tree_entry>:
           :          *modep = mode;
           :          return str;
           :  }
           :
           :  static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
           :  {
      3.82 :    4a0da0:  41 54                 push   %r12
           :          const char *path;
           :          unsigned int mode, len;
           :
           :          if (size < 24 || buf[size - 21])
      0.17 :    4a0da2:  48 83 fa 17           cmp    $0x17,%rdx
           :          *modep = mode;
           :          return str;
           :  }
           :
           :  static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
           :  {
      0.00 :    4a0da6:  49 89 fc              mov    %rdi,%r12
      0.00 :    4a0da9:  55                    push   %rbp
      3.37 :    4a0daa:  53                    push   %rbx
           :          const char *path;
           :          unsigned int mode, len;
           :
           :          if (size < 24 || buf[size - 21])
      0.08 :    4a0dab:  76 73                 jbe    4a0e20 <decode_tree_entry+0x80>
      0.00 :    4a0dad:  80 7c 16 eb 00        cmpb   $0x0,-0x15(%rsi,%rdx,1)
      3.48 :    4a0db2:  75 6c                 jne    4a0e20 <decode_tree_entry+0x80>
           :  static const char *get_mode(const char *str, unsigned int *modep)
           :  {
           :          unsigned char c;
           :          unsigned int mode = 0;
           :
           :          if (*str == ' ')
      1.94 :    4a0db4:  0f b6 06              movzbl (%rsi),%eax
      0.39 :    4a0db7:  3c 20                 cmp    $0x20,%al
      0.00 :    4a0db9:  74 65                 je     4a0e20 <decode_tree_entry+0x80>
           :                  return NULL;
           :
           :          while ((c = *str++) != ' ') {
      0.06 :    4a0dbb:  89 c2                 mov    %eax,%edx
           :                  if (c < '0' || c > '7')
      1.99 :    4a0dbd:  31 ed                 xor    %ebp,%ebp
           :          unsigned int mode = 0;
           :
           :          if (*str == ' ')
           :                  return NULL;
           :
           :          while ((c = *str++) != ' ') {
      1.74 :    4a0dbf:  48 8d 5e 01           lea    0x1(%rsi),%rbx
           :                  if (c < '0' || c > '7')
      0.00 :    4a0dc3:  8d 42 d0              lea    -0x30(%rdx),%eax
      0.17 :    4a0dc6:  3c 07                 cmp    $0x7,%al
      0.00 :    4a0dc8:  76 0d                 jbe    4a0dd7 <decode_tree_entry+0x37>
      0.00 :    4a0dca:  eb 54                 jmp    4a0e20 <decode_tree_entry+0x80>
      0.00 :    4a0dcc:  0f 1f 40 00           nopl   0x0(%rax)
     16.57 :    4a0dd0:  8d 42 d0              lea    -0x30(%rdx),%eax
      0.14 :    4a0dd3:  3c 07                 cmp    $0x7,%al
      0.00 :    4a0dd5:  77 49                 ja     4a0e20 <decode_tree_entry+0x80>
           :                          return NULL;
           :                  mode = (mode << 3) + (c - '0');
      3.12 :    4a0dd7:  0f b6 c2              movzbl %dl,%eax
           :          unsigned int mode = 0;
           :
           :          if (*str == ' ')
           :                  return NULL;
           :
           :          while ((c = *str++) != ' ') {
      0.00 :    4a0dda:  0f b6 13              movzbl (%rbx),%edx
     16.74 :    4a0ddd:  48 83 c3 01           add    $0x1,%rbx
           :                  if (c < '0' || c > '7')
           :                          return NULL;
           :                  mode = (mode << 3) + (c - '0');

Those who already use Git will (hopefully) find 'perf' intuitive, as we've picked up a number of internal libraries from Git to build this tool, so the look-and-feel will be familiar. It's very extensible: new subcommands can be added easily, while there's still just a single new binary in the system.

'perf report' supports multi-key histograms and a rich set of views of the same performance data - per task or per dso, or a fine-grained per symbol view (and all permutations of these keys).
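To make the multi-key histogram idea concrete, here is a minimal sketch in Python. It is not how 'perf report' is implemented: the sample records, the field names (task, dso, symbol) and the histogram() helper are all hypothetical, chosen only to show how one set of samples can be grouped under any permutation of keys.

```python
from collections import Counter

# Hypothetical samples, one (task, dso, symbol) tuple per recorded sample.
# Real perf.data records carry much more than this; these are made up.
SAMPLES = [
    ("git", "git", "decode_tree_entry"),
    ("git", "git", "decode_tree_entry"),
    ("git", "libc.so", "memcpy"),
    ("git", "git", "strlen_helper"),
    ("make", "libc.so", "memcpy"),
]

# Position of each (hypothetical) key field inside a sample tuple.
FIELDS = {"task": 0, "dso": 1, "symbol": 2}


def histogram(samples, keys):
    """Group samples by the given key fields.

    Returns a list of (percent, key_tuple) pairs, most frequent first.
    """
    idx = [FIELDS[k] for k in keys]
    counts = Counter(tuple(s[i] for i in idx) for s in samples)
    total = len(samples)
    return [(100.0 * n / total, key) for key, n in counts.most_common()]


if __name__ == "__main__":
    # The same samples viewed per dso, then per (task, symbol) pair.
    for keys in (["dso"], ["task", "symbol"]):
        print("keys:", ",".join(keys))
        for pct, key in histogram(SAMPLES, keys):
            print(f"  {pct:6.2f}%  {' '.join(key)}")
```

Because the grouping key is just a tuple built from whichever fields were requested, adding another view costs nothing beyond naming a different key list.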
Most of the user-visible action in -v8 was in the tooling, but the kernel-side code has been revamped all around as well:

 - Sampling support for inherited counters

 - Performance optimizations to lazy-switch PMU contexts

 - Enhanced PowerPC and x86 support.

 - Generic tracepoints can be used via perfcounters too

 - Fixed-frequency, auto-sampling counters. (They can be used via the '-F' option in perf record and perf top.)

 - Generic "hardware cache" event enumeration method - for those who want more than just a handful of essential hardware counters.

 - Automatic "fool-proof" event-throttling code to protect against accidentally too short sampling periods.

 - The 'raw events' configuration space has been extended - every event type that oprofile is able to handle can be specified via raw perfcounter events as well.

 - ... and lots of other changes.

To try/test/check this code, the latest perfcounters tree can be pulled/cloned from:

  git pull \
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \
    perfcounters/core

Or the following patch can be applied to the latest (v2.6.30-rc8-git3) upstream -git Linux kernel:

  http://redhat.com/~mingo/perfcounters/perfcounters-v8-v2....

The 'perf' utility can be built by pulling that tree and by doing:

  cd tools/perf/
  make
  make install

( The combo patch is too large to be posted to lkml - and all the v7->v8 patches have been posted to lkml already. )

As usual, test feedback, patches, comments and suggestions are welcome!

	Ingo