http://valgrind.org/docs/manual/cl-manual.html
6. Callgrind: a call graph profiler
Callgrind is a profiling
tool that can
construct a call graph for a program's run.
By default, the collected data consists of
the number of instructions executed, their relationship
to source lines, the caller/callee relationship between functions,
and the numbers of such calls.
Optionally, a cache simulator (similar to cachegrind) can produce
further information about the memory access behavior of the
application.
The profile data is
written out to a file at program
termination. For presentation of the data, and interactive control
of the profiling, two command line tools are provided:
- callgrind_annotate:
This command reads in the profile data, and prints a sorted list of
functions, optionally with source annotation.
For graphical visualization of the data, try KCachegrind, which is a
KDE/Qt based GUI that makes it easy to navigate the large amount of
data that Callgrind produces.
- callgrind_control:
This command enables you to interactively observe and control the
status of currently running applications, without stopping the
application. You can get statistics information as well as the current
stack trace, and you can request zeroing of counters or dumping of
profile data.
To use Callgrind, you must
specify --tool=callgrind on the
Valgrind command line.
Cachegrind collects flat
profile data: event counts (data reads,
cache misses, etc.) are attributed directly to the function they
occurred in. This cost attribution mechanism is
called self or exclusive
attribution.
Callgrind extends this
functionality by propagating costs
across function call boundaries. If function foo calls bar, the costs
from bar are added into foo's costs. When applied to the program as a
whole,
this builds up a picture of so called inclusive
costs, that is, where the cost of each function includes the costs of
all functions it called, directly or indirectly.
As an example, the
inclusive cost of
main should be almost 100 percent
of the total program cost. Because of costs arising before main is run, such as
initialization of the run time linker and construction of global C++
objects, the inclusive cost of main
is not exactly 100 percent of the total program cost.
Together with the call
graph, this allows you to find the
specific call chains starting from
main in which the majority of the
program's costs occur. Caller/callee cost attribution is also useful
for profiling functions called from multiple call sites, and where
optimization opportunities depend on changing code in the callers, in
particular by reducing the call count.
Callgrind's cache
simulation is based on the Cachegrind
tool. Read Cachegrind's
documentation first.
The material below describes the features supported in addition to
Cachegrind's features.
Callgrind's ability to
detect function calls and returns depends
on the instruction set of the platform it is run on. It works best
on x86 and amd64, and unfortunately currently does not work so well
on PowerPC code. This is because there are no explicit call or return
instructions in the PowerPC instruction set, so Callgrind has to rely
on heuristics to detect calls and returns.
As with Cachegrind, you
probably want to compile with debugging info (the -g flag), but with
optimization turned on.
To start a profile run for
a program, execute:
valgrind --tool=callgrind [callgrind options] your-program [program options]
While the simulation is
running, you can observe execution with
callgrind_control -b
This will print out the
current backtrace. To annotate the backtrace with event counts, run
callgrind_control -e -b
After program termination,
a profile data file named callgrind.out.<pid>
is generated, where pid is the
process ID of the program being profiled. The data file contains
information about the calls made in the program among the functions
executed, together with events of type Instruction
Read Accesses (Ir).
To generate a
function-by-function summary from the profile data file, use
callgrind_annotate [options] callgrind.out.<pid>
This summary is similar to the output you get from a Cachegrind run
with cg_annotate: the list of functions is ordered by exclusive cost,
and only exclusive costs are shown. The following two options are
important for the additional features of Callgrind:
- --inclusive=yes: Instead of using exclusive cost of functions as the
sorting order, use and show inclusive cost.
- --tree=both: Interleave into the top level list of functions,
information on the callers and the callees of each function. In these
lines, which represent executed calls, the cost gives the number of
events spent in the call. Indented, above each function, there is the
list of callers, and below, the list of callees. The sum of events in
calls to a given function (caller lines), as well as the sum of events
in calls from the function (callee lines) together with the self cost,
gives the total inclusive cost of the function.
Use --auto=yes to get annotated source code for all relevant functions
for which the source can be found. In addition to source annotation as
produced by cg_annotate, you will see the annotated call sites with
call counts. For all other options, consult the (Cachegrind)
documentation for cg_annotate.
For a better call graph browsing experience, it is highly recommended
to use KCachegrind. If your code has a significant fraction
of its cost in cycles (sets of
functions calling each other in a recursive manner), you have to use
KCachegrind, as callgrind_annotate
currently does not do any cycle detection, which is important to get
correct results in this case.
If you are additionally
interested in measuring the cache behavior of your program, use
Callgrind with the option --simulate-cache=yes.
However, expect a further slowdown of approximately a factor of 2.
If the program section you
want to profile is somewhere in the middle of the run, it is beneficial
to fast forward to this section
without any profiling, and then switch on profiling. This is achieved
by using the command line option --instr-atstart=no
and running, in a shell, callgrind_control
-i on just before the interesting code section is executed. To
exactly specify the code position where profiling should start, use the
client request CALLGRIND_START_INSTRUMENTATION .
If you want to be able to
see assembly code level annotation, specify --dump-instr=yes .
This will produce profile data at instruction granularity. Note that
the resulting profile data can only be viewed with KCachegrind. For
assembly annotation, it also is interesting to see more details of the
control flow inside of functions, i.e. (conditional) jumps. This will
be collected by further specifying --collect-jumps=yes.
6.2.1. Multiple profiling dumps from one program run
Sometimes you are not
interested in characteristics of a full program run, but only of a
small part of it, for example execution of one algorithm. If there are
multiple algorithms, or one algorithm running with different input
data, it may even be useful to get different profile information for
different parts of a single program run.
Profile data files have
names of the form
callgrind.out.pid.part-threadID
where pid is the PID of the running
program, part is a number
incremented on each dump (".part" is skipped for the dump at program
termination), and threadID is a
thread identification ("-threadID" is only used if you request dumps of
individual threads with --separate-threads=yes ).
There are different ways
to generate multiple profile dumps while a program is running under
Callgrind's supervision. Nevertheless, all methods trigger the same
action, which is "dump all profile information since the last dump or
program start, and zero cost counters afterwards". To allow for zeroing
cost counters without dumping, there is a second action "zero all cost
counters now". The different methods are:
- Dump on program termination. This method is the standard way and
doesn't need any special action on your part.
- Spontaneous, interactive dumping. Use
callgrind_control -d [hint [PID/Name]]
to request the dumping of profile information of the supervised
application with PID or Name. hint is an arbitrary string you can
optionally specify to later be able to distinguish profile dumps. The
control program will not terminate before the dump is completely
written. Note that the application must be actively running for
detection of the dump command. So, for a GUI application, resize the
window, or for a server, send a request.
If you are using KCachegrind for browsing of profile information, you
can use the toolbar button Force dump. This will request a dump and
trigger a reload after the dump is written.
- Periodic dumping after execution of a specified number of basic
blocks. For this, use the command line option --dump-every-bb=count.
- Dumping at enter/leave of specified functions. Use the options
--dump-before=function and --dump-after=function.
To zero cost counters before entering a function, use
--zero-before=function.
You can specify these options multiple times for different functions.
Function specifications support wildcards: e.g. use
--dump-before='foo*' to generate dumps before entering any function
starting with foo.
- Program controlled dumping. Insert CALLGRIND_DUMP_STATS; at the
position in your code where you want a profile dump to happen. Use
CALLGRIND_ZERO_STATS; to only zero profile counters. See Client
request reference for more information on Callgrind specific client
requests.
If you are running a
multi-threaded application and specify the command line option --separate-threads=yes ,
every thread will be profiled on its own and will create its own
profile dump. Thus, the last two methods will only generate one dump of
the currently running thread. With the other methods, you will get
multiple dumps (one for each thread) on a dump request.
6.2.2. Limiting the range of collected events
For aggregating events
(function enter/leave, instruction execution, memory access) into event
numbers, first, the events must be recognizable by Callgrind, and
second, the collection state must be switched on.
Event collection is only
possible if instrumentation for
program code is switched on. This is the default, but for faster
execution (identical to valgrind
--tool=none ), it can be switched off until the program reaches a
state in which you want to start collecting profiling data. Callgrind
can start without instrumentation by specifying option --instr-atstart=no .
Instrumentation can be switched on interactively with
callgrind_control -i on
and off by specifying "off" instead of "on". Furthermore, the
instrumentation state can be programmatically changed with the macros
CALLGRIND_START_INSTRUMENTATION; and CALLGRIND_STOP_INSTRUMENTATION;.
In addition to enabling
instrumentation, you must also enable event collection for the parts of
your program you are interested in. By default, event collection is
enabled everywhere. You can limit collection to a specific function by
using --toggle-collect=function .
This will toggle the collection state on entering and leaving the
specified functions. When this option is in effect, the default
collection state at program start is "off". Only events happening while
running inside of the given function will be collected. Recursive calls
of the given function do not trigger any action.
It is important to note
that with instrumentation switched off, the cache simulator cannot see
any memory access events, and thus, any simulated cache state will be
frozen and wrong without instrumentation. Therefore, to get useful
cache events (hits/misses) after switching on instrumentation, the
cache first must warm up, probably leading to many cold misses which would not have
happened in reality. If you do not want to see these, start event
collection a few million instructions after you have switched on
instrumentation.
6.2.3. Avoiding cycles
Informally speaking, a cycle is a group of functions which call each
other in a recursive way.
Formally speaking, a cycle
is a nonempty set S of functions, such that for every pair of functions
F and G in S, it is possible to call from F to G (possibly via
intermediate functions) and also from G to F. Furthermore, S must be
maximal -- that is, be the largest set of functions satisfying this
property. For example, if a third function H is called from inside S
and calls back into S, then H is also part of the cycle and should be
included in S.
Recursion is quite usual
in programs, and therefore, cycles sometimes appear in the call graph
output of Callgrind. However, the title of this chapter should raise
two questions: What is bad about cycles which makes you want to avoid
them? And: How can cycles be avoided without changing program code?
Cycles are not bad in themselves, but they tend to make performance
analysis of your code harder. This is because inclusive costs for
calls inside of a cycle are meaningless. The definition of inclusive
cost, i.e. self cost of a function plus inclusive cost of its callees,
needs a topological order among functions. For cycles, this does not
hold true: callees of a function in a cycle include the function
itself. Therefore, KCachegrind does cycle detection and skips
visualization of any inclusive cost for calls inside of cycles.
Further, all functions in a cycle are collapsed into artificial
functions with names like Cycle 1.
Now, when a program exposes really big cycles (as is true for some GUI
code, or in general for code using an event or callback based
programming style), you lose the ability to pinpoint the bottlenecks
by following call chains from main(), guided via inclusive cost. In
addition, KCachegrind loses its ability to show interesting parts of
the call graph, as it uses inclusive costs to cut off uninteresting
areas.
Despite the meaninglessness of inclusive costs in cycles, the big
drawback for visualization motivates the possibility to temporarily
switch off cycle detection in KCachegrind, even though this can lead
to misleading visualization. However, cycles often appear because of
an unlucky superposition of independent call chains in a way that
makes the profile result see a cycle. Neglecting uninteresting calls
with very small measured inclusive cost would break these cycles. In
such cases, incorrect handling of cycles by not detecting them still
gives meaningful profiling visualization.
It has to be noted that
currently, callgrind_annotate
does not do any cycle detection at all. For program executions with
function recursion, it can, for example, print nonsensical inclusive
costs way above 100%.
After describing why
cycles are bad for profiling, it is worth talking about cycle
avoidance. The key insight here is that symbols in the profile data do
not have to exactly match the symbols found in the program. Instead,
the symbol name could encode additional information from the current
execution context such as recursion level of the current function, or
even some part of the call chain leading to the function. While
encoding of additional information into symbols is quite capable of
avoiding cycles, it has to be used carefully to not cause symbol
explosion. The latter imposes large memory requirements for Callgrind
with possible out-of-memory conditions, and big profile data files.
A further possibility to
avoid cycles in Callgrind's profile data output is to simply leave out
given functions in the call graph. Of course, this also skips any call
information from and to an ignored function, and thus can break a
cycle. Candidates for this typically are dispatcher functions in event
driven code. The option to ignore calls to a function is --fn-skip=function .
Aside from possibly breaking cycles, this is used in Callgrind to skip
trampoline functions in the PLT sections for calls to functions in
shared libraries. You can see the difference if you profile with --skip-plt=no .
If a call is ignored, its cost events will be propagated to the
enclosing function.
If you have a recursive
function, you can distinguish the first 10 recursion levels by
specifying --separate-recs10=function .
Or for all functions with --separate-recs=10 ,
but this will give you much bigger profile data files. In the profile
data, you will see the recursion levels of "func" as the different
functions with names "func", "func'2", "func'3" and so on.
If you have call chains "A
> B > C" and "A > C > B" in your program, you usually get a
"false" cycle "B <> C". Use --separate-callers2=B
--separate-callers2=C ,
and functions "B" and "C" will be treated as different functions
depending on the direct caller. Using the apostrophe for appending this
"context" to the function name, you get "A > B'A > C'B" and "A
> C'A > B'C", and there will be no cycle. Use --separate-callers=2
to get a 2-caller dependency for all functions. Note that doing this
will increase the size of profile data files.
If your program forks, the child will inherit all the profiling data
that has been gathered for the parent. To start with empty profile
counter values in the child, the client request CALLGRIND_ZERO_STATS;
can be inserted into code to be executed by the child, directly after
fork().
However, you will have to
make sure that the output file format string (controlled by --callgrind-out-file ) does contain %p (which is true by default). Otherwise, the
outputs from the parent and child will overwrite each other or will be
intermingled, which almost certainly is not what you want.
You will be able to
control the new child independently from the parent via callgrind_control .
6.3. Command line option reference
In the following, options are grouped into classes, in the same order
as
the output of callgrind --help .
Some options allow the specification of a function/symbol name, such as
--dump-before=function ,
or
--fn-skip=function .
All these options
can be specified multiple times for different functions.
In addition, the function specifications actually are patterns by
supporting
the use of wildcards '*' (zero or more arbitrary characters) and '?'
(exactly one arbitrary character), similar to file name globbing in the
shell. This feature is important especially for C++, as without
wildcard
usage, the function would have to be specified in full extent,
including
parameter signature.
6.3.1. Miscellaneous options
--help
-
Show summary of
options. This is a short version of this manual section.
--version
-
Show version of
callgrind.
6.3.2. Dump creation options
These options influence the name and format of the profile data files.
-
--callgrind-out-file=<file>
-
Write the profile data
to file rather than to the default
output file, callgrind.out.<pid> .
The %p and %q format specifiers can be used to embed the process ID
and/or the contents of an environment variable in the name, as is the
case for the core option --log-file. When multiple dumps are made, the
file name is modified further; see below.
-
--dump-instr=<no|yes>
[default: no]
-
This specifies that
event counting should be performed at per-instruction granularity. This
allows for assembly code annotation. Currently the results can only be
displayed by KCachegrind.
-
--dump-line=<no|yes>
[default: yes]
-
This specifies that
event counting should be performed at source line granularity. This
allows source annotation for sources which are compiled with debug
information ("-g").
-
--compress-strings=<no|yes> [default: yes]
-
This option influences the output format of the profile data. It
specifies whether strings (file and function names) should be
identified by numbers. This shrinks the file, but makes it more
difficult for humans to read (which is not recommended anyway).
-
--compress-pos=<no|yes>
[default: yes]
-
This option influences the output format of the profile data. It
specifies whether numerical positions are always specified as absolute
values or are allowed to be relative to previous numbers. This shrinks
the file size.
-
--combine-dumps=<no|yes>
[default: no]
-
When multiple profile
data parts are to be generated, these parts are appended to the same
output file if this option is set to "yes". Not recommended.
6.3.3. Activity options
These options specify when actions relating to event counts are to
be executed. For interactive control, use callgrind_control.
-
--dump-every-bb=<count>
[default: 0, never]
-
Dump profile data every <count> basic blocks. Whether a dump is
needed is only checked when Valgrind's internal scheduler is run.
Therefore, the minimum useful setting is about 100000. The count is a
64-bit value to make long dump periods possible.
-
--dump-before=<function>
-
Dump when entering
<function>
-
--zero-before=<function>
-
Zero all costs when
entering <function>
-
--dump-after=<function>
-
Dump when leaving
<function>
6.3.4. Data collection options
These options specify when events are to be aggregated into event
counts.
Also see Limiting
range of event collection.
-
--instr-atstart=<yes|no>
[default: yes]
-
Specify if you want
Callgrind to start simulation and profiling from the beginning of the
program. When set to no , Callgrind
will not be able to collect any information, including calls, but it
will have at most a slowdown of around 4, which is the minimum Valgrind
overhead. Instrumentation can be interactively switched on via callgrind_control -i on .
Note that the resulting call graph will most probably not contain
main, but will contain all the functions executed after
instrumentation was switched on. Instrumentation can also be
programmatically switched on/off. See the Callgrind include file
<callgrind.h> for the macros you have to use in your source code.
For cache simulation,
results will be less accurate when switching on instrumentation later
in the program run, as the simulator starts with an empty cache at that
moment. Switch on event collection later to cope with this error.
-
--collect-atstart=<yes|no> [default: yes]
-
Specify whether event
collection is switched on at beginning of the profile run.
To only look at parts
of your program, you have two possibilities:
- Zero event counters before entering the program part you want to
profile, and dump the event counters to a file after leaving that
program part.
- Switch on/off collection state as needed to only see event counters
happening while inside of the program part you want to profile.
The second option can be used if the program part you want to profile
is called many times. Option 1, i.e. creating a lot of dumps, is not
practical here.
Collection state can be toggled at entry and exit of a given function
with the option --toggle-collect. If you use this flag, collection
state should be switched off at the beginning. Note that the
specification of --toggle-collect implicitly sets
--collect-atstart=no.
Collection state can also be toggled by inserting the client request
CALLGRIND_TOGGLE_COLLECT; at the needed code positions.
-
--toggle-collect=<function>
-
Toggle collection on
entry/exit of <function>.
-
--collect-jumps=<no|yes>
[default: no]
-
This specifies whether
information for (conditional) jumps should be collected. As above,
callgrind_annotate currently is not able to show you the data. You have
to use KCachegrind to get jump arrows in the annotated code.
6.3.5. Cost entity separation
options
These options specify how event counts should be attributed to
execution
contexts.
For example, they specify whether the recursion level or the
call chain leading to a function should be taken into account, and
whether the thread ID should be considered.
Also see Avoiding
cycles.
-
--separate-threads=<no|yes> [default: no]
-
This option specifies
whether profile data should be generated separately for every thread.
If yes, the file names get "-threadID" appended.
-
--separate-recs=<level>
[default: 2]
-
Separate function
recursions by at most <level> levels. See Avoiding
cycles.
-
--separate-callers=<callers> [default: 0]
-
Separate contexts by
at most <callers> functions in the call chain. See Avoiding
cycles.
-
--skip-plt=<no|yes>
[default: yes]
-
Ignore calls to/from
PLT sections.
-
--fn-skip=<function>
-
Ignore calls to/from a
given function. E.g. if you have a call chain A > B > C, and you
specify function B to be ignored, you will only see A > C.
This is very convenient to skip functions handling callback behaviour.
For example, with the signal/slot mechanism in the Qt graphics
library, you only want to see the function emitting a signal to call
the slots connected to that signal. First, determine the real call
chain to see which functions need to be skipped, then use this option.
-
--fn-group<number>=<function>
-
Put a function into a
separate group. This influences the context name for cycle avoidance.
All functions inside such a group are treated as being the same for
context name building, which resembles the call chain leading to a
context. By specifying function groups with this option, you can
shorten the context name, as functions in the same group will not
appear in sequence in the name.
-
--separate-recs<number>=<function>
-
Separate
<number> recursions for <function>. See Avoiding
cycles.
-
--separate-callers<number>=<function>
-
Separate
<number> callers for <function>. See Avoiding
cycles.
6.3.6. Cache simulation options
-
--simulate-cache=<yes|no>
[default: no]
-
Specify if you want to
do full cache simulation. By default, only instruction read accesses
will be profiled.
-
--simulate-hwpref=<yes|no> [default: no]
-
Specify whether simulation of a hardware prefetcher should be added
which is able to detect stream access in the second level cache by
comparing accesses to separate memory pages. As the simulation can not
decide about any timing issues of prefetching, it is assumed that any
hardware prefetch triggered succeeds before a real access is done.
Thus, this gives a best-case scenario by covering all possible stream
accesses.
6.4. Callgrind specific client requests
In Valgrind terminology, a client request is a C macro which can be
inserted into your code to request specific functionality when run
under Valgrind. For this, special instruction patterns are used which
result in NOPs when run natively, but which can be detected by
Valgrind.
Callgrind provides the
following specific client requests.
To use them, add the line
#include <valgrind/callgrind.h>
into your code for the macro definitions.
-
CALLGRIND_DUMP_STATS
-
Force generation of a
profile dump at specified position in code, for the current thread
only. Written counters will be reset to zero.
-
CALLGRIND_DUMP_STATS_AT(string)
-
Same as CALLGRIND_DUMP_STATS, but allows you to specify a string to be
able to distinguish profile dumps.
-
CALLGRIND_ZERO_STATS
-
Reset the profile
counters for the current thread to zero.
-
CALLGRIND_TOGGLE_COLLECT
-
Toggle the collection state. This allows you to ignore events with
regard to profile counters. See also options --collect-atstart and
--toggle-collect.
-
CALLGRIND_START_INSTRUMENTATION
-
Start full Callgrind instrumentation if not already switched on. When
cache simulation is done, this will flush the simulated cache and lead
to an artificial cache warmup phase afterwards with cache misses which
would not have happened in reality. See also option --instr-atstart.
-
CALLGRIND_STOP_INSTRUMENTATION
-
Stop full Callgrind instrumentation if not already switched off. This
flushes Valgrind's translation cache, and does no additional
instrumentation afterwards: it effectively will run at the same speed
as the "none" tool, i.e. at minimal slowdown. Use this to speed up the
Callgrind run for uninteresting code parts. Use
CALLGRIND_START_INSTRUMENTATION to switch on instrumentation again.
See also option --instr-atstart.