http://valgrind.org/docs/manual/cl-manual.html
6. Callgrind: a call graph profiler
Callgrind is a profiling
tool that can
construct a call graph for a program's run.
By default, the collected data consists of
the number of instructions executed, their relationship
to source lines, the caller/callee relationship between functions,
and the numbers of such calls.
Optionally, a cache simulator (similar to cachegrind) can produce
further information about the memory access behavior of the
application.
The profile data is
written out to a file at program
termination. For presentation of the data, and interactive control
of the profiling, two command line tools are provided:
- callgrind_annotate:
This command reads in the profile data, and prints a sorted list of
functions, optionally with source annotation.
For graphical visualization of the data, try KCachegrind, which is a
KDE/Qt based GUI that makes it easy to navigate the large amount of
data that Callgrind produces.
- callgrind_control:
This command enables you to interactively observe and control the
status of currently running applications, without stopping the
application. You can get statistics information as well as the current
stack trace, and you can request zeroing of counters or dumping of
profile data.
To use Callgrind, you must
specify --tool=callgrind on the
Valgrind command line.
Cachegrind collects flat
profile data: event counts (data reads,
cache misses, etc.) are attributed directly to the function they
occurred in. This cost attribution mechanism is
called self or exclusive
attribution.
Callgrind extends this
functionality by propagating costs
across function call boundaries. If function foo calls bar, the costs
from bar are added into foo's costs. When applied to the program as a
whole,
this builds up a picture of so called inclusive
costs, that is, where the cost of each function includes the costs of
all functions it called, directly or indirectly.
As an example, the
inclusive cost of
main should be almost 100 percent
of the total program cost. Because of costs arising before main is run, such as
initialization of the run time linker and construction of global C++
objects, the inclusive cost of main
is not exactly 100 percent of the total program cost.
Together with the call
graph, this allows you to find the
specific call chains starting from
main in which the majority of the
program's costs occur. Caller/callee cost attribution is also useful
for profiling functions called from multiple call sites, and where
optimization opportunities depend on changing code in the callers, in
particular by reducing the call count.
Callgrind's cache
simulation is based on the Cachegrind
tool. Read Cachegrind's
documentation first.
The material below describes the features supported in addition to
Cachegrind's features.
Callgrind's ability to
detect function calls and returns depends
on the instruction set of the platform it is run on. It works best
on x86 and amd64, and unfortunately currently does not work so well
on PowerPC code. This is because there are no explicit call or return
instructions in the PowerPC instruction set, so Callgrind has to rely
on heuristics to detect calls and returns.
As with Cachegrind, you
probably want to compile with debugging info (the -g flag), but with
optimization turned on.
To start a profile run for
a program, execute:
valgrind --tool=callgrind [callgrind options] your-program [program options]
While the simulation is
running, you can observe execution with
callgrind_control -b
This will print out the
current backtrace. To annotate the backtrace with event counts, run
callgrind_control -e -b
After program termination,
a profile data file named callgrind.out.<pid>
is generated, where pid is the
process ID of the program being profiled. The data file contains
information about the calls made in the program among the functions
executed, together with events of type Instruction
Read Accesses (Ir).
To generate a
function-by-function summary from the profile data file, use
callgrind_annotate [options] callgrind.out.<pid>
This summary is similar to the output you get from a Cachegrind run
with cg_annotate: the list of functions is ordered by exclusive cost,
and only exclusive costs are shown. The following two options are
important for the additional features of Callgrind:
- --inclusive=yes: Instead of using exclusive cost of functions as the
sorting order, use and show inclusive cost.
- --tree=both: Interleave into the top level list of functions,
information on the callers and the callees of each function. In these
lines, which represent executed calls, the cost gives the number of
events spent in the call. Indented, above each function, there is the
list of callers, and below, the list of callees. The sum of events in
calls to a given function (caller lines), as well as the sum of events
in calls from the function (callee lines) together with the self cost,
gives the total inclusive cost of the function.
Use --auto=yes to get annotated source code for all relevant functions
for which the source can be found. In addition to source annotation as
produced by cg_annotate, you will see the annotated call sites with
call counts. For all other options, consult the (Cachegrind)
documentation for cg_annotate.
For a better call graph browsing experience, it is highly recommended
to use KCachegrind. If your code has a significant fraction
of its cost in cycles (sets of
functions calling each other in a recursive manner), you have to use
KCachegrind, as callgrind_annotate
currently does not do any cycle detection, which is important to get
correct results in this case.
If you are additionally
interested in measuring the cache behavior of your program, use
Callgrind with the option --simulate-cache=yes.
However, expect a further slowdown of approximately a factor of 2.
If the program section you
want to profile is somewhere in the middle of the run, it is beneficial
to fast forward to this section
without any profiling, and then switch on profiling. This is achieved
by using the command line option --instr-atstart=no
and running, in a shell, callgrind_control
-i on just before the interesting code section is executed. To
exactly specify the code position where profiling should start, use the
client request CALLGRIND_START_INSTRUMENTATION .
If you want to be able to
see assembly code level annotation, specify --dump-instr=yes .
This will produce profile data at instruction granularity. Note that
the resulting profile data can only be viewed with KCachegrind. For
assembly annotation, it also is interesting to see more details of the
control flow inside of functions, i.e. (conditional) jumps. This will
be collected by further specifying --collect-jumps=yes.
6.2.1. Multiple profiling dumps from one program run
Sometimes you are not
interested in characteristics of a full program run, but only of a
small part of it, for example execution of one algorithm. If there are
multiple algorithms, or one algorithm running with different input
data, it may even be useful to get different profile information for
different parts of a single program run.
Profile data files have
names of the form
callgrind.out.pid.part-threadID
where pid is the PID of the running
program, part is a number
incremented on each dump (".part" is skipped for the dump at program
termination), and threadID is a
thread identification ("-threadID" is only used if you request dumps of
individual threads with --separate-threads=yes ).
There are different ways
to generate multiple profile dumps while a program is running under
Callgrind's supervision. Nevertheless, all methods trigger the same
action, which is "dump all profile information since the last dump or
program start, and zero cost counters afterwards". To allow for zeroing
cost counters without dumping, there is a second action "zero all cost
counters now". The different methods are:
- Dump on program termination. This method is the standard way and
doesn't need any special action on your part.
- Spontaneous, interactive dumping. Use
callgrind_control -d [hint [PID/Name]]
to request the dumping of profile information of the supervised
application with PID or Name. hint is an arbitrary string you can
optionally specify to later be able to distinguish profile dumps. The
control program will not terminate before the dump is completely
written. Note that the application must be actively running for
detection of the dump command. So, for a GUI application, resize the
window, or for a server, send a request.
If you are using KCachegrind for browsing of profile information, you
can use the toolbar button Force dump. This will request a dump and
trigger a reload after the dump is written.
- Periodic dumping after execution of a specified number of basic
blocks. For this, use the command line option --dump-every-bb=count.
- Dumping at enter/leave of specified functions. Use the options
--dump-before=function and --dump-after=function.
To zero cost counters before entering a function, use
--zero-before=function.
You can specify these options multiple times for different functions.
Function specifications support wildcards: e.g. use
--dump-before='foo*' to generate dumps before entering any function
starting with foo.
- Program controlled dumping. Insert CALLGRIND_DUMP_STATS; at the
position in your code where you want a profile dump to happen. Use
CALLGRIND_ZERO_STATS; to only zero profile counters. See Client
request reference for more information on Callgrind specific client
requests.
If you are running a
multi-threaded application and specify the command line option --separate-threads=yes ,
every thread will be profiled on its own and will create its own
profile dump. Thus, the last two methods will only generate one dump of
the currently running thread. With the other methods, you will get
multiple dumps (one for each thread) on a dump request.
6.2.2. Limiting the range of collected events
For aggregating events
(function enter/leave, instruction execution, memory access) into event
numbers, first, the events must be recognizable by Callgrind, and
second, the collection state must be switched on.
Event collection is only
possible if instrumentation for
program code is switched on. This is the default, but for faster
execution (identical to valgrind
--tool=none ), it can be switched off until the program reaches a
state in which you want to start collecting profiling data. Callgrind
can start without instrumentation by specifying option --instr-atstart=no .
Instrumentation can be switched on interactively with
callgrind_control -i on
and off by specifying "off" instead of "on". Furthermore, the
instrumentation state can be programmatically changed with the macros
CALLGRIND_START_INSTRUMENTATION; and CALLGRIND_STOP_INSTRUMENTATION;.
In addition to enabling
instrumentation, you must also enable event collection for the parts of
your program you are interested in. By default, event collection is
enabled everywhere. You can limit collection to a specific function by
using --toggle-collect=function .
This will toggle the collection state on entering and leaving the
specified functions. When this option is in effect, the default
collection state at program start is "off". Only events happening while
running inside of the given function will be collected. Recursive calls
of the given function do not trigger any action.
It is important to note
that with instrumentation switched off, the cache simulator cannot see
any memory access events, and thus, any simulated cache state will be
frozen and wrong without instrumentation. Therefore, to get useful
cache events (hits/misses) after switching on instrumentation, the
cache first must warm up, probably leading to many cold misses which would not have
happened in reality. If you do not want to see these, start event
collection a few million instructions after you have switched on
instrumentation.
6.2.3. Avoiding cycles
Informally speaking, a cycle is a group of functions which call each
other in a recursive way.
Formally speaking, a cycle
is a nonempty set S of functions, such that for every pair of functions
F and G in S, it is possible to call from F to G (possibly via
intermediate functions) and also from G to F. Furthermore, S must be
maximal -- that is, be the largest set of functions satisfying this
property. For example, if a third function H is called from inside S
and calls back into S, then H is also part of the cycle and should be
included in S.
Recursion is quite usual
in programs, and therefore, cycles sometimes appear in the call graph
output of Callgrind. However, the title of this chapter should raise
two questions: What is bad about cycles which makes you want to avoid
them? And: How can cycles be avoided without changing program code?
Cycles are not bad in themselves, but they tend to make performance
analysis of your code harder. This is because inclusive costs for
calls inside of a cycle are meaningless. The definition of inclusive
cost, i.e. self cost of a function plus inclusive cost of its callees,
needs a topological order among functions. For cycles, this does not
hold true: callees of a function in a cycle include the function
itself. Therefore, KCachegrind does cycle detection and skips
visualization of any inclusive cost for calls inside of cycles.
Further, all functions in a cycle are collapsed into artificial
functions with names like Cycle 1.
Now, when a program exposes really big cycles (as is true for some GUI
code, or in general for code using an event or callback based
programming style), you lose the ability to pinpoint the bottlenecks
by following call chains from main(), guided via inclusive cost. In
addition, KCachegrind loses its ability to show interesting parts of
the call graph, as it uses inclusive costs to cut off uninteresting
areas.
Despite the meaninglessness of inclusive costs in cycles, the big
drawback for visualization motivates the possibility to temporarily
switch off cycle detection in KCachegrind, even though this can lead
to misleading visualization. However, cycles often appear because of
an unlucky superposition of independent call chains in a way that
makes the profile result see a cycle. Neglecting uninteresting calls
with very small measured inclusive cost would break these cycles. In
such cases, incorrect handling of cycles by not detecting them still
gives meaningful profiling visualization.
It has to be noted that
currently, callgrind_annotate
does not do any cycle detection at all. For program executions with
function recursion, it can, for example, print nonsensical inclusive
costs way above 100%.
After describing why
cycles are bad for profiling, it is worth talking about cycle
avoidance. The key insight here is that symbols in the profile data do
not have to exactly match the symbols found in the program. Instead,
the symbol name could encode additional information from the current
execution context such as recursion level of the current function, or
even some part of the call chain leading to the function. While
encoding of additional information into symbols is quite capable of
avoiding cycles, it has to be used carefully to not cause symbol
explosion. The latter imposes large memory requirements for Callgrind
with possible out-of-memory conditions, and big profile data files.
A further possibility to
avoid cycles in Callgrind's profile data output is to simply leave out
given functions in the call graph. Of course, this also skips any call
information from and to an ignored function, and thus can break a
cycle. Candidates for this typically are dispatcher functions in event
driven code. The option to ignore calls to a function is --fn-skip=function .
Aside from possibly breaking cycles, this is used in Callgrind to skip
trampoline functions in the PLT sections for calls to functions in
shared libraries. You can see the difference if you profile with --skip-plt=no .
If a call is ignored, its cost events will be propagated to the
enclosing function.
If you have a recursive
function, you can distinguish the first 10 recursion levels by
specifying --separate-recs10=function .
Or for all functions with --separate-recs=10 ,
but this will give you much bigger profile data files. In the profile
data, you will see the recursion levels of "func" as the different
functions with names "func", "func'2", "func'3" and so on.
If you have call chains "A
> B > C" and "A > C > B" in your program, you usually get a
"false" cycle "B <> C". Use --separate-callers2=B
--separate-callers2=C ,
and functions "B" and "C" will be treated as different functions
depending on the direct caller. Using the apostrophe for appending this
"context" to the function name, you get "A > B'A > C'B" and "A
> C'A > B'C", and there will be no cycle. Use --separate-callers=2
to get a 2-caller dependency for all functions. Note that doing this
will increase the size of profile data files.
If your program forks, the child will inherit all the profiling data
that has been gathered for the parent. To start with empty profile
counter values in the child, the client request CALLGRIND_ZERO_STATS;
can be inserted into code to be executed by the child, directly after
fork().
However, you will have to
make sure that the output file format string (controlled by --callgrind-out-file ) does contain %p (which is true by default). Otherwise, the
outputs from the parent and child will overwrite each other or will be
intermingled, which almost certainly is not what you want.
You will be able to
control the new child independently from the parent via callgrind_control .
6.3. Command line option reference
In the following, options are grouped into classes, in the same order
as
the output of callgrind --help .
Some options allow the specification of a function/symbol name, such as
--dump-before=function ,
or
--fn-skip=function .
All these options
can be specified multiple times for different functions.
In addition, the function specifications actually are patterns by
supporting
the use of wildcards '*' (zero or more arbitrary characters) and '?'
(exactly one arbitrary character), similar to file name globbing in the
shell. This feature is important especially for C++, as without
wildcard
usage, the function would have to be specified in full extent,
including
parameter signature.
6.3.1. Miscellaneous options
--help
-
Show summary of
options. This is a short version of this manual section.
--version
-
Show version of
callgrind.
6.3.2. Dump creation options
These options influence the name and format of the profile data files.
-
--callgrind-out-file=<file>
-
Write the profile data
to file rather than to the default
output file, callgrind.out.<pid> .
The %p and %q format specifiers can be used to embed the process ID
and/or the contents of an environment variable in the name, as is the
case for the core option --log-file. When multiple dumps are made, the
file name is modified further; see below.
-
--dump-instr=<no|yes>
[default: no]
-
This specifies that
event counting should be performed at per-instruction granularity. This
allows for assembly code annotation. Currently the results can only be
displayed by KCachegrind.
-
--dump-line=<no|yes>
[default: yes]
-
This specifies that
event counting should be performed at source line granularity. This
allows source annotation for sources which are compiled with debug
information ("-g").
-
--compress-strings=<no|yes> [default: yes]
-
This option influences the output format of the profile data. It
specifies whether strings (file and function names) should be
identified by numbers. This shrinks the file, but makes it more
difficult for humans to read (which is not recommended anyway).
-
--compress-pos=<no|yes>
[default: yes]
-
This option influences the output format of the profile data. It
specifies whether numerical positions are always specified as absolute
values or are allowed to be relative to previous numbers. This shrinks
the file size.
-
--combine-dumps=<no|yes>
[default: no]
-
When multiple profile
data parts are to be generated, these parts are appended to the same
output file if this option is set to "yes". Not recommended.
6.3.3. Activity options
These options specify when actions relating to event counts are to
be executed. For interactive control, use callgrind_control.
-
--dump-every-bb=<count>
[default: 0, never]
-
Dump profile data every <count> basic blocks. Whether a dump is
needed is only checked when Valgrind's internal scheduler is run.
Therefore, the minimum useful setting is about 100000. The count is a
64-bit value to make long dump periods possible.
-
--dump-before=<function>
-
Dump when entering
<function>
-
--zero-before=<function>
-
Zero all costs when
entering <function>
-
--dump-after=<function>
-
Dump when leaving
<function>
6.3.4. Data collection options
These options specify when events are to be aggregated into event
counts.
Also see Limiting
range of event collection.
-
--instr-atstart=<yes|no>
[default: yes]
-
Specify if you want
Callgrind to start simulation and profiling from the beginning of the
program. When set to no , Callgrind
will not be able to collect any information, including calls, but it
will have at most a slowdown of around 4, which is the minimum Valgrind
overhead. Instrumentation can be interactively switched on via callgrind_control -i on .
Note that the resulting call graph will most probably not contain
main, but will contain all the functions executed after
instrumentation was switched on. Instrumentation can also be
programmatically switched on/off. See the Callgrind include file
<callgrind.h> for the macros you have to use in your source code.
For cache simulation,
results will be less accurate when switching on instrumentation later
in the program run, as the simulator starts with an empty cache at that
moment. Switch on event collection later to cope with this error.
-
--collect-atstart=<yes|no> [default: yes]
-
Specify whether event
collection is switched on at beginning of the profile run.
To only look at parts
of your program, you have two possibilities:
- Zero event counters before entering the program part you want to
profile, and dump the event counters to a file after leaving that
program part.
- Switch on/off collection state as needed to only see event counters
happening while inside of the program part you want to profile.
The second option can be used if the program part you want to profile
is called many times. Option 1, i.e. creating a lot of dumps, is not
practical here.
Collection state can be toggled at entry and exit of a given function
with the option --toggle-collect. If you use this flag, collection
state should be switched off at the beginning. Note that the
specification of --toggle-collect implicitly sets
--collect-atstart=no.
Collection state can also be toggled by inserting the client request
CALLGRIND_TOGGLE_COLLECT; at the needed code positions.
-
--toggle-collect=<function>
-
Toggle collection on
entry/exit of <function>.
-
--collect-jumps=<no|yes>
[default: no]
-
This specifies whether
information for (conditional) jumps should be collected. As above,
callgrind_annotate currently is not able to show you the data. You have
to use KCachegrind to get jump arrows in the annotated code.
6.3.5. Cost entity separation
options
These options specify how event counts should be attributed to
execution
contexts.
For example, they specify whether the recursion level or the
call chain leading to a function should be taken into account, and
whether the thread ID should be considered.
Also see Avoiding
cycles.
-
--separate-threads=<no|yes> [default: no]
-
This option specifies
whether profile data should be generated separately for every thread.
If yes, the file names get "-threadID" appended.
-
--separate-recs=<level>
[default: 2]
-
Separate function
recursions by at most <level> levels. See Avoiding
cycles.
-
--separate-callers=<callers> [default: 0]
-
Separate contexts by
at most <callers> functions in the call chain. See Avoiding
cycles.
-
--skip-plt=<no|yes>
[default: yes]
-
Ignore calls to/from
PLT sections.
-
--fn-skip=<function>
-
Ignore calls to/from a
given function. E.g. if you have a call chain A > B > C, and you
specify function B to be ignored, you will only see A > C.
This is very convenient to skip functions handling callback behaviour.
For example, with the signal/slot mechanism in the Qt graphics
library, you only want to see the function emitting a signal to call
the slots connected to that signal. First, determine the real call
chain to see which functions need to be skipped, then use this option.
-
--fn-group<number>=<function>
-
Put a function into a
separate group. This influences the context name for cycle avoidance.
All functions inside such a group are treated as being the same for
context name building, which resembles the call chain leading to a
context. By specifying function groups with this option, you can
shorten the context name, as functions in the same group will not
appear in sequence in the name.
-
--separate-recs<number>=<function>
-
Separate
<number> recursions for <function>. See Avoiding
cycles.
-
--separate-callers<number>=<function>
-
Separate
<number> callers for <function>. See Avoiding
cycles.
6.3.6. Cache simulation options
-
--simulate-cache=<yes|no>
[default: no]
-
Specify if you want to
do full cache simulation. By default, only instruction read accesses
will be profiled.
-
--simulate-hwpref=<yes|no> [default: no]
-
Specify whether simulation of a hardware prefetcher should be added
which is able to detect stream access in the second level cache by
comparing accesses to separate memory pages. As the simulation can not
decide about any timing issues of prefetching, it is assumed that any
hardware prefetch triggered succeeds before a real access is done.
Thus, this gives a best-case scenario by covering all possible stream
accesses.
6.4. Callgrind specific client requests
In Valgrind terminology, a client request is a C macro which can be
inserted into your code to request specific functionality when run
under Valgrind. For this, special instruction patterns are used which
result in NOPs when run natively, but which can be detected by
Valgrind.
Callgrind provides the
following specific client requests.
To use them, add the line
#include <valgrind/callgrind.h>
into your code for the macro definitions.
-
CALLGRIND_DUMP_STATS
-
Force generation of a
profile dump at specified position in code, for the current thread
only. Written counters will be reset to zero.
-
CALLGRIND_DUMP_STATS_AT(string)
-
Same as CALLGRIND_DUMP_STATS, but allows you to specify a string to be
able to distinguish profile dumps.
-
CALLGRIND_ZERO_STATS
-
Reset the profile
counters for the current thread to zero.
-
CALLGRIND_TOGGLE_COLLECT
-
Toggle the collection state. This allows you to ignore events with
regard to profile counters. See also options --collect-atstart and
--toggle-collect.
-
CALLGRIND_START_INSTRUMENTATION
-
Start full Callgrind instrumentation if not already switched on. When
cache simulation is done, this will flush the simulated cache and lead
to an artificial cache warmup phase afterwards with cache misses which
would not have happened in reality. See also option --instr-atstart.
-
CALLGRIND_STOP_INSTRUMENTATION
-
Stop full Callgrind instrumentation if not already switched off. This
flushes Valgrind's translation cache, and does no additional
instrumentation afterwards: it effectively will run at the same speed
as the "none" tool, i.e. at minimal slowdown. Use this to speed up the
Callgrind run for uninteresting code parts. Use
CALLGRIND_START_INSTRUMENTATION to switch on instrumentation again.
See also option --instr-atstart.