Tracing: no shortage of options

By Jonathan Corbet
July 22, 2008

Three weeks ago, LWN looked at the renewed interest in dynamic tracing, with an emphasis on SystemTap. Tracing is a perennial presence on end-user wishlists; it remains a handy tool for companies like Sun Microsystems, which wish to show that their offerings (Solaris, for example) are superior to Linux. It is not surprising that there is a lot of interest in tracing implementations for Linux; the main surprise is that, after all this time, Linux still does not have a top-quality answer to DTrace - though, arguably, Linux had a working tracing mechanism long before DTrace made its appearance.

Even a casual reader of the kernel mailing list will have noticed that there are a lot of tracing-related patches in circulation at the moment. There are so many, in fact, that it is hard to keep track of them all. So this article will take a quick look at the code which has been posted in an attempt to make the various options a bit clearer.

SystemTap

SystemTap remains the presumptive Linux tracing solution of choice. It is hampered by a few problems, though, including usability issues, a complete lack of static trace points in the mainline kernel, and no user-space tracing capability. On the usability side, we are seeing a few more kernel developers trying to put SystemTap to work and posting about the problems they are having. If one takes as a working hypothesis the notion that, if kernel hackers cannot make SystemTap work, many other users are likely to encounter difficulties as well, then one might conclude that addressing the reported problems would be a priority for the SystemTap developers.

The SystemTap developers do seem to be interested in these reports, which is a good sign. There are other things happening in the SystemTap arena, including the release of version 0.7 on July 15. This release adds a number of new features and tapsets, and a substantial set of examples as well. Meanwhile, Anup Shan has posted an interesting integration of SystemTap and the fault injection framework, allowing tapsets to control fault injection and trace the results.

James Bottomley has been playing some with the SystemTap code; one result of that work is changes to SystemTap's internal relocation code in an attempt to make it more acceptable for mainline kernel inclusion. There can be no doubt that the out-of-tree nature of much of the SystemTap support code has made it harder for that code to progress, so any improvement which makes it more likely that some of this code will be merged is welcome.

Also by James is this patch implementing a new way to put markers into the kernel. The addition of markers (or static tracepoints) has always been problematic in that many of these markers, by their nature, need to go into some of the hottest code paths in the kernel. To support dynamic tracing, these markers need to be available on production systems, so they must work without creating any significant performance regressions. Quite a bit of work has gone into the static marker code which is in the kernel (but mostly unused) now, but some developers are still uncomfortable with putting them into performance-critical paths.

James's patch addresses these concerns by putting the tracepoints entirely outside of the code paths. Rather than add some sort of marker to the code, these markers just make a note of just where in the code the marker is supposed to be; this note is stored in a separate part of the kernel binary. That information is enough for a run-time tool to patch in an actual jump to a tracing function should somebody want to see the information from that tracepoint. An additional benefit is that these markers do not interfere with any optimizations done by the compiler. Other solutions can insert optimization barriers which, while they do make life easier for the tracing subsystem, also affect the speed of the code even when the trace points are not active.

Ftrace

The text above said that the kernel's static tracepoint code is "mostly unused." That would have been better expressed as "completely," except that the 2.6.27 kernel will include a user in the form of the ftrace framework. One of the things which makes ftrace truly unique is that its documentation was not only merged before the code itself, but well before: the 2.6.26 kernel includes the excellent Documentation/ftrace.txt file.

The ftrace (which stands for "function tracer") framework is one of the many improvements to come out of the realtime effort. Unlike SystemTap, it does not attempt to be a comprehensive, scriptable facility; ftrace is much more oriented toward simplicity. There is a set of virtual files in a debugfs directory which can be used to enable specific tracers and see the results. The function tracer after which ftrace is named simply outputs each function called in the kernel as it happens. Other tracers look at wakeup latency, events enabling and disabling interrupts and preemption, task switches, etc. As one might expect, the available information is best suited for developers working on improving realtime response in Linux. The ftrace framework makes it easy to add new tracers, though, so chances are good that other types of events will be added as developers think of things they would like to look at.

Tracepoints

The kernel markers mechanism is meant to be the way that static tracepoints are inserted into the kernel. To that end, a great deal of effort went into making these markers fast; they are, for all practical purposes, a set of no-op instructions until somebody wants to turn one on, at which point the real tracing code is patched into the running kernel. Since they were merged, however, kernel markers have been the subject of a few grumbles.

In particular, kernel markers use a somewhat awkward mechanism to ensure that any arguments passed to the tracing function are interpreted correctly there. Each marker has a printk()-style format string associated with it; that string describes the type of each "argument" (a variable or _expression_ within the code being traced). When tracing code activates a marker, it will supply a function to be called when the marker is hit and a format string describing the arguments that the function expects. The marker code will ensure that both format strings match; otherwise the marker will not be enabled. The problem is that the format string requires extra work to write and is only approximate in its specification of the types involved. These strings can make it clear that a given argument is a pointer, for example, but they say nothing about what type is pointed to.

In response to various efforts to get around this issue, Mathieu Desnoyers (the original author of the kernel marker work) has proposed a new mechanism called tracepoints. They are another way of putting static trace points into the kernel, but with a simpler and more type-safe way of putting the pieces together.

With tracepoints, every trace point must be declared in a header file with a mildly ugly set of macros:

    #include <linux/tracepoint.h>

    DEFINE_TRACE(tracepoint_name,
                 TPPROTO(trace_function_prototype),
		 TPARGS(trace_function_args));

This definition will create a new tracepoint called tracepoint_name. Any function attached to that tracepoint must have a function prototype as provided in the TPPROTO() macro; the names of the associated arguments are provided with TPARGS().

Perhaps this is better understood with an example. The tracepoints patch set includes quite a few static points for use with the LTTng tracing toolkit. There is one called sched_wakeup which fires whenever the scheduler wakes up a process. It is defined with:

    DEFINE_TRACE(sched_wakeup,
	         TPPROTO(struct rq *rq, struct task_struct *p),
		 TPARGS(rq, p));

The actual insertion of the tracepoint is a line like this:

    trace_sched_wakeup(rq, p);

Note the trace_ prefix added to the supplied name. At this point in the code, a tracing function can be called with rq (the run queue of interest) and p (the process which is waking up) as parameters. Until an actual function is connected to the tracepoint, though, this declaration is essentially a no-op. Connection of a trace function is done through a call to:

    void my_sched_wakeup_tracer(struct rq *rq, struct task_struct *p);


    register_trace_sched_wakeup(my_sched_wakeup_tracer);

The register_trace_sched_wakeup() function (created as part of the DEFINE_TRACE() definition) will connect the supplied trace function to the tracepoint. The fact that the function prototype for the trace function is supplied as part of the tracepoint definition means that the compiler can perform thorough type checking; if the prototypes do not match up, compilation will fail. And that, in turn, should put an end to those embarrassing situations where turning on tracing causes the system to go down in flames.

Interestingly, tracepoints have dispensed with much of the mechanism developed to minimize the runtime impact of kernel markers; in particular, they do not use the "immediate values" code. Profiling has shown that the performance impact of tracepoints is so low that there is little value in the added complexity of runtime patching of kernel code. Still, there are signs that some kernel developers will object to the addition of tracepoints in their current form. Developers want tracing support - but not at the cost of slower performance, even if that cost is hard to measure.

Tracehook

Finally, Roland McGrath recently surfaced with the tracehook patch set. Tracehook has a rather different focus; it is, essentially, a cleanup of the way the kernel handles the ptrace() system call. The tracehook patches try to organize all of the process tracing code (much of which is architecture-dependent) into one place where it can be dealt with as a unit.

Tracehook is meant to be a first step toward the merging of a new version of the utrace code. Utrace has long been planned as the successor to the current ptrace() implementation, which has few admirers. But utrace has encountered a number of difficulties, so its path into the kernel has been slow. It disappeared from the lists entirely for a while, but a new version of the patches is said to be coming soon; Roland notes that he expects "some vigorous feedback" when that happens.

The real importance of the ptrace() rework is that it is the path toward integrated tracing of kernel- and user-space events. And that, of course, is one of the biggest features offered by DTrace which is not yet available in SystemTap. Getting user-space tracing into the kernel - especially if it could work with the tracepoints already being inserted into some applications for DTrace - would be a major step forward for Linux. A lot of people will be watching when this patch set comes around again.

Meanwhile, Roland would like to see the tracehook code merged for 2.6.27. He is late to the party, though, and this code has not done any time in linux-next. So it is not yet clear whether tracehook will go in before the merge window closes, or whether, instead, it will have to wait for 2.6.28.

In summary...

As can be seen, there is a lot happening in the area of tracing support for Linux. Tracing, it seems, is an idea whose time has come, at last. If the pieces described here can be merged and integrated into a unified framework, and if it can all be made sufficiently easy to use, the time for "DTrace envy" will come to an end. Those "ifs" are not small ones, though. There is quite a bit of work to be done yet; hopefully the current level of energy will remain until the job is done.

[linuxkernelnewbies] Tracing: no shortage of options [LWN.net]