Kernel markers

By Jonathan Corbet
August 15, 2007

LWN's recent look at SystemTap noted that the patch set currently lacks a set of static probe points like that provided with DTrace. There are a few reasons for this difference. For example, the rate of change of the kernel code base would make the maintenance of a large set of probe points difficult, especially given that developers working on many parts of the code might not be particularly aware of - or concerned about - those points. But there is also the simple fact that the Linux kernel has no built-in mechanism for the creation of static probe points in the first place.

There is, naturally, a patch which makes the creation of probe points possible; it is called Linux kernel markers. This patch has been under development for some years. Its path into the mainline has been relatively rough, but there are signs that the worst of the roadblocks have been overcome. So perhaps a quick look at this patch is called for.

With kernel markers, the placement of a probe point is easy:

    #include <linux/marker.h>

    trace_mark(name, format_string, ...);

The name is a unique identifier which is used to access the probe; the documentation recommends a subsystem_event format, describing the subsystem in which the probe is found and the event which is being traced. For example: in a part of the patch which instruments the block subsystem, a probe placed in elv_insert(), which inserts a request into its proper location in the queue, is named blk_request_insert. The format string describes the remaining arguments, each of which will be some variable of interest at the time the trace point is hit.

Code which wants to hook into a trace point must call:

    int marker_probe_register(const char *name, const char *format,
			      marker_probe_func *probe, void *pdata);

Here, name is the name of the trace point, format is the format string describing the expected parameters from the trace point (it must match the format string provided when the trace point was established), probe() is the function to call when the trace point is hit, and pdata is a private data value to pass to probe(). The probe() function will have this prototype:

    void (*probe)(const struct __mark_marker *mdata, void *pdata,
                  const char *format, ...);

The mdata structure includes the name of the trace point, if need be, along with a formatted version of the arguments. The arguments themselves are passed after the format string.

Registration of a marker does not, yet, set up the probe() function to be called. First, the marker must be armed with:

    int marker_arm(const char *name);

Once the marker has been armed, probe() will be called every time execution arrives at the given trace point.

When probe points are no longer of interest, they can be shut down with:

    int marker_disarm(const char *name);
    void marker_probe_unregister(const char *name);

Calls to marker_arm() will nest - if a given marker has been armed three times, then three marker_disarm() calls will be required to turn it off again.

Internally, there are a lot of details to the management of markers. The code at the actual trace point, in the end, looks much like one would expect:

    if (marker_is_armed) {
        preempt_disable();
	(*probe)(...);
	preempt_enable();
    }

In reality, it is not quite so simple. Getting marker support into the kernel requires that the runtime impact of kernel markers be as close to zero as possible, especially when the marker is not armed. A common use case for markers is to investigate performance problems on systems running in production, so they have to be present in production kernels without causing performance problems themselves. Adding a test-and-jump operation to a kernel hot path will always be a hard sell; the cache effects of referencing a set of global marker state variables could also be significant.

To get around this problem, the marker code comes with a separate patch called immediate values. In the architecture-independent implementation, an immediate value just looks like any other shared variable. The purpose of immediate values, though, is to provide variables with the assumption that they will be frequently read but infrequently changed, and that the read operations must have the lowest impact possible. So, in an architecture-specific implementation (which only exists for i386 at the moment), changing an immediate value actually patches any code which reads the value. To say that the details of doing this sort of patching safely are ugly would be to understate the point. But Mathieu Desnoyers has dealt with those details, and nobody else need look at the resulting code.

Through the use of immediate values, the code inserted by trace_mark() can query the setting of a trace point without generating a memory reference at all; instead, that setting is stored directly in the inserted code. So there will be no potential for an expensive cache miss at the probe point. The patch also provides an immediate_if() construct which is intended to allow jumps to be patched directly into the code, eliminating the test altogether, but that functionality has not yet been implemented. Even without this feature, immediate values allow the creation of trace points whose runtime impact is very nearly zero, eliminating the most common objection to their existence.

If and when this code is merged, the way will be clear for the creation of a set of well-defined trace points for utilities like SystemTap and LTTng. That, in turn, could make the internal operations of the kernel more visible to system administrators and others who are not necessarily well versed in how the kernel works. This sort of tracing ability has been on many users' wish lists for some time; they might just be, finally, getting close to having that wish fulfilled.

Kernel markers

Posted Aug 16, 2007 15:58 UTC (Thu) by compudj (subscriber, #43335) [Link]

Hi,

Being the developer behind the Linux Kernel Markers, I would like to add some clarifications:

The argument "For example, the rate of change of the kernel code base would make the maintenance of a large set of probe points difficult, especially given that developers working on many parts of the code might not be particularly aware of - or concerned about - those points." should be put in context. Actually, if we think of tracing as being provided not only to Linux users, but also to kernel developers, embedded developers, etc, it makes sense to have an intrumentation set which follows the kernel source as closely as possible.

SystemTap's approach of creating an external tapset that have to be ported to each new kernel is correct for users of distribution kernels, but requires much more effort when trying to adapt the tapsets to a rapidly changing code base.

One of the advantages of putting markers at important locations in the kernel code is that they follow the code flow and use the underlying revision control system of the kernel. Moreover, the developers who are the most likely to know what trace points must be inserted and where are probably the maintainers of a subsystem themself. Therefore, it makes sense for them to be the final deciders on wether or not a trace point belongs to a kernel code path.

Second point, the immediate values optimized versions are currently implemented for i386 and powerpc.

Thanks for this thorough coverage of the state of the markers/immediate values.

Mathieu

Kernel markers

Posted Aug 16, 2007 16:22 UTC (Thu) by davecb (subscriber, #1574) [Link]

It's interesting to compare these with the mostly-dynamic but
sometimes-static approach used in dtrace: almost any function
can be traced without great effort, but a few built-in
static trace points provide higher-level semantics.

A decent explanation of that was Brian Cantrill's comment
at http://lwn.net/Articles/244646/

I speculate that developers of particular kernel areas
would be active in creating and maintaining high-level and
correlated trace information, and end users could take
advantage of them, but being able to trace on, for example,
entry to an arbitrary function would be th next-most-used
functionality.

--dave

Kernel markers

Posted Oct 2, 2007 19:35 UTC (Tue) by fuhchee (subscriber, #40059) [Link]

> If and when this code is merged, the way will be clear for the creation of a
> set of well-defined trace points for utilities like SystemTap and LTTng.

Indeed. It may be interesting to point out the systemtap script interface
to the markers - for those who don't want to write their instrumentation
in C. This one just traces the first argument. (It's already working,
which makes sense since we prototyped markers with/before Mathieu.)

probe kernel.mark("name") { println ($arg1) }

We look forward to the inclusion of markers in linux, and their gradual
utilization throughout the code base.

[linuxkernelnewbies] Kernel markers [LWN.net]

Kernel markers

Reply via email to