Anatomy of the Linux Slab Allocator

By M. Tim Jones

Good operating system performance depends in part on the operating system's ability to efficiently manage resources. In the old days, heap memory managers were the norm, but performance suffered due to fragmentation and the need for memory reclamation. Today, the Linux kernel uses a method that originated in Solaris but has been used in embedded systems for quite some time: allocating memory as objects based on their size. This article explores the ideas behind the slab allocator and examines its interfaces and their use.

Dynamic Memory Management

The goal of memory management is to provide a method by which memory can be dynamically shared among a variety of users for a variety of purposes. The memory management method should both minimize the time required to manage memory and maximize the memory available for general use.
Memory management is ultimately a zero-sum game of tradeoffs. You can develop an algorithm that uses little memory for management but takes more time to manage the available memory, or one that manages memory efficiently but uses a bit more memory for bookkeeping. In the end, the requirements of the particular application drive the balance of the tradeoffs.

Early memory managers used a heap-based allocation strategy. In this method, a large block of memory (called the heap) provides memory for user-defined purposes. When users need a block of memory, they request a given size. The heap manager searches the available memory (using a particular algorithm) and returns a block. Among the algorithms used in this search are first-fit (the first block encountered in the heap that satisfies the request) and best-fit (the block in the heap that most closely satisfies the request). When the users are finished with the memory, they return it to the heap.

The fundamental problem with this heap-based allocation strategy is fragmentation. As blocks of memory are allocated and returned in different orders and at different times, holes develop in the heap, and more time is needed to manage the free memory. This approach tends to be memory efficient (allocating only what is necessary) but requires more time to manage the heap.

Another approach, called buddy memory allocation, is a faster technique that divides memory into power-of-two partitions and attempts to satisfy requests using a best-fit approach. When memory is freed, the buddy block is checked to see whether its contiguous neighbor has also been freed. If so, the blocks are combined to minimize fragmentation. This approach tends to be more time efficient but can waste memory because of the power-of-two rounding.

This article focuses on Linux kernel memory management and, in particular, the mechanisms provided through slab allocation. The
slab allocator used in Linux is based on an algorithm first introduced
by Jeff Bonwick for the SunOS operating system. Jeff's allocator
revolves around object caching. Within a kernel, a considerable amount
of memory is allocated for a finite set of objects such as file
descriptors and other common structures. Jeff found that the amount of
time required to initialize a regular object in the kernel exceeded the
amount of time required to allocate and deallocate it. His conclusion
was that instead of freeing the memory back to a global pool, he would
have the memory remain initialized for its intended purpose. For
example, if memory is being allocated for a mutex, the mutex
initialization function (mutex_init) need be performed only once, when the memory is first allocated for the mutex. Subsequent allocations of that memory need not repeat the initialization, because the object remains in its initialized state from its previous use.

The Linux slab allocator uses these ideas and others to build a memory allocator that is efficient in both space and time.

Figure 1 illustrates the high-level organization of the slab
structures. At the highest level is the cache_chain, a linked list of the slab caches. Each element of this list is a kmem_cache structure (called a cache), which defines a pool of objects of a given size to be managed.
Figure 1: The major structures of the slab allocator

Each cache contains a list of slabs, which are contiguous blocks of memory (typically pages). Three slab lists exist:

slabs_full: slabs whose objects are all allocated
slabs_partial: slabs containing a mix of allocated and free objects
slabs_empty: slabs whose objects are all free
Note that the slabs on the slabs_empty list are prime candidates for reaping, through which their memory can be returned to the system for other uses.

Each slab in the slab list is a contiguous block of memory (one or more contiguous pages) that is divided into objects. These objects are the fundamental elements that are allocated and deallocated from the particular cache. Note that the slab is the smallest allocation unit of the slab allocator, so if a cache needs to grow, a whole slab is the minimum by which it will grow. Typically, numerous objects are allocated per slab. As
objects are allocated and deallocated from a slab, the individual slab
can move between the slab lists. For example, when all objects are
consumed in a slab, it moves from the slabs_partial list to the slabs_full list; when all of its objects are freed, it moves to the slabs_empty list.

Motivation Behind the Slab

The slab cache allocator provides a number of benefits over traditional memory management schemes. First, kernels commonly rely on small objects that are allocated many times over the lifetime of the system. The slab cache allocator caches similarly sized objects, thus avoiding the fragmentation problems that would otherwise occur. The slab allocator also supports the initialization of common objects, thus avoiding the need to repeatedly initialize an object for the same purpose. Finally, the slab allocator supports hardware cache alignment and coloring, which allows objects in different caches to occupy the same cache lines for increased cache utilization and better performance.

Now on to the application program interface (API) for creating new slab caches, adding memory to the caches, and destroying the caches, as well as the functions to allocate and deallocate objects from them. The first step is to declare your slab cache reference, which you can do statically as:

static struct kmem_cache *my_cachep;
This reference is then used by the other slab cache functions for creation, deletion, allocation, and so on. A cache is created with the kmem_cache_create kernel function:

struct kmem_cache *kmem_cache_create(
        const char *name,
        size_t size,
        size_t align,
        unsigned long flags,
        void (*ctor)(void *, struct kmem_cache *, unsigned long),
        void (*dtor)(void *, struct kmem_cache *, unsigned long) );

The name argument defines the cache name, which the proc file system uses to identify the cache. The size argument is the size of the objects created in this cache, and align defines the required alignment for each object. The flags argument specifies options for the cache; Table 1 shows a partial list. Finally, the ctor and dtor arguments define an optional constructor and destructor for the cache's objects; the constructor is invoked only as new objects are added to the cache.

Table 1. Partial list of options for kmem_cache_create flags

SLAB_RED_ZONE        Insert guard zones around the objects to detect buffer overruns
SLAB_POISON          Fill the slab with a known pattern to catch references to uninitialized objects
SLAB_HWCACHE_ALIGN   Align each object to a hardware cache line for performance
After the cache is created, its reference is returned by kmem_cache_create. When the cache is no longer needed, it is destroyed with the kmem_cache_destroy kernel function; the caller must ensure that no one attempts to allocate objects from the cache during the destroy:

void kmem_cache_destroy( struct kmem_cache *cachep );

To allocate an object from a named cache, the kmem_cache_alloc function is used:

void *kmem_cache_alloc( struct kmem_cache *cachep, gfp_t flags );

This function returns an object from the cache. Note that if the cache is currently empty, this function may grow the cache by allocating fresh pages for a new slab. The kernel function kmem_cache_zalloc is similar but also clears the object to zeroes before returning it.

To free an object back to the slab, the kmem_cache_free function is used:

void kmem_cache_free( struct kmem_cache *cachep, void *objp );

The most common memory management functions in the kernel are the kmalloc and kfree functions, whose prototypes are:

void *kmalloc( size_t size, int flags );
void kfree( const void *objp );

Note that in this case no named cache is involved: kmalloc draws from a set of general-purpose caches maintained by the slab allocator (visible in /proc/slabinfo with names such as size-32 through size-131072), selecting the smallest one that satisfies the request.

The slab cache API provides a number of other useful functions. kmem_cache_size returns the size of the objects managed by a cache, kmem_cache_name retrieves the cache's name, and kmem_cache_shrink attempts to release a cache's empty slabs back to the system:

unsigned int kmem_cache_size( struct kmem_cache *cachep );
const char *kmem_cache_name( struct kmem_cache *cachep );
int kmem_cache_shrink( struct kmem_cache *cachep );

Example Use of the Slab Cache

The
following code snippets demonstrate the creation of a new slab cache,
allocating and deallocating objects from the cache, and then destroying
the cache. To begin, the kmem_cache reference is declared and the cache is created, as shown in Listing 1.

Listing 1. Creating a new slab cache

static struct kmem_cache *my_cachep;

static void init_my_cache( void )
{
    my_cachep = kmem_cache_create(
                    "my_cache",         /* Name */
                    32,                 /* Object size */
                    0,                  /* Alignment */
                    SLAB_HWCACHE_ALIGN, /* Flags */
                    NULL, NULL );       /* Constructor/destructor */
    return;
}

With your slab cache created, you can now allocate an object from it. Listing 2 provides an example of allocating and deallocating an object from the cache. It also demonstrates two of the miscellaneous functions.
Listing 2. Allocating and deallocating an object

int slab_test( void )
{
    void *object;

    printk( "Cache name is %s\n", kmem_cache_name( my_cachep ) );
    printk( "Cache object size is %u\n", kmem_cache_size( my_cachep ) );

    object = kmem_cache_alloc( my_cachep, GFP_KERNEL );

    if (object) {
        kmem_cache_free( my_cachep, object );
    }

    return 0;
}

Finally, Listing 3 is an example of destroying a slab cache. The caller must ensure that no attempts to allocate objects from the cache are performed during the destroy operation.
Listing 3. Destroying a slab cache

static void remove_my_cache( void )
{
    if (my_cachep)
        kmem_cache_destroy( my_cachep );

    return;
}
The proc file system provides a simple way to monitor all slab caches that are active in the system. This file, called /proc/slabinfo, provides detailed information about all slab caches in addition to a few tunables that are accessible from user space. The current version of slabinfo provides a header so that the output is a bit more readable. For each slab cache in the system, the number of objects, the number of active objects, and the object size are provided (in addition to the objects and pages per slab). A set of tunables and slab data is also provided.

To tune a particular slab cache, simply echo the slab cache name and the three tunable parameters as a string to the /proc/slabinfo file. The following example illustrates how to increase the limit and batchcount while leaving the shared factor as is (the format is "cache name limit batchcount shared factor"):

# echo "my_cache 128 64 8" > /proc/slabinfo

The limit field represents the maximum number of objects that will be cached per CPU, the batchcount field is the maximum number of global cache objects transferred to a per-CPU cache when it becomes empty, and the shared parameter governs sharing behavior on symmetric multiprocessing (SMP) systems. Note that you must have superuser privileges to tune parameters in the proc file system for a slab cache.

For small embedded systems, a slab emulation layer called the SLOB exists. This slab replacement is advantageous for small embedded Linux systems, and while it conserves up to 512KB of memory, it suffers from fragmentation and poor scaling. When CONFIG_SLAB is disabled, the kernel falls back to this SLOB allocator.

Going Further

The source code for the slab cache allocator is actually one of the more readable parts of the Linux kernel. Apart from the indirection in the function calls, the source is intuitive and, in general, well commented. If you'd like to know more about the slab cache allocator, I suggest that you start there, as it is the most up-to-date documentation on the mechanism. A number of other sources describe the slab cache allocator, but unfortunately all of them are out of date with respect to the current 2.6 implementation.