http://eduunix.jlbtc.edu.cn/index/html/linux/OReilly.Understanding.the.Linux.Kernel.3rd.Edition.Nov.2005.HAPPY.NEW.YEAR/0596005652/understandlk-CHP-8-SECT-1.html

8.1.8.1. Allocating page frames through the per-CPU page frame caches

The buffered_rmqueue( ) function allocates page frames in a given memory zone. It makes use of the per-CPU page frame caches to handle single page frame requests.

The parameters are the address of the memory zone descriptor, the order of the memory allocation request order, and the allocation flags gfp_flags. If the __GFP_COLD flag is set in gfp_flags, the page frame should be taken from the cold cache; otherwise, it should be taken from the hot cache (this flag is meaningful only for single page frame requests). The function essentially executes the following operations:

  1. If order is not equal to 0, the per-CPU page frame cache cannot be used: the function jumps to step 4.

  2. Checks whether the memory zone's local per-CPU cache identified by the value of the __GFP_COLD flag has to be replenished (the count field of the per_cpu_pages descriptor is lower than or equal to the low field). In this case, it executes the following substeps:

    1. Allocates batch single page frames from the buddy system by repeatedly invoking the __rmqueue( ) function.

    2. Inserts the descriptors of the allocated page frames in the cache's list.

    3. Updates the value of count by adding the number of page frames actually allocated.

  3. If count is positive, the function gets a page frame from the cache's list, decreases count, and jumps to step 5. (Observe that a per-CPU page frame cache could be empty; this happens when the __rmqueue( ) function invoked in the first substep of step 2 fails to allocate any page frames.)

  4. Here, the memory request has not yet been satisfied, either because the request spans several contiguous page frames, or because the selected page frame cache is empty. Invokes the __rmqueue( ) function to allocate the requested page frames from the buddy system.

  5. If the memory request has been satisfied, the function initializes the page descriptor of the (first) page frame: clears some flags, sets the private field to zero, and sets the page frame reference counter to one. Moreover, if the __GFP_ZERO flag in gfp_flags is set, it fills the allocated memory area with zeros.

  6. Returns the page descriptor address of the (first) page frame, or NULL if the memory allocation request failed.
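
The steps above can be sketched in user space as follows. This is a minimal model under stated assumptions, not the kernel's code: struct pcp_cache, struct toy_buddy, toy_rmqueue( ), and sketch_buffered_rmqueue( ) are hypothetical names, page frames are plain integers, an array stands in for the cache's list, and a return value of -1 plays the role of NULL. The descriptor initialization of step 5 is omitted.

```c
#include <assert.h>

#define PCP_CAPACITY 64   /* hypothetical fixed capacity for this sketch */

/* Simplified stand-in for the per_cpu_pages descriptor: count, low and
 * batch mirror the fields named in the text; the array replaces the
 * kernel's list of page descriptors. */
struct pcp_cache {
    int count;
    int low;
    int batch;
    int frames[PCP_CAPACITY];
};

/* Toy buddy system: hands out consecutive frame numbers until it runs dry. */
struct toy_buddy {
    int next_frame;
    int free_frames;
};

/* Stand-in for __rmqueue(): returns the first frame of a block of
 * 2^order frames, or -1 if the toy buddy system cannot satisfy it. */
static int toy_rmqueue(struct toy_buddy *b, unsigned int order)
{
    int n = 1 << order;
    int first;

    if (b->free_frames < n)
        return -1;
    first = b->next_frame;
    b->next_frame += n;
    b->free_frames -= n;
    return first;
}

static int sketch_buffered_rmqueue(struct toy_buddy *b, struct pcp_cache *pcp,
                                   unsigned int order)
{
    assert(pcp->count >= 0 && pcp->count <= PCP_CAPACITY);

    if (order == 0) {                        /* step 1: cache usable? */
        if (pcp->count <= pcp->low) {        /* step 2: replenish */
            for (int i = 0; i < pcp->batch && pcp->count < PCP_CAPACITY; i++) {
                int frame = toy_rmqueue(b, 0);
                if (frame < 0)
                    break;                   /* buddy system exhausted */
                pcp->frames[pcp->count++] = frame;
            }
        }
        if (pcp->count > 0)                  /* step 3: serve from the cache */
            return pcp->frames[--pcp->count];
    }
    /* step 4: multi-frame request, or the cache is still empty */
    return toy_rmqueue(b, order);
}
```

With a cache whose low field is 0 and batch field is 3, the first order-0 request pulls three frames from the toy buddy system and hands one of them to the caller, so later single-frame requests are served without touching the buddy system until count drops back to low.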

8.1.8.2. Releasing page frames to the per-CPU page frame caches

In order to release a single page frame to a per-CPU page frame cache, the kernel makes use of the free_hot_page( ) and free_cold_page( ) functions. Both of them are simple wrappers for the free_hot_cold_page( ) function, which receives as its parameters the descriptor address page of the page frame to be released and a cold flag specifying either the hot cache or the cold cache.

The free_hot_cold_page( ) function executes the following operations:

  1. Gets from the page->flags field the address of the memory zone descriptor including the page frame (see the earlier section "Non-Uniform Memory Access (NUMA)").

  2. Gets the address of the per_cpu_pages descriptor of the zone's cache selected by the cold flag.

  3. Checks whether the cache should be depleted: if count is higher than or equal to high, invokes the free_pages_bulk( ) function, passing to it the zone descriptor, the number of page frames to be released (batch field), the address of the cache's list, and the number zero (for 0-order page frames). In turn, the latter function repeatedly invokes the __free_pages_bulk( ) function to release the specified number of page frames (taken from the cache's list) to the buddy system of the memory zone.

  4. Adds the page frame to be released to the cache's list, and increases the count field.
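
A user-space sketch of steps 2 through 4 may help fix the idea. Everything here is an assumption made for illustration: struct pcp_cache, struct toy_buddy, toy_free_pages_bulk( ), and sketch_free_hot_cold_page( ) are hypothetical names, frames are integers, index 0 of the two-element array is taken to be the hot cache and index 1 the cold one, and the toy bulk release simply drops frames from the end of the array. Step 1 (finding the zone from page->flags) is left out.

```c
#include <assert.h>

#define PCP_CAPACITY 64   /* hypothetical fixed capacity for this sketch */

/* Simplified stand-in for the per_cpu_pages descriptor. */
struct pcp_cache {
    int count;
    int high;
    int batch;
    int frames[PCP_CAPACITY];
};

/* Toy buddy system: only counts the frames given back to it. */
struct toy_buddy {
    int free_frames;
};

/* Stand-in for free_pages_bulk(): moves up to nr frames from the cache
 * back to the toy buddy system and returns how many were moved. */
static int toy_free_pages_bulk(struct toy_buddy *b, struct pcp_cache *pcp, int nr)
{
    int freed = 0;

    while (freed < nr && pcp->count > 0) {
        pcp->count--;
        b->free_frames++;
        freed++;
    }
    return freed;
}

static void sketch_free_hot_cold_page(struct toy_buddy *b,
                                      struct pcp_cache caches[2],
                                      int frame, int cold)
{
    struct pcp_cache *pcp = &caches[cold ? 1 : 0];   /* step 2 */

    if (pcp->count >= pcp->high)                     /* step 3: deplete */
        toy_free_pages_bulk(b, pcp, pcp->batch);

    assert(pcp->count < PCP_CAPACITY);
    pcp->frames[pcp->count++] = frame;               /* step 4 */
}
```

Note the ordering: the cache is trimmed before the new frame is added, so a cache at its high watermark ends up holding high - batch + 1 frames after the call.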

It should be noted that in the current version of the Linux 2.6 kernel, no page frame is ever released to the cold cache: the kernel always assumes the freed page frame is hot with respect to the hardware cache. Of course, this does not mean that the cold cache is empty: the cache is replenished by buffered_rmqueue( ) when the low watermark has been reached.

8.1.9. The Zone Allocator

The zone allocator is the frontend of the kernel page frame allocator. This component must locate a memory zone that includes a number of free page frames large enough to satisfy the memory request. This task is not as simple as it might appear at first glance, because the zone allocator must satisfy several goals:

  • It should protect the pool of reserved page frames (see the earlier section "The Pool of Reserved Page Frames").

  • It should trigger the page frame reclaiming algorithm (see Chapter 17) when memory is scarce and blocking the current process is allowed; once some page frames have been freed, the zone allocator will retry the allocation.

  • It should preserve the small, precious ZONE_DMA memory zone, if possible. For instance, the zone allocator should be somewhat reluctant to assign page frames in the ZONE_DMA memory zone if the request was for ZONE_NORMAL or ZONE_HIGHMEM page frames.

We have seen in the earlier section "The Zoned Page Frame Allocator" that every request for a group of contiguous page frames is eventually handled by executing the alloc_pages macro. This macro, in turn, ends up invoking the __alloc_pages( ) function, which is the core of the zone allocator. It receives three parameters:


gfp_mask

The flags specified in the memory allocation request (see earlier Table 8-5)


order

The logarithmic size of the group of contiguous page frames to be allocated


zonelist

Pointer to a zonelist data structure describing, in order of preference, the memory zones suitable for the memory allocation

The __alloc_pages( ) function scans every memory zone included in the zonelist data structure. The code that does this looks like the following:

for (i = 0; (z=zonelist->zones[i]) != NULL; i++) {
    if (zone_watermark_ok(z, order, ...)) {
        page = buffered_rmqueue(z, order, gfp_mask);
        if (page)
            return page;
    }
}

For each memory zone, the function compares the number of free page frames with a threshold value that depends on the memory allocation flags, on the type of current process, and on how many times the zone has already been checked by the function. In fact, if free memory is scarce, every memory zone is typically scanned several times, each time with a lower threshold on the minimal amount of free memory required for the allocation. The previous block of code is thus replicated several times (with minor variations) in the body of the __alloc_pages( ) function. The buffered_rmqueue( ) function has been described already in the earlier section "The Per-CPU Page Frame Cache": it returns the page descriptor of the first allocated page frame, or NULL if the memory zone does not include a group of contiguous page frames of the requested size.
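
To make the scan concrete, here is a self-contained user-space sketch of one pass over a NULL-terminated zone list. It rests on several assumptions: struct toy_zone, struct toy_zonelist, toy_watermark_ok( ), and sketch_scan_zonelist( ) are hypothetical names, the watermark test is reduced to a single comparison against the zone's pages_low value, and "allocating" just decrements a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ZONES 3   /* hypothetical upper bound for this sketch */

/* Toy zone: just a free-page counter and one watermark. */
struct toy_zone {
    long free_pages;
    long pages_low;
};

/* Zones in order of preference, terminated by a NULL pointer. */
struct toy_zonelist {
    struct toy_zone *zones[MAX_ZONES + 1];
};

/* Much-simplified threshold test: after removing the requested
 * 2^order frames, at least mark free frames must remain. */
static int toy_watermark_ok(const struct toy_zone *z, unsigned int order, long mark)
{
    return z->free_pages - (1L << order) >= mark;
}

/* One pass over the zone list; returns the zone that satisfied the
 * request, or NULL if every zone was below its threshold. */
static struct toy_zone *sketch_scan_zonelist(struct toy_zonelist *zl,
                                             unsigned int order)
{
    struct toy_zone *z;

    assert(zl != NULL);
    for (int i = 0; (z = zl->zones[i]) != NULL; i++) {
        if (toy_watermark_ok(z, order, z->pages_low)) {
            z->free_pages -= 1L << order;   /* stands in for buffered_rmqueue() */
            return z;
        }
    }
    return NULL;
}
```

Because the list is ordered by preference, a request falls through to a less preferred zone only when every earlier zone fails its threshold test.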

The zone_watermark_ok( ) auxiliary function receives several parameters, which determine a threshold min on the number of free page frames in the memory zone. In particular, the function returns the value 1 if the following two conditions are met:

  1. Besides the page frames to be allocated, there are at least min free page frames in the memory zone, not including the page frames in the low-on-memory reserve (lowmem_reserve field of the zone descriptor).

  2. Besides the page frames to be allocated, there are at least min/2^k free page frames in blocks of order at least k, for each k between 1 and the order of the allocation. Therefore, if order is greater than zero, there must be at least min/2 free page frames in blocks of size at least 2; if order is greater than one, there must be at least min/4 free page frames in blocks of size at least 4; and so on.

The value of the threshold min is determined by zone_watermark_ok( ) as follows:

  • The base value is passed as a parameter of the function and can be one of the pages_min, pages_low, and pages_high zone's watermarks (see the section "The Pool of Reserved Page Frames" earlier in this chapter).

  • The base value is divided by two if the gfp_high flag passed as parameter is set. Usually, this flag is equal to one if the __GFP_HIGH flag is set in the gfp_mask, that is, if the request is allowed to draw on the pool of reserved page frames.

  • The threshold value is further reduced by one-fourth if the can_try_harder flag passed as parameter is set. This flag is usually equal to one if either the __GFP_WAIT flag is clear in gfp_mask (the caller cannot block to reclaim page frames), or if the current process is a real-time process and the memory allocation is done in process context (outside of interrupt handlers and deferrable functions).
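
The two conditions and the threshold adjustments can be combined into a short user-space sketch. It follows the wording of the text rather than the kernel source, so treat the details as assumptions: struct toy_zone and sketch_zone_watermark_ok( ) are hypothetical names, lowmem_reserve is a single number here, nr_free[k] counts the free blocks of order k, and the comparisons use "at least" exactly as stated above (boundary handling in the real function may differ).

```c
#include <assert.h>

#define MAX_ORDER 11   /* assumed number of block orders in this sketch */

/* Toy zone descriptor with only the fields the check needs. */
struct toy_zone {
    long free_pages;            /* total free page frames */
    long lowmem_reserve;        /* low-on-memory reserve */
    long nr_free[MAX_ORDER];    /* free blocks of each order */
};

static int sketch_zone_watermark_ok(const struct toy_zone *z, int order,
                                    long mark, int can_try_harder, int gfp_high)
{
    long min = mark;
    long free_pages;

    assert(order >= 0 && order < MAX_ORDER);

    /* Free frames that would remain besides the ones being allocated. */
    free_pages = z->free_pages - (1L << order);

    if (gfp_high)
        min -= min / 2;         /* base value divided by two */
    if (can_try_harder)
        min -= min / 4;         /* further reduced by one-fourth */

    /* Condition 1: at least min frames left, not counting the reserve. */
    if (free_pages < min + z->lowmem_reserve)
        return 0;

    /* Condition 2: for each k in 1..order, at least min/2^k free
     * frames must sit in blocks of order k or higher. */
    for (int k = 1; k <= order; k++) {
        free_pages -= z->nr_free[k - 1] << (k - 1);  /* drop blocks of order k-1 */
        min >>= 1;
        if (free_pages < min)
            return 0;
    }
    return 1;
}
```

For example, a zone with 100 free frames that are all isolated order-0 blocks passes an order-0 check against a threshold of 50 but fails an order-1 check, because no free frame belongs to a block of size 2 or larger.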

The __alloc_pages( ) function essentially executes the following steps:

  1. Performs a first scanning of the memory zones (see the block of code shown earlier). In this first scan, the min threshold value is set to z->pages_low, where z points to the zone descriptor being analyzed (the can_try_harder and gfp_high parameters are set to zero).

  2. If the function did not terminate in the previous step, there is not much free memory left: the function awakens the kswapd kernel threads to start reclaiming page frames asynchronously (see Chapter 17).

  3. Performs a second scanning of the memory zones, passing as base threshold the value z->pages_min. As explained previously, the actual threshold is determined also by the can_try_harder and gfp_high flags. This step is nearly identical to step 1, except that the function is using a lower threshold.

  4. If the function did not terminate in the previous step, the system is definitely low on memory. If the kernel control path that issued the memory allocation request is not an interrupt handler or a deferrable function and it is trying to reclaim page frames (either the PF_MEMALLOC flag or the PF_MEMDIE flag of current is set), the function then performs a third scanning of the memory zones, trying to allocate the page frames ignoring the low-on-memory thresholds, that is, without invoking zone_watermark_ok( ). This is the only case where the kernel control path is allowed to deplete the low-on-memory reserve of pages specified by the lowmem_reserve field of the zone descriptor. In fact, in this case the kernel control path that issued the memory request is ultimately trying to free page frames, thus it should get what it has requested, if at all possible. If no memory zone includes enough page frames, the function returns NULL to notify the caller of the failure.

  5. Here, the invoking kernel control path is not trying to reclaim memory. If the __GFP_WAIT flag of gfp_mask is not set, the function returns NULL to notify the kernel control path of the memory allocation failure: in this case, there is no way to satisfy the request without blocking the current process.

  6. Here the current process can be blocked: invokes cond_resched() to check whether some other process needs the CPU.

  7. Sets the PF_MEMALLOC flag of current, to denote the fact that the process is ready to perform memory reclaiming.

  8. Stores in current->reclaim_state a pointer to a reclaim_state structure. This structure includes just one field, reclaimed_slab, initialized to zero (we'll see how this field is used in the section "Interfacing the Slab Allocator with the Zoned Page Frame Allocator" later in this chapter).

  9. Invokes try_to_free_pages( ) to look for some page frames to be reclaimed (see the section "Low On Memory Reclaiming" in Chapter 17). The latter function may block the current process. Once that function returns, __alloc_pages( ) resets the PF_MEMALLOC flag of current and invokes once more cond_resched().

  10. If the previous step has freed some page frames, the function performs yet another scanning of the memory zones equal to the one performed in step 3. If the memory allocation request cannot be satisfied, the function determines whether it should continue scanning the memory zone: if the __GFP_NORETRY flag is clear and either the memory allocation request spans up to eight page frames, or one of the __GFP_REPEAT and __GFP_NOFAIL flags is set, the function invokes blk_congestion_wait( ) to put the process to sleep for a while (see Chapter 14), and it jumps back to step 6. Otherwise, the function returns NULL to notify the caller that the memory allocation failed.

  11. If no page frame has been freed in step 9, the kernel is in deep trouble, because free memory is dangerously low and it was not possible to reclaim any page frame. Perhaps the time has come to make a crucial decision. If the kernel control path is allowed to perform the filesystem-dependent operations needed to kill a process (the __GFP_FS flag in gfp_mask is set) and the __GFP_NORETRY flag is clear, the function performs the following substeps:

    1. Scans once again the memory zones with a threshold value equal to z->pages_high.

    2. Invokes out_of_memory() to start freeing some memory by killing a victim process (see "The Out of Memory Killer" in Chapter 17).

    3. Jumps back to step 1.

    Because the watermark used in the first substep of step 11 is much higher than the watermarks used in the previous scannings, that substep is likely to fail. Actually, it succeeds only if another kernel control path is already killing a process to reclaim its memory. Thus, the first substep prevents two innocent processes from being killed instead of one.
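
The flag tests that steer steps 5, 10, and 11 are easy to lose in the prose, so the sketch below isolates them as three small predicates. It is an illustration only: the SK_GFP_* constants are hypothetical bit values chosen for this example (they are not the kernel's definitions), the function names are invented, and the zone scans, reclaiming, and process killing themselves are not modeled.

```c
#include <assert.h>

/* Hypothetical flag bits for this sketch; NOT the kernel's values. */
enum {
    SK_GFP_WAIT    = 1 << 0,
    SK_GFP_FS      = 1 << 1,
    SK_GFP_REPEAT  = 1 << 2,
    SK_GFP_NOFAIL  = 1 << 3,
    SK_GFP_NORETRY = 1 << 4,
    SK_GFP_ALL     = (1 << 5) - 1
};

/* Step 5: without the WAIT flag the caller cannot be blocked, so the
 * allocation must fail instead of entering direct reclaim. */
static int sketch_can_block(unsigned int gfp_mask)
{
    assert((gfp_mask & ~(unsigned int)SK_GFP_ALL) == 0);
    return (gfp_mask & SK_GFP_WAIT) != 0;
}

/* Step 10: after a failed rescan, retry only if NORETRY is clear and
 * either the request spans at most eight frames (order <= 3) or one
 * of REPEAT and NOFAIL is set. */
static int sketch_should_retry(unsigned int gfp_mask, unsigned int order)
{
    assert((gfp_mask & ~(unsigned int)SK_GFP_ALL) == 0);
    if (gfp_mask & SK_GFP_NORETRY)
        return 0;
    return order <= 3 || (gfp_mask & (SK_GFP_REPEAT | SK_GFP_NOFAIL)) != 0;
}

/* Step 11: killing a process is considered only if filesystem
 * operations are allowed and NORETRY is clear. */
static int sketch_may_kill_process(unsigned int gfp_mask)
{
    assert((gfp_mask & ~(unsigned int)SK_GFP_ALL) == 0);
    return (gfp_mask & SK_GFP_FS) != 0 && (gfp_mask & SK_GFP_NORETRY) == 0;
}
```

Under these definitions, a blocking order-4 request is retried only when it carries one of the REPEAT or NOFAIL flags, while NORETRY overrides both the retry loop and the process-killing path.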



