8.1.8.1. Allocating page frames through the per-CPU page frame caches
The
buffered_rmqueue( ) function allocates
page frames in a given memory zone. It makes use of the per-CPU page
frame caches to handle single page frame requests.
The
parameters are the address of the memory zone descriptor, the order of
the memory allocation request order, and the allocation flags gfp_flags.
If the __GFP_COLD flag is set in gfp_flags,
the page frame should be taken from the cold cache, otherwise it should
be taken from the hot cache (this flag is meaningful only for single
page frame requests). The function essentially executes the following
operations:
-
If
order is not equal to 0, the per-CPU page frame cache cannot be used:
the function jumps to step 4.
-
Checks
whether the memory zone's local per-CPU cache identified by the value
of the __GFP_COLD flag has to be replenished (the count field of the
per_cpu_pages descriptor is lower than or equal to the low field). In
this case, it executes the following substeps:
-
Allocates batch single page frames
from the buddy system by repeatedly invoking the __rmqueue( ) function.
-
Inserts the descriptors of the
allocated page frames in the cache's list.
-
Updates the value of count by
adding the number of page frames actually allocated.
-
If
count is positive, the function gets a page frame from the cache's
list, decreases count, and jumps to step 5. (Observe that a per-CPU
page frame cache could be empty; this happens when the __rmqueue( )
function invoked in step 2a fails to allocate any page frames.)
-
Here,
the memory request has not yet been satisfied, either because the
request spans several contiguous page frames, or because the selected
page frame cache is empty. Invokes the __rmqueue( ) function to
allocate the requested page frames from the buddy system.
-
If
the memory request has been satisfied, the function initializes the
page descriptor of the (first) page frame: clears some flags, sets the
private field to zero, and sets the page frame reference counter to
one. Moreover, if the __GFP_ZERO flag in gfp_flags is set, it fills
the allocated memory area with zeros.
-
Returns
the page descriptor address of the (first) page frame, or NULL if the
memory allocation request failed.
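The allocation steps above can be sketched as a small user-space model. This is not the kernel code: the `model_` types and functions, the fixed-size array standing in for the cache's list, and the counter-based fake buddy system are all assumptions made for illustration. Step 5 (initializing the page descriptor and zeroing the memory) is left out.

```c
#include <stddef.h>

#define MODEL_CACHE_CAPACITY 64   /* hypothetical bound, only for this model */

/* Hypothetical stand-in for the per_cpu_pages descriptor. */
struct model_pcp {
    int count;                          /* page frames currently cached */
    int low;                            /* replenish when count <= low */
    int batch;                          /* page frames fetched per refill */
    int frames[MODEL_CACHE_CAPACITY];   /* frame numbers, replaces the list */
};

/* Hypothetical stand-in for the memory zone descriptor. */
struct model_zone {
    int next_free;              /* next frame number the fake buddy system returns */
    int free_left;              /* page frames the fake buddy system still owns */
    struct model_pcp pcp[2];    /* [0] = hot cache, [1] = cold cache */
};

/* Stand-in for __rmqueue(): returns the first frame number of a block
 * of 2^order frames, or -1 if the fake buddy system cannot satisfy it. */
static int model_rmqueue(struct model_zone *z, int order)
{
    int n = 1 << order;
    int first;

    if (z->free_left < n)
        return -1;
    first = z->next_free;
    z->next_free += n;
    z->free_left -= n;
    return first;
}

/* Follows steps 1 to 4 and 6 of the text; returns a frame number or -1. */
static int model_buffered_rmqueue(struct model_zone *z, int order, int cold)
{
    if (order == 0) {                               /* step 1 */
        struct model_pcp *pcp = &z->pcp[cold ? 1 : 0];

        if (pcp->count <= pcp->low) {               /* step 2 */
            for (int i = 0; i < pcp->batch &&
                            pcp->count < MODEL_CACHE_CAPACITY; i++) {
                int f = model_rmqueue(z, 0);        /* step 2a */
                if (f < 0)
                    break;                          /* buddy system exhausted */
                pcp->frames[pcp->count++] = f;      /* steps 2b and 2c */
            }
        }
        if (pcp->count > 0)                         /* step 3 */
            return pcp->frames[--pcp->count];
    }
    return model_rmqueue(z, order);                 /* step 4 */
}
```

In this model a single-frame request triggers a refill of `batch` frames and then pops one of them, while a request with a nonzero order bypasses the cache entirely.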
8.1.8.2. Releasing page frames to the per-CPU page frame caches
In
order to release a single page frame to a per-CPU page frame cache, the
kernel makes use of the free_hot_page( ) and free_cold_page( )
functions. Both of them are simple wrappers for the free_hot_cold_page( )
function, which receives as its parameters the descriptor address page
of the page frame to be released and a cold flag specifying either the
hot cache or the cold cache.
The
free_hot_cold_page( ) function executes the following operations:
-
Gets
from the page->flags field the address of the memory zone descriptor
including the page frame (see the earlier section "Non-Uniform
Memory Access (NUMA)").
-
Gets
the address of the per_cpu_pages descriptor of the zone's cache
selected by the cold flag.
-
Checks
whether the cache should be depleted: if count is higher than or equal
to high, invokes the free_pages_bulk( ) function, passing to it the
zone descriptor, the number of page frames to be released (batch
field), the address of the cache's list, and the number zero (for
0-order page frames). In turn, the latter function repeatedly invokes
the __free_pages_bulk( ) function to release the specified number of
page frames (taken from the cache's list) to the buddy system of the memory
zone.
-
Adds
the page frame to be released to the cache's list, and increases the
count field.
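The release path can be modeled in the same spirit. The sketch below is a user-space illustration, not the kernel implementation: the `model_` names, the array used in place of the cache's list, and the plain counter used in place of the buddy system are assumptions. Step 1 (deriving the zone from page->flags) is skipped because the model passes the zone explicitly, and the order in which frames leave the cache is not meant to match the kernel.

```c
#define MODEL_CACHE_CAPACITY 64   /* hypothetical bound, only for this model */

/* Hypothetical stand-in for the per_cpu_pages descriptor. */
struct model_pcp {
    int count;                          /* page frames currently cached */
    int high;                           /* drain when count >= high */
    int batch;                          /* page frames released per drain */
    int frames[MODEL_CACHE_CAPACITY];   /* frame numbers, replaces the list */
};

/* Hypothetical stand-in for the memory zone descriptor. */
struct model_zone {
    int buddy_free;             /* page frames handed back to the fake buddy system */
    struct model_pcp pcp[2];    /* [0] = hot cache, [1] = cold cache */
};

/* Stand-in for free_pages_bulk(): moves up to n single page frames
 * from the cache to the fake buddy system; returns how many moved. */
static int model_free_pages_bulk(struct model_zone *z,
                                 struct model_pcp *pcp, int n)
{
    int moved = 0;

    while (moved < n && pcp->count > 0) {
        pcp->count--;
        z->buddy_free++;
        moved++;
    }
    return moved;
}

/* Follows steps 2 to 4 of the text. */
static void model_free_hot_cold_page(struct model_zone *z, int frame, int cold)
{
    struct model_pcp *pcp = &z->pcp[cold ? 1 : 0];      /* step 2 */

    if (pcp->count >= pcp->high)                        /* step 3 */
        model_free_pages_bulk(z, pcp, pcp->batch);

    if (pcp->count < MODEL_CACHE_CAPACITY)
        pcp->frames[pcp->count++] = frame;              /* step 4 */
    else
        z->buddy_free++;    /* model-only guard for a full array */
}
```

With `high` set to 3 and `batch` set to 2, the fourth release finds the cache at its high watermark, drains two frames to the buddy counter, and then caches the new frame.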
It
should be noted that in the current version of
the Linux 2.6 kernel, no page frame is ever released to the cold cache:
the kernel always assumes the freed page frame is hot with respect to
the hardware cache. Of course, this does not mean that the cold cache
is empty: the cache is replenished by buffered_rmqueue( ) when the low
watermark has been reached.
8.1.9. The Zone Allocator
The zone allocator
is the frontend of the kernel page frame allocator. This component must
locate a memory zone that includes a number of free page frames large
enough to satisfy the memory request. This task is not as simple as it
could appear at first glance, because the zone allocator must satisfy
several goals:
-
It
should protect the pool of reserved page frames (see the earlier
section "The
Pool of Reserved Page Frames").
-
It
should trigger the page frame reclaiming algorithm (see Chapter
17)
when memory is scarce and blocking the current process is allowed; once
some page frames have been freed, the zone allocator will retry the
allocation.
-
It
should preserve the small, precious ZONE_DMA memory zone, if possible.
For instance, the zone allocator should be somewhat reluctant to assign
page frames in the ZONE_DMA memory zone if the request was for
ZONE_NORMAL or ZONE_HIGHMEM page frames.
We
have seen in the earlier section "The
Zoned Page Frame Allocator" that every request for a group of
contiguous page frames is eventually handled by executing the
alloc_pages macro. This macro, in turn, ends up invoking the
__alloc_pages( ) function, which is the core of the zone allocator. It
receives three parameters:
gfp_mask
-
The
flags specified in the memory allocation request (see earlier Table
8-5)
order
-
The
logarithmic size of the group of contiguous page frames to be allocated
zonelist
-
Pointer
to a zonelist data structure describing, in order of preference, the
memory zones suitable for the memory allocation
The __alloc_pages( ) function scans every memory zone included in the
zonelist data structure. The code that does this looks like the
following:
for (i = 0; (z = zonelist->zones[i]) != NULL; i++) {
    if (zone_watermark_ok(z, order, ...)) {
        page = buffered_rmqueue(z, order, gfp_mask);
        if (page)
            return page;
    }
}
For
each memory zone, the function compares the
number of free page frames with a threshold value that depends on the
memory allocation flags, on the type of current process, and on how
many times the zone has already been checked by the function. In fact,
if free memory is scarce, every memory zone is typically scanned
several times, each time with a lower threshold on the minimal amount of
free memory required for the allocation. The previous block of code is
thus replicated several times, with minor variations, in the body of the
__alloc_pages( ) function. The buffered_rmqueue( ) function has been
described already in the earlier section "The
Per-CPU Page Frame Cache": it returns the page descriptor of the
first allocated page frame, or NULL if the memory zone does not include
a group of contiguous page frames of the requested size.
The
zone_watermark_ok( ) auxiliary function receives several parameters,
which determine a threshold min
on the number of free page frames in the memory zone. In particular,
the function returns the value 1 if the following two conditions are
met:
-
Besides
the page frames to be allocated, there are at least min free page
frames in the memory zone, not including the page frames in the
low-on-memory reserve (lowmem_reserve field of the zone
descriptor).
-
Besides
the page frames to be allocated, there are at least min/2^k free page frames in blocks of order
at least k, for each k between 1 and the order of the
allocation. Therefore, if order is greater than zero, there must be at least min/2 free page frames in blocks of size
at least 2; if order is greater than one, there must be at least min/4
free page frames in blocks of size at least 4; and so on.
The
value of the threshold min is determined by zone_watermark_ok( ) as
follows:
-
The
base value is passed as a parameter of the function and can be one of
the pages_min, pages_low, and pages_high zone watermarks (see the
section "The
Pool of Reserved Page Frames" earlier in this chapter).
-
The
base value is divided by two if the gfp_high flag passed as parameter
is set. Usually, this flag is equal to one if the __GFP_HIGH flag
is set in the gfp_mask, that is, if the request is a high-priority one
that is allowed to dip into the pool of reserved page frames.
-
The
threshold value is further reduced by one-fourth if the can_try_harder
flag passed as parameter is set. This flag is usually equal to one if
either the __GFP_WAIT flag is clear in gfp_mask (the request cannot block),
or if the current process is a real-time process and the memory
allocation is done in process context (outside of interrupt handlers
and deferrable functions).
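The two conditions and the threshold adjustments just described can be put together in a short user-space sketch. It is modeled on the prose above rather than copied from the kernel: the `model_zone` layout, the single `lowmem_reserve` value, the `nr_free` array of per-order block counts, and the exact form of the comparisons are assumptions of this illustration.

```c
#define MODEL_MAX_ORDER 11   /* hypothetical number of block orders */

/* Hypothetical, reduced view of a memory zone descriptor. */
struct model_zone {
    long free_pages;                 /* free page frames in the zone */
    long lowmem_reserve;             /* low-on-memory reserve */
    long nr_free[MODEL_MAX_ORDER];   /* free blocks available at each order */
};

/* Returns 1 if an allocation of 2^order page frames would leave the
 * zone above the threshold derived from mark, 0 otherwise. */
static int model_zone_watermark_ok(const struct model_zone *z, int order,
                                   long mark, int can_try_harder,
                                   int gfp_high)
{
    long min = mark;
    /* Free page frames left after the allocation, plus one so that the
     * "<=" tests below mean "fewer than min frames would remain". */
    long free_pages = z->free_pages - (1L << order) + 1;

    if (gfp_high)
        min -= min / 2;          /* base value divided by two */
    if (can_try_harder)
        min -= min / 4;          /* further reduced by one-fourth */

    /* First condition: at least min frames beyond the reserve. */
    if (free_pages <= min + z->lowmem_reserve)
        return 0;

    /* Second condition: at least min/2^k frames in blocks of order >= k. */
    for (int o = 0; o < order; o++) {
        free_pages -= z->nr_free[o] << o;   /* discard blocks that are too small */
        min >>= 1;
        if (free_pages <= min)
            return 0;
    }
    return 1;
}
```

For a zone whose 100 free page frames are all isolated single frames, an order-0 request against a mark of 50 passes, while an order-1 request fails because no frames remain in blocks of size 2 or larger.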
The __alloc_pages( ) function essentially executes the following steps:
-
Performs
a first scanning of the memory zones (see the block of code shown
earlier). In this first scan, the min threshold value is set to
z->pages_low, where z points to the zone descriptor being analyzed
(the can_try_harder and gfp_high parameters are set to zero).
-
If
the function did not terminate in the previous step, there is not much
free memory left: the function awakens the kswapd kernel threads to start reclaiming page
frames asynchronously (see Chapter
17).
-
Performs
a second scanning of the memory zones, passing as base threshold the
value z->pages_min. As explained previously, the actual threshold is
determined also by the can_try_harder and gfp_high flags. This step is
nearly identical to step 1, except that the function is using a lower
threshold.
-
If
the function did not terminate in the previous step, the system is
definitely low on memory. If the kernel control path that issued the
memory allocation request is not an interrupt handler or a deferrable
function and it is trying to reclaim page frames (either the PF_MEMALLOC
flag or the PF_MEMDIE flag of current
is set), the function then performs a third scanning of the memory
zones, trying to allocate the page frames ignoring the low-on-memory
thresholds, that is, without invoking zone_watermark_ok( ). This is the
only case where the kernel control path is allowed to deplete the
low-on-memory reserve of pages specified by the lowmem_reserve
field of the zone descriptor. In fact, in this case the kernel control
path that issued the memory request is ultimately trying to free page
frames, thus it should get what it has requested, if at all possible.
If no memory zone includes enough page frames, the function returns NULL
to notify the caller of the failure.
-
Here,
the invoking kernel control path is not trying to reclaim memory. If
the __GFP_WAIT flag of gfp_mask is not set, the function returns NULL
to notify the kernel control path of the memory allocation failure: in
this case, there is no way to satisfy the request without blocking the
current process.
-
Here
the current process can be blocked: invokes cond_resched() to check
whether some other process needs the CPU.
-
Sets
the PF_MEMALLOC flag of current, to denote the fact that the process is
ready to perform memory reclaiming.
-
-
Invokes
try_to_free_pages( ) to look for some page frames to be reclaimed (see
the section "Low On Memory Reclaiming" in Chapter
17). The latter function may block the current process. Once that
function returns, __alloc_pages( ) resets the PF_MEMALLOC flag of
current and invokes cond_resched() once more.
-
If
the previous step has freed some page frames, the function performs yet
another scanning of the memory zones equal to the one performed in step
3. If the memory allocation request cannot be satisfied, the function
determines whether it should continue scanning the memory zones: if the
__GFP_NORETRY flag is clear and either the memory allocation request
spans up to eight page frames, or one of the __GFP_REPEAT and
__GFP_NOFAIL flags is set, the function invokes blk_congestion_wait( )
to put the process to sleep for a while (see Chapter
14), and it jumps back to step 6. Otherwise, the function returns
NULL to notify the caller that the memory allocation failed.
-
If
no page frame has been freed in step 9, the kernel is in deep trouble,
because free memory is dangerously low and it was not possible to
reclaim any page frame. Perhaps the time has come to make a crucial
decision. If the kernel control path is allowed to perform the
filesystem-dependent operations needed to kill a process (the __GFP_FS
flag in gfp_mask is set) and the __GFP_NORETRY flag is clear, performs
the following substeps:
-
Scans once again the memory zones
with a threshold value equal to z->pages_high.
-
-
Because
the watermark used in step 11a is much higher than the watermarks used
in the previous scans, that step is likely to fail. Actually, step
11a succeeds only if another kernel control path is already killing a
process to reclaim its memory. Thus, step 11a prevents two innocent
processes from being killed instead of one.
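The flag tests that decide between retrying, giving up, and considering a process kill (steps 10 and 11) can be written as two small predicates. This is only a sketch of the conditions as stated in the text: the `MODEL_GFP_*` bit values and the function names are hypothetical and do not correspond to the kernel's actual definitions.

```c
/* Hypothetical bit values chosen for this sketch; the real flag
 * definitions live in the kernel headers. */
enum {
    MODEL_GFP_FS      = 1 << 0,
    MODEL_GFP_REPEAT  = 1 << 1,
    MODEL_GFP_NOFAIL  = 1 << 2,
    MODEL_GFP_NORETRY = 1 << 3
};

/* Step 10: reclaiming freed some page frames but the scan still
 * failed. Returns 1 if the function should sleep and go back to
 * step 6, 0 if it should return NULL to the caller. */
static int model_should_retry(unsigned int gfp_mask, int order)
{
    if (gfp_mask & MODEL_GFP_NORETRY)
        return 0;
    if (order <= 3)              /* request spans up to eight page frames */
        return 1;
    return (gfp_mask & (MODEL_GFP_REPEAT | MODEL_GFP_NOFAIL)) != 0;
}

/* Step 11: no page frame was reclaimed. Returns 1 if the kernel
 * control path may go on to the substeps that can kill a process. */
static int model_may_kill_process(unsigned int gfp_mask)
{
    return (gfp_mask & MODEL_GFP_FS) && !(gfp_mask & MODEL_GFP_NORETRY);
}
```

Under these predicates, an order-4 request with no special flags gives up after one reclaim pass, whereas the same request with the model's REPEAT bit set keeps retrying.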