I was planning to annotate the opaque calculation in __fragmentation_index() but on closer inspection I think there may be a bug. I could use some feedback.
Firstly, for the case of fragmentation and ignoring the scaling, __fragmentation_index() purports to return a value in the range 0 to 1. Generally, however, the lower bound is actually 0.5. Here's an illustration using a zone that I fragmented with selective calls to __alloc_pages() and __free_pages(). The fragmentation for order-1 could not be minimised further, yet it is reported as 0.5:

# head -1 /proc/buddyinfo
Node 0, zone DMA 1983 0 0 0 0 0 0 0 0 0 0
# head -1 /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 0.500 0.750 0.875 0.937 0.969 0.984 0.992 0.996 0.998 0.999
#

This is significant because 0.5 is the default value of sysctl_extfrag_threshold, meaning that compaction will not be suppressed for larger blocks when memory is scarce rather than fragmented. Of course, sysctl_extfrag_threshold is a tuneable so the first question is: does this even matter?

The calculation in __fragmentation_index() isn't documented but the apparent error in the lower bound may be explained by showing that the index is approximated by

    F ~ 1 - 1/N

where N is (conceptually) the number of free blocks into which each potential requested-size block has been split. I.e. if all free space were compacted then there would be B free blocks of the requested size where

    B = info->free_pages/requested

and thus

    N = info->free_blocks_total/B

The case of least fragmentation must be when all of the requested-size blocks have been split just once to form twice as many blocks in the next lowest free list. Thus the lowest value of N is 2 and the lowest value of F is 0.5.

I readied a patch that, in essence, defined F = 1 - 2/N and thereby set the bounds of __fragmentation_index() as 0 <= F < 1. Before sending it, I realised that, during testing, I *had* seen the occasional instance of F < 0.5, e.g. F = 0.499.
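
To make the approximation F ~ 1 - 1/N concrete, here is a minimal userspace sketch in C. It is not the kernel code: struct frag_info and the function names are hypothetical, the struct only mirrors the two info fields used above, and floating point stands in for the kernel's scaled integer arithmetic.

```c
#include <stdio.h>

/* Hypothetical stand-in for the two fields of "info" used in the text. */
struct frag_info {
    unsigned long free_pages;        /* total free pages in the zone */
    unsigned long free_blocks_total; /* total free blocks across all orders */
};

/* N: number of free blocks per potential requested-size block. */
static double split_factor(const struct frag_info *info, unsigned int order)
{
    double requested = (double)(1UL << order);
    double b = (double)info->free_pages / requested; /* B = free_pages/requested */

    return (double)info->free_blocks_total / b;       /* N = free_blocks_total/B */
}

/* Approximate index F ~ 1 - 1/N; assumes no suitable free block exists. */
static double approx_index(const struct frag_info *info, unsigned int order)
{
    return 1.0 - 1.0 / split_factor(info, order);
}

/* Print N and the approximate index for orders 1..max_order. */
static void print_approx_table(const struct frag_info *info, unsigned int max_order)
{
    for (unsigned int order = 1; order <= max_order; order++)
        printf("order %2u: N = %6.0f  F ~ %.3f\n",
               order, split_factor(info, order), approx_index(info, order));
}
```

For the DMA zone above (free_pages = free_blocks_total = 1983, since every free block is order-0), this gives N = 2, 4, 8 and F ~ 0.500, 0.750, 0.875 for orders 1 to 3, in line with the extfrag_index output.
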
Revisiting the calculation, I see that the actual implementation is

    F = 1 - [1/N + 1/info->free_blocks_total]

meaning that a very severe shortage of free memory *could* tip the balance in favour of "low fragmentation". Although this seems highly unlikely to occur outside testing, it does reflect the directive in the comment above the function, i.e. favour page reclaim when fragmentation is low. My second question: is the current implementation of F intentional and, if not, what is the actual intent?

The comments in compaction_suitable() suggest that the compaction/page reclaim decision is one of cost but, as compaction is linear, this isn't what __fragmentation_index() is calculating. A more reasonable argument is that there's generally some lower limit on the fragmentation achievable through compaction, given the inevitable presence of non-migratable pages. Is there anything else going on?

Robert Harris