Hello,
I had some problems with bootmem allocations that need memory in
the first 4GB. On a NUMA system with enough memory, alloc_bootmem
just walks the nodes with for_each_pgdat and tries them in turn. When
the nodes are added to bootmem in the straightforward order, starting
from 0, they end up reversed on the pgdat_list because init_bootmem_node
always inserts the new node at the head of the list. As a result,
alloc_bootmem looks into the last node first and, if there is enough
memory there, allocates from it, which can be beyond 4GB.
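To make the ordering problem concrete, here is a small userspace sketch. It
is not kernel code: struct node, insert_head() and first_fit() are made-up
stand-ins for pg_data_t, the head insertion into pgdat_list, and the
for_each_pgdat walk in alloc_bootmem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pg_data_t: only the fields this sketch needs. */
struct node {
	int id;                    /* NUMA node number */
	unsigned long free_bytes;  /* memory still available on this node */
	struct node *next;
};

/* Head insertion, modelled on the unpatched pgdat_list handling. */
static void insert_head(struct node **list, struct node *n)
{
	n->next = *list;
	*list = n;
}

/* First-fit walk in list order, modelled on the for_each_pgdat loop.
 * Returns the id of the first node that can satisfy the request, or -1. */
static int first_fit(const struct node *list, unsigned long size)
{
	const struct node *n;

	for (n = list; n; n = n->next)
		if (n->free_bytes >= size)
			return n->id;
	return -1;
}

/* Register nodes 0, 1, 2 in ascending order, then allocate once. */
static int demo_head_insertion(void)
{
	static struct node nodes[3];
	struct node *list = NULL;
	int i;

	for (i = 0; i < 3; i++) {
		nodes[i].id = i;
		nodes[i].free_bytes = 1UL << 20;
		insert_head(&list, &nodes[i]);
	}
	/* The list now reads 2 -> 1 -> 0, so node 2 is tried first. */
	assert(list == &nodes[2]);
	return first_fit(list, 4096);
}
```

With every node having enough free memory, demo_head_insertion() picks the
node that was registered last, which on a real machine is the one with the
highest addresses.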
Anyway, I pondered a few solutions. The best one seems to be to just
reorder the list. I see that IA64 has some magic
code to do the same, but it looked so hackish that I didn't want
to duplicate it. So I just changed init_bootmem to insert at the tail.
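As a rough illustration of tail insertion, here is a userspace sketch. The
names struct pgnode, struct pgnode_list and append_tail() are invented for
this example; head and last only play the roles of pgdat_list and a tail
pointer. Note that the tail pointer has to advance on every insertion, not
only on the first one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pg_data_t; only the link and an id are modelled. */
struct pgnode {
	int id;
	struct pgnode *next;
};

struct pgnode_list {
	struct pgnode *head;  /* plays the role of pgdat_list */
	struct pgnode *last;  /* plays the role of a tail pointer */
};

/* Append at the tail so that walk order equals registration order. */
static void append_tail(struct pgnode_list *l, struct pgnode *n)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->head = n;
	l->last = n;  /* advance the tail on every insertion */
}

/* Register nodes 0, 1, 2 in ascending order; return the first node walked. */
static int demo_tail_insertion(void)
{
	static struct pgnode nodes[3];
	struct pgnode_list l = { NULL, NULL };
	int i;

	for (i = 0; i < 3; i++) {
		nodes[i].id = i;
		append_tail(&l, &nodes[i]);
	}
	/* All three nodes are reachable, in the order 0 -> 1 -> 2. */
	assert(l.head->next->next == &nodes[2]);
	return l.head->id;
}
```

A first-fit walk over this list starts at node 0, so low memory is tried
before anything else.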
I think the generic code doing for_each_pgdat is all OK and doesn't
care about the order, but several architectures do their own
for_each_pgdat() and they might in theory break.
If your architecture does funky things with for_each_pgdat, testing this
patch might be good. I plan to submit it when 2.6.14 opens.
-Andi
Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -61,9 +61,17 @@ static unsigned long __init init_bootmem
{
bootmem_data_t *bdata = pgdat->bdata;
unsigned long mapsize = ((end - start)+7)/8;
+ static struct pglist_data *pgdat_last;
- pgdat->pgdat_next = pgdat_list;
- pgdat_list = pgdat;
+ pgdat->pgdat_next = NULL;
+ /* Add new nodes last so that bootmem always starts
+ searching in the first nodes, not the last ones */
+ if (pgdat_last)
+ pgdat_last->pgdat_next = pgdat;
+ else {
+ pgdat_list = pgdat;
+ }
+ pgdat_last = pgdat;
mapsize = ALIGN(mapsize, sizeof(long));
bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);