Hi Mel,

> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.
>
> > I'm not sure if there is an easy way to fix this without penalising other
> > workloads though.
>
> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit, and if you are making forward progress, recheck the
> watermarks before continuing.
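Just to check I'm reading the suggestion right, I take it to mean something
roughly like the untested sketch below against mm/vmscan.c: spin on
ZONE_RECLAIM_LOCKED and bail out as soon as the concurrent reclaimer has
pushed the zone back over the watermark. low_wmark_pages() with zero
classzone_idx/alloc_flags is only a stand-in here; a real patch would need
the caller's watermark passed down from get_page_from_freelist().

	/*
	 * Untested sketch: instead of returning ZONE_RECLAIM_NOSCAN when
	 * another task holds ZONE_RECLAIM_LOCKED, wait for it and recheck
	 * the watermark while spinning.  low_wmark_pages()/0/0 stand in
	 * for the caller's mark, classzone_idx and alloc_flags.
	 */
	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) {
		cpu_relax();

		/* The other reclaimer already freed enough for us. */
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
				      0, 0))
			return ZONE_RECLAIM_SUCCESS;
	}

	ret = __zone_reclaim(zone, gfp_mask, order);
	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);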
Thanks to you and Christoph for some suggestions to try. Attached is a
chart showing the results of the following tests:

baseline.txt
    The current ppc64 default of zone_reclaim_mode = 0. As expected we see
    no change in remote node memory usage even after 10 iterations.

zone_reclaim_mode.txt
    Now we set zone_reclaim_mode = 1. On each iteration we continue to
    improve, but even after 10 runs of stream we still have over 10% remote
    node memory usage.

reclaim_4096_pages.txt
    Instead of reclaiming 32 pages at a time, we try for a much larger batch
    of 4096. The slope is much steeper, but it still takes around 6
    iterations to get almost all local node memory.

wait_on_busy_flag.txt
    Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
    we would need to check the GFP flags etc, but so far it looks the most
    promising: we only get a few percent of remote node memory on the first
    iteration and are all local node by the second.

Perhaps a combination of larger batch size and waiting on the busy flag is
the way to go? A rough sketch of that combination follows the two patches
below.

Anton
<<attachment: stream_test:_percentage_off_node_memory.png>>
--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
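For completeness, the combination mentioned above would look something like
this rough, untested sketch. The 4096 batch size is arbitrary and would
probably want to become a tunable, and as far as I can see !__GFP_WAIT
allocations already bail out earlier in zone_reclaim() before reaching this
point, but the GFP/latency side still needs the audit mentioned above.

	/* In __zone_reclaim(): reclaim in much larger batches. */
	struct scan_control sc = {
		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
		.may_swap = 1,
		.nr_to_reclaim = max_t(unsigned long, nr_pages, 4096),
		.gfp_mask = gfp_mask,
		.swappiness = vm_swappiness,
		.order = order,
	};

	/*
	 * In zone_reclaim(): wait for a concurrent reclaimer to finish
	 * instead of falling back to the next (remote) zone in the
	 * zonelist.
	 */
	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
		cpu_relax();

	ret = __zone_reclaim(zone, gfp_mask, order);
	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);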
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev