On Wed, 2015-03-25 at 10:54 +0000, Mel Gorman wrote:
> On Mon, Mar 23, 2015 at 04:46:21PM +0800, Huang Ying wrote:
> > > My attention is occupied by the automatic NUMA regression at the moment
> > > but I haven't forgotten this. Even with the high client count, I was not
> > > able to reproduce this so it appears to depend on the number of CPUs
> > > available to stress the allocator enough to bypass the per-cpu allocator
> > > enough to contend heavily on the zone lock. I'm hoping to think of a
> > > better alternative than adding more padding and increasing the cache
> > > footprint of the allocator but so far I haven't thought of a good
> > > alternative. Moving the lock to the end of the freelists would probably
> > > address the problem but still increases the footprint for order-0
> > > allocations by a cache line.
> >
> > Any update on this? Do you have some better idea? I guess this may be
> > fixed via putting some fields that are only read during order-0
> > allocation with the same cache line of lock, if there are any.
> >
>
> Sorry for the delay, the automatic NUMA regression took a long time to
> close and it potentially affected anybody with a NUMA machine, not just
> stress tests on large machines.
>
> Moving it beside other fields shifts the problems. The lock is related
> to the free areas so it really belongs nearby and from my own testing,
> it does not affect mid-sized machines. I'd rather not put the lock in its
> own cache line unless we have to. Can you try the following untested patch
> instead? It is untested but builds and should be safe.
>
> It'll increase the footprint of the page allocator but so would padding.
> It means it will contend with high-order free page breakups but that
> is not likely to happen during stress tests. It also collides with flags
> but they are relatively rarely updated.
>
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f279d9c158cd..2782df47101e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -474,16 +474,15 @@ struct zone {
>  	unsigned long		wait_table_bits;
>  
>  	ZONE_PADDING(_pad1_)
> -
> -	/* Write-intensive fields used from the page allocator */
> -	spinlock_t		lock;
> -
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
>  
>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> +	/* Write-intensive fields used from the page allocator */
> +	spinlock_t		lock;
> +
>  	ZONE_PADDING(_pad2_)
>  
>  	/* Write-intensive fields used by page reclaim */
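For reference, the cache-line interaction described above can also be seen
outside the kernel.  The sketch below is purely illustrative: the struct,
field names, sizes and the PAD_LOCK macro are invented for the example and
are not the kernel's struct zone or ZONE_PADDING().  It shows the same
pattern, though: one thread hammering a write-hot "lock" word slows down
readers of neighbouring read-mostly fields unless the lock sits on its own
cache line.

/*
 * Userspace false-sharing sketch (illustrative only, not kernel code).
 * Build:  gcc -O2 -pthread false-share.c            (lock shares a line)
 *         gcc -O2 -pthread -DPAD_LOCK false-share.c (lock on its own line)
 * Compare wall-clock time or `perf stat -e cache-misses` between the two.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NREADERS	3
#define ITERS		(100UL * 1000 * 1000)

struct zone_like {
	/* read-mostly fields consulted on the allocation fast path */
	unsigned long watermark[3];
	unsigned long lowmem_reserve[4];
#ifdef PAD_LOCK
	char pad[64];			/* crude analogue of ZONE_PADDING() */
#endif
	atomic_ulong lock_word;		/* stand-in for the write-hot zone->lock */
};

static _Alignas(64) struct zone_like z;
static atomic_int stop;

/* Hammer the "lock" word, keeping its cache line exclusively owned. */
static void *writer(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&stop, memory_order_relaxed))
		atomic_fetch_add_explicit(&z.lock_word, 1, memory_order_relaxed);
	return NULL;
}

/* Repeatedly read a read-mostly field, like an allocation fast path would. */
static void *reader(void *arg)
{
	volatile unsigned long *wm = &z.watermark[0];
	unsigned long sum = 0;
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		sum += *wm;
	*(unsigned long *)arg = sum;	/* hand the result back to main() */
	return NULL;
}

int main(void)
{
	pthread_t w, r[NREADERS];
	unsigned long sink[NREADERS];
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&w, NULL, writer, NULL);
	for (i = 0; i < NREADERS; i++)
		pthread_create(&r[i], NULL, reader, &sink[i]);
	for (i = 0; i < NREADERS; i++)
		pthread_join(r[i], NULL);
	atomic_store(&stop, 1);
	pthread_join(w, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("readers finished in %.2fs\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}

With the two fields on the same cache line the readers take noticeably
longer than with -DPAD_LOCK; exact numbers depend on the machine.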
Stress page allocator tests here show that performance is restored to its
previous level with the patch above.  I applied your patch on top of the
latest upstream kernel; the results are as below:

testbox/testcase/testparams: brickland1/aim7/performance-6000-page_test

c875f421097a55d9  dbdc458f1b7d07f32891509c06
----------------  --------------------------
         %stddev      %change          %stddev
             \           |                 \
     84568 ±  1%      +94.3%      164280 ±  1%  aim7.jobs-per-min
   2881944 ±  2%      -35.1%     1870386 ±  8%  aim7.time.voluntary_context_switches
       681 ±  1%       -3.4%         658 ±  0%  aim7.time.user_time
   5538139 ±  0%      -12.1%     4867884 ±  0%  aim7.time.involuntary_context_switches
     44174 ±  1%      -46.0%       23848 ±  1%  aim7.time.system_time
       426 ±  1%      -48.4%         219 ±  1%  aim7.time.elapsed_time
       426 ±  1%      -48.4%         219 ±  1%  aim7.time.elapsed_time.max
       468 ±  1%      -43.1%         266 ±  2%  uptime.boot
     13691 ±  0%      -24.2%       10379 ±  1%  softirqs.NET_RX
    931382 ±  2%      +24.9%     1163065 ±  1%  softirqs.RCU
    407717 ±  2%      -36.3%      259521 ±  9%  softirqs.SCHED
  19690372 ±  0%      -34.8%    12836548 ±  1%  softirqs.TIMER
      2442 ±  1%      -28.9%        1737 ±  5%  vmstat.procs.b
      3016 ±  3%      +19.4%        3603 ±  4%  vmstat.procs.r
    104330 ±  1%      +34.6%      140387 ±  0%  vmstat.system.in
     22172 ±  0%      +48.3%       32877 ±  2%  vmstat.system.cs
      1891 ± 12%      -48.2%         978 ± 10%  numa-numastat.node0.other_node
      1785 ± 14%      -47.7%         933 ±  6%  numa-numastat.node1.other_node
      1790 ± 12%      -47.8%         935 ± 10%  numa-numastat.node2.other_node
      1766 ± 14%      -47.0%         935 ± 12%  numa-numastat.node3.other_node
       426 ±  1%      -48.4%         219 ±  1%  time.elapsed_time.max
       426 ±  1%      -48.4%         219 ±  1%  time.elapsed_time
   5538139 ±  0%      -12.1%     4867884 ±  0%  time.involuntary_context_switches
     44174 ±  1%      -46.0%       23848 ±  1%  time.system_time
   2881944 ±  2%      -35.1%     1870386 ±  8%  time.voluntary_context_switches
   7831898 ±  4%      +31.8%    10325919 ±  5%  meminfo.Active
   7742498 ±  4%      +32.2%    10237222 ±  5%  meminfo.Active(anon)
   7231211 ±  4%      +28.7%     9308183 ±  5%  meminfo.AnonPages
  7.55e+11 ±  4%      +19.6%   9.032e+11 ±  8%  meminfo.Committed_AS
     14010 ±  1%      -17.4%       11567 ±  1%  meminfo.Inactive(anon)
    668946 ±  4%      +40.8%      941815 ± 27%  meminfo.PageTables
     15392 ±  1%      -15.9%       12945 ±  1%  meminfo.Shmem
      1185 ±  0%       -4.4%        1133 ±  0%  turbostat.Avg_MHz
      3.29 ±  6%      -64.5%        1.17 ± 14%  turbostat.CPU%c1
      0.10 ± 12%      -90.3%        0.01 ±  0%  turbostat.CPU%c3
      2.95 ±  3%      +73.9%        5.13 ±  3%  turbostat.CPU%c6
       743 ±  9%      -70.7%         217 ± 17%  turbostat.CorWatt
       300 ±  0%       -9.4%         272 ±  0%  turbostat.PKG_%
      1.58 ±  2%      +59.6%        2.53 ± 20%  turbostat.Pkg%pc2
       758 ±  9%      -69.3%         232 ± 16%  turbostat.PkgWatt
     15.08 ±  0%       +5.4%       15.90 ±  1%  turbostat.RAMWatt
    105729 ±  6%      -47.0%       56005 ± 25%  cpuidle.C1-IVT-4S.usage
 2.535e+08 ± 12%      -62.7%    94532092 ± 22%  cpuidle.C1-IVT-4S.time
 4.386e+08 ±  4%      -79.4%    90246312 ± 23%  cpuidle.C1E-IVT-4S.time
     83425 ±  6%      -71.7%       23571 ± 23%  cpuidle.C1E-IVT-4S.usage
     14237 ±  8%      -79.0%        2983 ± 19%  cpuidle.C3-IVT-4S.usage
 1.242e+08 ±  7%      -87.5%    15462238 ± 18%  cpuidle.C3-IVT-4S.time
     87857 ±  7%      -71.1%       25355 ±  5%  cpuidle.C6-IVT-4S.usage
 2.359e+09 ±  2%      -38.2%   1.458e+09 ±  2%  cpuidle.C6-IVT-4S.time
   1960460 ±  3%      +31.7%     2582336 ±  4%  proc-vmstat.nr_active_anon
      5548 ±  2%      +53.2%        8498 ±  3%  proc-vmstat.nr_alloc_batch
   1830492 ±  3%      +28.4%     2349846 ±  3%  proc-vmstat.nr_anon_pages
      3514 ±  1%      -17.7%        2893 ±  1%  proc-vmstat.nr_inactive_anon
    168712 ±  4%      +40.3%      236768 ± 27%  proc-vmstat.nr_page_table_pages
      3859 ±  1%      -16.1%        3238 ±  1%  proc-vmstat.nr_shmem
   1997823 ±  5%      -27.4%     1450005 ±  5%  proc-vmstat.numa_hint_faults
   1413076 ±  6%      -25.3%     1056268 ±  5%  proc-vmstat.numa_hint_faults_local
      7213 ±  6%      -47.3%        3799 ±  7%  proc-vmstat.numa_other
    406056 ±  3%      -41.9%      236064 ±  6%  proc-vmstat.numa_pages_migrated
   7242333 ±  3%      -29.2%     5130788 ± 10%  proc-vmstat.numa_pte_updates
    406056 ±  3%      -41.9%      236064 ±  6%  proc-vmstat.pgmigrate_success
    484141 ±  3%      +32.7%      642529 ±  5%  numa-vmstat.node0.nr_active_anon
 1.509e+08 ±  0%      -12.6%   1.319e+08 ±  3%  numa-vmstat.node0.numa_hit
    452041 ±  3%      +29.9%      587214 ±  5%  numa-vmstat.node0.nr_anon_pages
      1484 ±  1%      +36.5%        2026 ± 24%  numa-vmstat.node0.nr_alloc_batch
 1.509e+08 ±  0%      -12.6%   1.319e+08 ±  3%  numa-vmstat.node0.numa_local
    493672 ±  8%      +30.5%      644195 ± 11%  numa-vmstat.node1.nr_active_anon
      1481 ±  9%      +52.5%        2259 ±  8%  numa-vmstat.node1.nr_alloc_batch
    462466 ±  8%      +27.4%      589287 ± 10%  numa-vmstat.node1.nr_anon_pages
    485463 ±  6%      +29.1%      626539 ±  4%  numa-vmstat.node2.nr_active_anon
       422 ± 15%      -63.1%         156 ± 38%  numa-vmstat.node2.nr_inactive_anon
     32587 ±  9%      +71.0%       55722 ± 32%  numa-vmstat.node2.nr_page_table_pages
      1365 ±  5%      +68.7%        2303 ± 11%  numa-vmstat.node2.nr_alloc_batch
    453583 ±  6%      +26.1%      572097 ±  4%  numa-vmstat.node2.nr_anon_pages
 1.378e+08 ±  2%       -8.5%    1.26e+08 ±  2%  numa-vmstat.node3.numa_local
    441345 ± 10%      +28.4%      566740 ±  6%  numa-vmstat.node3.nr_anon_pages
 1.378e+08 ±  2%       -8.5%   1.261e+08 ±  2%  numa-vmstat.node3.numa_hit
    471252 ± 10%      +31.9%      621440 ±  7%  numa-vmstat.node3.nr_active_anon
      1359 ±  4%      +75.1%        2380 ± 16%  numa-vmstat.node3.nr_alloc_batch
   1826489 ±  0%      +30.0%     2375174 ±  4%  numa-meminfo.node0.AnonPages
   2774145 ±  8%      +26.1%     3497281 ±  9%  numa-meminfo.node0.MemUsed
   1962338 ±  0%      +32.5%     2599292 ±  4%  numa-meminfo.node0.Active(anon)
   1985987 ±  0%      +32.0%     2621356 ±  4%  numa-meminfo.node0.Active
   2768321 ±  6%      +27.7%     3534224 ± 11%  numa-meminfo.node1.MemUsed
   1935382 ±  5%      +34.2%     2597532 ± 11%  numa-meminfo.node1.Active
   1913696 ±  5%      +34.6%     2575266 ± 11%  numa-meminfo.node1.Active(anon)
   1784346 ±  6%      +31.7%     2349891 ± 10%  numa-meminfo.node1.AnonPages
      1678 ± 15%      -62.7%         625 ± 39%  numa-meminfo.node2.Inactive(anon)
   2532834 ±  4%      +27.4%     3227116 ±  8%  numa-meminfo.node2.MemUsed
    132885 ±  9%      +67.9%      223159 ± 32%  numa-meminfo.node2.PageTables
   2004439 ±  5%      +26.1%     2528019 ±  5%  numa-meminfo.node2.Active
   1856674 ±  5%      +23.0%     2283461 ±  5%  numa-meminfo.node2.AnonPages
   1981962 ±  5%      +26.4%     2505422 ±  5%  numa-meminfo.node2.Active(anon)
   1862203 ±  8%      +33.0%     2476954 ±  6%  numa-meminfo.node3.Active(anon)
   1883841 ±  7%      +32.6%     2498686 ±  6%  numa-meminfo.node3.Active
   2572461 ± 11%      +24.2%     3195556 ±  8%  numa-meminfo.node3.MemUsed
   1739646 ±  8%      +29.4%     2250696 ±  6%  numa-meminfo.node3.AnonPages

Best Regards,
Huang, Ying