On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> of memory per invocation, but only counts memory areas that have at
> least one PTE that is still present and not marked for numa hint faulting.
>
> It will skip over arbitarily large amounts of memory that are either
> unused, full of swap ptes, or full of PTEs that were already marked
> for NUMA hint faults but have not been faulted on yet.
>

This was deliberate and intended to cover a case whereby a process sparsely
using the address space would quickly skip over the sparse portions and
reach the active portions. Obviously you've found that this is not always
a great idea.

> This can cause excessive amounts of CPU use, due to there being
> essentially no upper limit on the scan rate of very large processes
> that are not yet in a phase where they are actively accessing old
> memory pages (eg. they are still initializing their data).
>
> Avoid that problem by placing an upper limit on the amount of virtual
> memory that task_numa_work scans in each invocation. This can be a
> higher limit than "pages", to ensure the task still skips over unused
> areas fairly quickly.
>
> While we are here, also fix the "nr_pte_updates" logic, so it only
> counts page ranges with ptes in them.
>
> Signed-off-by: Rik van Riel <[email protected]>
> Reported-by: Andrea Arcangeli <[email protected]>
> Reported-by: Jan Stancek <[email protected]>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..ff51b559ccaf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
>  	struct vm_area_struct *vma;
>  	unsigned long start, end;
>  	unsigned long nr_pte_updates = 0;
> -	long pages;
> +	long pages, virtpages;
>
>  	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>
> @@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
>  	start = mm->numa_scan_offset;
>  	pages = sysctl_numa_balancing_scan_size;
>  	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> +	virtpages = pages * 8;	   /* Scan up to this much virtual space */
>  	if (!pages)
>  		return;
>
> +
>  	down_read(&mm->mmap_sem);
>  	vma = find_vma(mm, start);
>  	if (!vma) {
> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
> -			nr_pte_updates += change_prot_numa(vma, start, end);
> +			nr_pte_updates = change_prot_numa(vma, start, end);
>

Are you *sure* about this particular change?

The intent is that sparse space be skipped until the first updated PTE is
found and then scan sysctl_numa_balancing_scan_size pages after that. With
this change, if we find a single PTE in the middle of a sparse space then
we stop updating pages in the nr_pte_updates check below. You get protected
from a lot of scanning by the virtpages check but it does not seem this
fix is necessary. It has an odd side-effect whereby we possibly scan more
with this patch in some cases.

>  			/*
> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * at least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped.
> +			 * Try to scan sysctl_numa_balancing_size worth of
> +			 * hpages that have at least one present PTE that
> +			 * is not already pte-numa. If the VMA contains
> +			 * areas that are unused or already full of prot_numa
> +			 * PTEs, scan up to virtpages, to skip through those
> +			 * areas faster.
>  			 */
>  			if (nr_pte_updates)
>  				pages -= (end - start) >> PAGE_SHIFT;
> +			virtpages -= (end - start) >> PAGE_SHIFT;
>

It's a pity there will potentially be a lot of useless dead scanning on
those processes but caching start addresses is both outside the scope of
this patch and has its own problems.

>  			start = end;
> -			if (pages <= 0)
> +			if (pages <= 0 || virtpages <= 0)
>  				goto out;
>
>  			cond_resched();

-- 
Mel Gorman
SUSE Labs
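
To make the accounting concern about "=" versus "+=" concrete, here is a
minimal userspace sketch of the budget logic in the quoted loop. It is not
kernel code. The names struct chunk, struct scan_result, scan() and the
"accumulate" flag are hypothetical, and it assumes fixed-size ranges,
whereas the real loop derives each range from the remaining "pages" budget
and HPAGE_SIZE. Both variants are given the virtpages limit so that only
the nr_pte_updates change differs between them.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of one loop iteration's range: its size in pages and
 * the value change_prot_numa() is assumed to return for it (0 for ranges
 * that are unused or already marked).
 */
struct chunk {
	long npages;
	unsigned long updates;
};

struct scan_result {
	long visited_pages;	/* virtual pages walked in this invocation */
	size_t chunks_visited;	/* loop iterations executed */
};

/*
 * accumulate != 0 models "nr_pte_updates += ..." (before the patch),
 * accumulate == 0 models "nr_pte_updates = ..."  (with the patch).
 */
static struct scan_result scan(const struct chunk *chunks, size_t n,
			       long pages, long virtpages, int accumulate)
{
	unsigned long nr_pte_updates = 0;
	struct scan_result r = { 0, 0 };
	size_t i;

	assert(chunks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (accumulate)
			nr_pte_updates += chunks[i].updates;
		else
			nr_pte_updates = chunks[i].updates;

		/* Only charge the "pages" budget when the counter is nonzero. */
		if (nr_pte_updates)
			pages -= chunks[i].npages;
		/* The virtual budget is charged for every range walked. */
		virtpages -= chunks[i].npages;

		r.visited_pages += chunks[i].npages;
		r.chunks_visited++;

		if (pages <= 0 || virtpages <= 0)
			break;
	}
	return r;
}
```

For a layout of 40 ranges of 16 pages each where only the first range has
an updated PTE, with pages = 64 and virtpages = 512, the accumulating
variant stops after 4 ranges (64 pages), while the assigning variant keeps
going until virtpages is spent, after 32 ranges (512 pages). That is one
way to read the "possibly scan more" side-effect under these assumptions.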

