remove ZERO_PAGE

Linux Kernel Mailing List Tue, 16 Oct 2007 11:04:14 -0700

Gitweb:     
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=557ed1fa2620dc119adb86b34c614e152a629a80
Commit:     557ed1fa2620dc119adb86b34c614e152a629a80
Parent:     aadb4bc4a1f9108c1d0fbd121827c936c2ed4217
Author:     Nick Piggin <[EMAIL PROTECTED]>
AuthorDate: Tue Oct 16 01:24:40 2007 -0700
Committer:  Linus Torvalds <[EMAIL PROTECTED]>
CommitDate: Tue Oct 16 09:42:53 2007 -0700


    remove ZERO_PAGE
    
    The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
    
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
      (and thus mapcounted and count towards shared rss).  These writes to
      the struct page could cause excessive cacheline bouncing on big
      systems.  There are a number of ways this could be addressed if it is
      an issue.
    
    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).
    
    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely
    
    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.
    
    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).
    
    As a sanity check -- mesuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.
    
    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.
    
    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.
    
    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
    much use to them except on benchmarks.  All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.
    
    Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus "snif" Torvalds <[EMAIL PROTECTED]>
---
 drivers/char/mem.c    |  125 ++++++-----------------------------------
 fs/binfmt_elf.c       |    2 +-
 fs/binfmt_elf_fdpic.c |    2 +-
 fs/direct-io.c        |    4 +-
 include/linux/mm.h    |    2 -
 mm/memory.c           |  151 +++++++------------------------------------------
 6 files changed, 42 insertions(+), 244 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index bbee97f..64551ab 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -625,65 +625,10 @@ static ssize_t splice_write_null(struct pipe_inode_info 
*pipe,struct file *out,
        return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
 }
 
-#ifdef CONFIG_MMU
-/*
- * For fun, we are using the MMU for this.
- */
-static inline size_t read_zero_pagealigned(char __user * buf, size_t size)
-{
-       struct mm_struct *mm;
-       struct vm_area_struct * vma;
-       unsigned long addr=(unsigned long)buf;
-
-       mm = current->mm;
-       /* Oops, this was forgotten before. -ben */
-       down_read(&mm->mmap_sem);
-
-       /* For private mappings, just map in zero pages. */
-       for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
-               unsigned long count;
-
-               if (vma->vm_start > addr || (vma->vm_flags & VM_WRITE) == 0)
-                       goto out_up;
-               if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
-                       break;
-               count = vma->vm_end - addr;
-               if (count > size)
-                       count = size;
-
-               zap_page_range(vma, addr, count, NULL);
-               if (zeromap_page_range(vma, addr, count, PAGE_COPY))
-                       break;
-
-               size -= count;
-               buf += count;
-               addr += count;
-               if (size == 0)
-                       goto out_up;
-       }
-
-       up_read(&mm->mmap_sem);
-       
-       /* The shared case is hard. Let's do the conventional zeroing. */ 
-       do {
-               unsigned long unwritten = clear_user(buf, PAGE_SIZE);
-               if (unwritten)
-                       return size + unwritten - PAGE_SIZE;
-               cond_resched();
-               buf += PAGE_SIZE;
-               size -= PAGE_SIZE;
-       } while (size);
-
-       return size;
-out_up:
-       up_read(&mm->mmap_sem);
-       return size;
-}
-
 static ssize_t read_zero(struct file * file, char __user * buf, 
                         size_t count, loff_t *ppos)
 {
-       unsigned long left, unwritten, written = 0;
+       size_t written;
 
        if (!count)
                return 0;
@@ -691,69 +636,33 @@ static ssize_t read_zero(struct file * file, char __user 
* buf,
        if (!access_ok(VERIFY_WRITE, buf, count))
                return -EFAULT;
 
-       left = count;
-
-       /* do we want to be clever? Arbitrary cut-off */
-       if (count >= PAGE_SIZE*4) {
-               unsigned long partial;
+       written = 0;
+       while (count) {
+               unsigned long unwritten;
+               size_t chunk = count;
 
-               /* How much left of the page? */
-               partial = (PAGE_SIZE-1) & -(unsigned long) buf;
-               unwritten = clear_user(buf, partial);
-               written = partial - unwritten;
-               if (unwritten)
-                       goto out;
-               left -= partial;
-               buf += partial;
-               unwritten = read_zero_pagealigned(buf, left & PAGE_MASK);
-               written += (left & PAGE_MASK) - unwritten;
+               if (chunk > PAGE_SIZE)
+                       chunk = PAGE_SIZE;      /* Just for latency reasons */
+               unwritten = clear_user(buf, chunk);
+               written += chunk - unwritten;
                if (unwritten)
-                       goto out;
-               buf += left & PAGE_MASK;
-               left &= ~PAGE_MASK;
-       }
-       unwritten = clear_user(buf, left);
-       written += left - unwritten;
-out:
-       return written ? written : -EFAULT;
-}
-
-static int mmap_zero(struct file * file, struct vm_area_struct * vma)
-{
-       int err;
-
-       if (vma->vm_flags & VM_SHARED)
-               return shmem_zero_setup(vma);
-       err = zeromap_page_range(vma, vma->vm_start,
-                       vma->vm_end - vma->vm_start, vma->vm_page_prot);
-       BUG_ON(err == -EEXIST);
-       return err;
-}
-#else /* CONFIG_MMU */
-static ssize_t read_zero(struct file * file, char * buf, 
-                        size_t count, loff_t *ppos)
-{
-       size_t todo = count;
-
-       while (todo) {
-               size_t chunk = todo;
-
-               if (chunk > 4096)
-                       chunk = 4096;   /* Just for latency reasons */
-               if (clear_user(buf, chunk))
-                       return -EFAULT;
+                       break;
                buf += chunk;
-               todo -= chunk;
+               count -= chunk;
                cond_resched();
        }
-       return count;
+       return written ? written : -EFAULT;
 }
 
 static int mmap_zero(struct file * file, struct vm_area_struct * vma)
 {
+#ifndef CONFIG_MMU
        return -ENOSYS;
+#endif
+       if (vma->vm_flags & VM_SHARED)
+               return shmem_zero_setup(vma);
+       return 0;
 }
-#endif /* CONFIG_MMU */
 
 static ssize_t write_full(struct file * file, const char __user * buf,
                          size_t count, loff_t *ppos)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index b1013f3..f3037c6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1725,7 +1725,7 @@ static int elf_core_dump(long signr, struct pt_regs 
*regs, struct file *file)
                                                &page, &vma) <= 0) {
                                DUMP_SEEK(PAGE_SIZE);
                        } else {
-                               if (page == ZERO_PAGE(addr)) {
+                               if (page == ZERO_PAGE(0)) {
                                        if (!dump_seek(file, PAGE_SIZE)) {
                                                page_cache_release(page);
                                                goto end_coredump;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 2f5d8db..c5ca2f0 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1488,7 +1488,7 @@ static int elf_fdpic_dump_segments(struct file *file, 
size_t *size,
                                           &page, &vma) <= 0) {
                                DUMP_SEEK(file->f_pos + PAGE_SIZE);
                        }
-                       else if (page == ZERO_PAGE(addr)) {
+                       else if (page == ZERO_PAGE(0)) {
                                page_cache_release(page);
                                DUMP_SEEK(file->f_pos + PAGE_SIZE);
                        }
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b5928a7..acf0da1 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -163,7 +163,7 @@ static int dio_refill_pages(struct dio *dio)
        up_read(&current->mm->mmap_sem);
 
        if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
-               struct page *page = ZERO_PAGE(dio->curr_user_address);
+               struct page *page = ZERO_PAGE(0);
                /*
                 * A memory fault, but the filesystem has some outstanding
                 * mapped blocks.  We need to use those blocks up to avoid
@@ -763,7 +763,7 @@ static void dio_zero_block(struct dio *dio, int end)
 
        this_chunk_bytes = this_chunk_blocks << dio->blkbits;
 
-       page = ZERO_PAGE(dio->curr_user_address);
+       page = ZERO_PAGE(0);
        if (submit_page_section(dio, page, 0, this_chunk_bytes, 
                                dio->next_block_for_io))
                return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 291c4cc..fbbc29a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -779,8 +779,6 @@ void free_pgtables(struct mmu_gather **tlb, struct 
vm_area_struct *start_vma,
                unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
                        struct vm_area_struct *vma);
-int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
-                       unsigned long size, pgprot_t prot);
 void unmap_mapping_range(struct address_space *mapping,
                loff_t const holebegin, loff_t const holelen, int even_cows);
 
diff --git a/mm/memory.c b/mm/memory.c
index f82b359..2a84308 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -966,7 +966,7 @@ no_page_table:
         * has touched so far, we don't want to allocate page tables.
         */
        if (flags & FOLL_ANON) {
-               page = ZERO_PAGE(address);
+               page = ZERO_PAGE(0);
                if (flags & FOLL_GET)
                        get_page(page);
                BUG_ON(flags & FOLL_WRITE);
@@ -1111,95 +1111,6 @@ int get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
-static int zeromap_pte_range(struct mm_struct *mm, pmd_t *pmd,
-                       unsigned long addr, unsigned long end, pgprot_t prot)
-{
-       pte_t *pte;
-       spinlock_t *ptl;
-       int err = 0;
-
-       pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
-       if (!pte)
-               return -EAGAIN;
-       arch_enter_lazy_mmu_mode();
-       do {
-               struct page *page = ZERO_PAGE(addr);
-               pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
-
-               if (unlikely(!pte_none(*pte))) {
-                       err = -EEXIST;
-                       pte++;
-                       break;
-               }
-               page_cache_get(page);
-               page_add_file_rmap(page);
-               inc_mm_counter(mm, file_rss);
-               set_pte_at(mm, addr, pte, zero_pte);
-       } while (pte++, addr += PAGE_SIZE, addr != end);
-       arch_leave_lazy_mmu_mode();
-       pte_unmap_unlock(pte - 1, ptl);
-       return err;
-}
-
-static inline int zeromap_pmd_range(struct mm_struct *mm, pud_t *pud,
-                       unsigned long addr, unsigned long end, pgprot_t prot)
-{
-       pmd_t *pmd;
-       unsigned long next;
-       int err;
-
-       pmd = pmd_alloc(mm, pud, addr);
-       if (!pmd)
-               return -EAGAIN;
-       do {
-               next = pmd_addr_end(addr, end);
-               err = zeromap_pte_range(mm, pmd, addr, next, prot);
-               if (err)
-                       break;
-       } while (pmd++, addr = next, addr != end);
-       return err;
-}
-
-static inline int zeromap_pud_range(struct mm_struct *mm, pgd_t *pgd,
-                       unsigned long addr, unsigned long end, pgprot_t prot)
-{
-       pud_t *pud;
-       unsigned long next;
-       int err;
-
-       pud = pud_alloc(mm, pgd, addr);
-       if (!pud)
-               return -EAGAIN;
-       do {
-               next = pud_addr_end(addr, end);
-               err = zeromap_pmd_range(mm, pud, addr, next, prot);
-               if (err)
-                       break;
-       } while (pud++, addr = next, addr != end);
-       return err;
-}
-
-int zeromap_page_range(struct vm_area_struct *vma,
-                       unsigned long addr, unsigned long size, pgprot_t prot)
-{
-       pgd_t *pgd;
-       unsigned long next;
-       unsigned long end = addr + size;
-       struct mm_struct *mm = vma->vm_mm;
-       int err;
-
-       BUG_ON(addr >= end);
-       pgd = pgd_offset(mm, addr);
-       flush_cache_range(vma, addr, end);
-       do {
-               next = pgd_addr_end(addr, end);
-               err = zeromap_pud_range(mm, pgd, addr, next, prot);
-               if (err)
-                       break;
-       } while (pgd++, addr = next, addr != end);
-       return err;
-}
-
 pte_t * fastcall get_locked_pte(struct mm_struct *mm, unsigned long addr, 
spinlock_t **ptl)
 {
        pgd_t * pgd = pgd_offset(mm, addr);
@@ -1717,16 +1628,11 @@ gotten:
 
        if (unlikely(anon_vma_prepare(vma)))
                goto oom;
-       if (old_page == ZERO_PAGE(address)) {
-               new_page = alloc_zeroed_user_highpage_movable(vma, address);
-               if (!new_page)
-                       goto oom;
-       } else {
-               new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
-               if (!new_page)
-                       goto oom;
-               cow_user_page(new_page, old_page, address, vma);
-       }
+       VM_BUG_ON(old_page == ZERO_PAGE(0));
+       new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+       if (!new_page)
+               goto oom;
+       cow_user_page(new_page, old_page, address, vma);
 
        /*
         * Re-check the pte - we dropped the lock
@@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, 
struct vm_area_struct *vma,
        spinlock_t *ptl;
        pte_t entry;
 
-       if (write_access) {
-               /* Allocate our own private page. */
-               pte_unmap(page_table);
-
-               if (unlikely(anon_vma_prepare(vma)))
-                       goto oom;
-               page = alloc_zeroed_user_highpage_movable(vma, address);
-               if (!page)
-                       goto oom;
-
-               entry = mk_pte(page, vma->vm_page_prot);
-               entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+       /* Allocate our own private page. */
+       pte_unmap(page_table);
 
-               page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-               if (!pte_none(*page_table))
-                       goto release;
-               inc_mm_counter(mm, anon_rss);
-               lru_cache_add_active(page);
-               page_add_new_anon_rmap(page, vma, address);
-       } else {
-               /* Map the ZERO_PAGE - vm_page_prot is readonly */
-               page = ZERO_PAGE(address);
-               page_cache_get(page);
-               entry = mk_pte(page, vma->vm_page_prot);
+       if (unlikely(anon_vma_prepare(vma)))
+               goto oom;
+       page = alloc_zeroed_user_highpage_movable(vma, address);
+       if (!page)
+               goto oom;
 
-               ptl = pte_lockptr(mm, pmd);
-               spin_lock(ptl);
-               if (!pte_none(*page_table))
-                       goto release;
-               inc_mm_counter(mm, file_rss);
-               page_add_file_rmap(page);
-       }
+       entry = mk_pte(page, vma->vm_page_prot);
+       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
+       page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+       if (!pte_none(*page_table))
+               goto release;
+       inc_mm_counter(mm, anon_rss);
+       lru_cache_add_active(page);
+       page_add_new_anon_rmap(page, vma, address);
        set_pte_at(mm, address, page_table, entry);
 
        /* No need to invalidate - it was non-present before */
-
To unsubscribe from this list: send the line "unsubscribe git-commits-head" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

remove ZERO_PAGE

Reply via email to