On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index ca99c0ecf52e..172f4d8e798d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -265,6 +265,7 @@ struct fault_env {
>>>     pmd_t *pmd;
>>>     pte_t *pte;
>>>     spinlock_t *ptl;
>>> +   pgtable_t prealloc_pte;
>>>  };
>>
>> If we're going to do this fault_env thing, we need some heavy-duty
>> comments on what the different fields do and what they mean.  We don't
>> want to get in to a situation where we're doing
>>
>>      void fault_foo(struct fault_env *fe);..
>>
>> and then we change the internals of fault_foo() to manipulate a
>> different set of fe->* variables, or change assumptions, then have
>> callers randomly break.
>>
>> One _nice_ part of passing all the arguments explicitly is that it makes
>> you go visit all the call sites and think about how the conventions change.
>>
>> It just makes me nervous.
>>
>> The semantics of having both a ->pte and ->pmd need to be very clearly
>> spelled out too, please.
> 
> I've updated this to:
> 
> /*
>  * Page fault context: passes though page fault handler instead of endless 
> list
>  * of function arguments.
>  */
> struct fault_env {
>       struct vm_area_struct *vma;     /* Target VMA */
>       unsigned long address;          /* Faulting virtual address */
>       unsigned int flags;             /* FAULT_FLAG_xxx flags */
>       pmd_t *pmd;                     /* Pointer to pmd entry matching
>                                        * the 'address'
>                                        */

Is this just for huge PMDs, or does it also cover normal PMDs pointing
to PTE pages?  Is it populated every time we're at or below the PMD
during a fault?  Is it always valid?

>       pte_t *pte;                     /* Pointer to pte entry matching
>                                        * the 'address'. NULL if the page
>                                        * table hasn't been allocated.
>                                        */

What's the relationship between pmd and pte?  Can both be set at the
same time, etc...?

>       spinlock_t *ptl;                /* Page table lock.
>                                        * Protects pte page table if 'pte'
>                                        * is not NULL, otherwise pmd.
>                                        */

Are there any rules for callers when a callee puts a value in here?

>       pgtable_t prealloc_pte;         /* Pre-allocated pte page table.
>                                        * vm_ops->map_pages() calls
>                                        * do_set_pte() from atomic context.
>                                        * do_fault_around() pre-allocates
>                                        * page table to avoid allocation from
>                                        * atomic context.
>                                        */
> };

Who's responsible for freeing this and when?

>>>  /*
>>> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
>>> vm_area_struct *vma)
>>>     return pte;
>>>  }
>>>  
>>> -void do_set_pte(struct fault_env *fe, struct page *page);
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +           struct page *page);
>>>  #endif
>>
>> I think do_set_pte() might be due for a new name if it's going to be
>> doing allocations internally.
> 
> Any suggestions?

alloc_set_pte() is probably fine.  Just make it clear early in some
comments that the allocation is conditional.

>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 28b3875969a8..ba8150d6dc33 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
>>>                     start_pgoff) {
>>>             if (iter.index > end_pgoff)
>>>                     break;
>>> -           fe->pte += iter.index - last_pgoff;
>>> -           fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> -           last_pgoff = iter.index;
>>> -           if (!pte_none(*fe->pte))
>>> -                   goto next;
>>>  repeat:
>>>             page = radix_tree_deref_slot(slot);
>>>             if (unlikely(!page))
>>> @@ -2187,7 +2182,17 @@ repeat:
>>>  
>>>             if (file->f_ra.mmap_miss > 0)
>>>                     file->f_ra.mmap_miss--;
>>> -           do_set_pte(fe, page);
>>> +
>>> +           fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> +           if (fe->pte)
>>> +                   fe->pte += iter.index - last_pgoff;
>>> +           last_pgoff = iter.index;
>>> +           if (do_set_pte(fe, NULL, page)) {
>>> +                   /* failed to setup page table: giving up */
>>> +                   if (!fe->pte)
>>> +                           break;
>>> +                   goto unlock;
>>> +           }
>>
>> What's the failure here, though?
> 
> At this point in patchset it never fails: allocation failure is not
> possible as we pre-allocate page table for faularound.
> 
> Later after do_set_pmd() is introduced, huge page can be mapped here. By
> us or under us.
> 
> I'll update comment.

So why check the return value of do_set_pte()?  Why can it return nonzero?

>> This also throws away the spiffy new error code that comes baqck from
>> do_set_pte().  Is that OK?
> 
> Yes. We will try harder in do_read_fault() once faultaround code failed to
> solve the page fault with all proper locks and error handling.

OK, I hope the new comment addresses this.

>>> +   /*
>>> +    * Use __pte_alloc instead of pte_alloc_map, because we can't
>>> +    * run pte_offset_map on the pmd, if an huge pmd could
>>> +    * materialize from under us from a different thread.
>>> +    */
>>
>> This comment is a little bit funky.  Maybe:
>>
>> "Use __pte_alloc() instead of pte_alloc_map().  We can't run
>> pte_offset_map() on pmds where a huge pmd might be created (from a
>> different thread)."
>>
>> Could you also talk a bit about where it _is_ safe to call pte_alloc_map()?
> 
> That comment was just moved from __handle_mm_fault().
> 
> Would this be okay:
> 
>         /*
>          * Use __pte_alloc() instead of pte_alloc_map().  We can't run
>          * pte_offset_map() on pmds where a huge pmd might be created (from
>          * a different thread).
>          *
>          * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
>          * parallel threads are excluded by other means.
>          */

Yep, that looks good.  Just make sure to make it clear that mmap_sem
isn't held in *this* context.

>>> +   if (unlikely(pmd_none(*fe->pmd) &&
>>> +                   __pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
>>> +           return VM_FAULT_OOM;
>>
>> Should we just move this pmd_none() check in to __pte_alloc()?  You do
>> this same-style check at least twice.
> 
> We have it there. The check here is speculative to avoid taking ptl.
> 
>>> +   /* If an huge pmd materialized from under us just retry later */
>>> +   if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +           return 0;
>>
>> Nit: please stop sprinkling unlikely() everywhere.  Is there some
>> concrete benefit to doing it here?  I really doubt the compiler needs
>> help putting the code for "return 0" out-of-line.
>>
>> Why is it important to abort here?  Is this a small-page-only path?
> 
> This unlikely() was moved from __handle_mm_fault(). I didn't put much
> consideration in it.
>  
>>> +static int pte_alloc_one_map(struct fault_env *fe)
>>> +{
>>> +   struct vm_area_struct *vma = fe->vma;
>>> +
>>> +   if (!pmd_none(*fe->pmd))
>>> +           goto map_pte;
>>
>> So the calling convention here is...?  It looks like this has to be
>> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
>> pte page.  This can never be called on a huge pmd.  Right?
> 
> It's not under ptl, so pmd can be filled under us. There's huge pmd check in
> 'map_pte' goto path.
>  
>>> +   if (fe->prealloc_pte) {
>>> +           smp_wmb(); /* See comment in __pte_alloc() */
>>
>> Are we trying to make *this* cpu's write visible, or to see the write
>> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
>> smp_rmb() what we want for that?
> 
> See 362a61ad6119.
> 
> I think more logical way would be to put it into do_fault_around(), just after
> pte_alloc_one().
>  
>>> +           fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
>>> +           if (unlikely(!pmd_none(*fe->pmd))) {
>>> +                   spin_unlock(fe->ptl);
>>> +                   goto map_pte;
>>> +           }
>>
>> Should we just make pmd_none() likely()?  That seems like it would save
>> about 20MB of unlikely()'s in the source.
> 
> Heh.
> 
>>> +           atomic_long_inc(&vma->vm_mm->nr_ptes);
>>> +           pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
>>> +           spin_unlock(fe->ptl);
>>> +           fe->prealloc_pte = 0;
>>> +   } else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
>>> +                                   fe->address))) {
>>> +           return VM_FAULT_OOM;
>>> +   }
>>> +map_pte:
>>> +   if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +           return VM_FAULT_NOPAGE;
>>
>> I think I need a refresher on the locking rules.  pte_offset_map*() is
>> unsafe to call on a huge pmd.  What in this context makes it impossible
>> for the pmd to get promoted after the check?
> 
> Do you mean what stops pte page table to collapsed into huge pmd?
> The answer is mmap_sem. Collapse code takes the lock on write to be able to
> retract page table.
>  
>>> +   fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> +                   &fe->ptl);
>>> +   return 0;
>>> +}
>>> +
>>>  /**
>>>   * do_set_pte - setup new PTE entry for given page and add reverse page 
>>> mapping.
>>>   *
>>>   * @fe: fault environment
>>> + * @memcg: memcg to charge page (only for private mappings)
>>>   * @page: page to map
>>>   *
>>> - * Caller must hold page table lock relevant for @fe->pte.
>>
>> That's a bit screwy now because fe->pte might not exist.  Right?  I
> 
> [ you're commenting deleted line ]
> 
> Right.
> 
>> thought the ptl was derived from the physical address of the pte page.
>> How can we have a lock for a physical address that doesn't exist yet?
> 
> If fe->pte is NULL, pte_alloc_one_map() would take care about allocation, map
> and lock the page table.
>  
>>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on 
>>> return.
>>>   *
>>>   * Target users are page handler itself and implementations of
>>>   * vm_ops->map_pages.
>>>   */
>>> -void do_set_pte(struct fault_env *fe, struct page *page)
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +           struct page *page)
>>>  {
>>>     struct vm_area_struct *vma = fe->vma;
>>>     bool write = fe->flags & FAULT_FLAG_WRITE;
>>>     pte_t entry;
>>>  
>>> +   if (!fe->pte) {
>>> +           int ret = pte_alloc_one_map(fe);
>>> +           if (ret)
>>> +                   return ret;
>>> +   }
>>> +
>>> +   if (!pte_none(*fe->pte))
>>> +           return VM_FAULT_NOPAGE;
>>
>> Oh, you've got to add another pte_none() check because you're deferring
>> the acquisition of the ptl lock?
> 
> Yes, we need to re-check once ptl is taken.
> 
>>>     flush_icache_page(vma, page);
>>>     entry = mk_pte(page, vma->vm_page_prot);
>>>     if (write)
>>> @@ -2811,6 +2864,8 @@ void do_set_pte(struct fault_env *fe, struct page 
>>> *page)
>>>     if (write && !(vma->vm_flags & VM_SHARED)) {
>>>             inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>>>             page_add_new_anon_rmap(page, vma, fe->address, false);
>>> +           mem_cgroup_commit_charge(page, memcg, false, false);
>>> +           lru_cache_add_active_or_unevictable(page, vma);
>>>     } else {
>>>             inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>>>             page_add_file_rmap(page);
>>> @@ -2819,6 +2874,8 @@ void do_set_pte(struct fault_env *fe, struct page 
>>> *page)
>>>  
>>>     /* no need to invalidate: a not-present page won't be cached */
>>>     update_mmu_cache(vma, fe->address, fe->pte);
>>> +
>>> +   return 0;
>>>  }
>>>  
>>>  static unsigned long fault_around_bytes __read_mostly =
>>> @@ -2885,19 +2942,17 @@ late_initcall(fault_around_debugfs);
>>>   * fault_around_pages() value (and therefore to page order).  This way it's
>>>   * easier to guarantee that we don't cross page table boundaries.
>>>   */
>>> -static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>>> +static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>>>  {
>>> -   unsigned long address = fe->address, start_addr, nr_pages, mask;
>>> -   pte_t *pte = fe->pte;
>>> +   unsigned long address = fe->address, nr_pages, mask;
>>>     pgoff_t end_pgoff;
>>> -   int off;
>>> +   int off, ret = 0;
>>>  
>>>     nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
>>>     mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
>>>  
>>> -   start_addr = max(fe->address & mask, fe->vma->vm_start);
>>> -   off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>>> -   fe->pte -= off;
>>> +   fe->address = max(address & mask, fe->vma->vm_start);
>>> +   off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>>>     start_pgoff -= off;
>>
>> Considering what's in this patch already, I think I'd leave the trivial
>> local variable replacement for another patch.
> 
> fe->address is not a local variable: it get passed into map_pages.
> 
>>>     /*
>>> @@ -2905,30 +2960,33 @@ static void do_fault_around(struct fault_env *fe, 
>>> pgoff_t start_pgoff)
>>>      *  or fault_around_pages() from start_pgoff, depending what is nearest.
>>>      */
>>>     end_pgoff = start_pgoff -
>>> -           ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
>>> +           ((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
>>>             PTRS_PER_PTE - 1;
>>>     end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
>>>                     start_pgoff + nr_pages - 1);
>>>  
>>> -   /* Check if it makes any sense to call ->map_pages */
>>> -   fe->address = start_addr;
>>> -   while (!pte_none(*fe->pte)) {
>>> -           if (++start_pgoff > end_pgoff)
>>> -                   goto out;
>>> -           fe->address += PAGE_SIZE;
>>> -           if (fe->address >= fe->vma->vm_end)
>>> -                   goto out;
>>> -           fe->pte++;
>>> +   if (pmd_none(*fe->pmd))
>>> +           fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
>>> +   fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +   if (fe->prealloc_pte) {
>>> +           pte_free(fe->vma->vm_mm, fe->prealloc_pte);
>>> +           fe->prealloc_pte = 0;
>>>     }
>>> +   if (!fe->pte)
>>> +           goto out;
>>
>> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
>> present?
> 
> Huge pmd is mapped.
> 
> Comment added.
> 
>>> -   fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +   /* check if the page fault is solved */
>>> +   fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
>>> +   if (!pte_none(*fe->pte))
>>> +           ret = VM_FAULT_NOPAGE;
>>> +   pte_unmap_unlock(fe->pte, fe->ptl);
>>>  out:
>>> -   /* restore fault_env */
>>> -   fe->pte = pte;
>>>     fe->address = address;
>>> +   fe->pte = NULL;
>>> +   return ret;
>>>  }
>>>  
>>> -static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t 
>>> orig_pte)
>>> +static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>     struct vm_area_struct *vma = fe->vma;
>>>     struct page *fault_page;
>>> @@ -2940,33 +2998,25 @@ static int do_read_fault(struct fault_env *fe, 
>>> pgoff_t pgoff, pte_t orig_pte)
>>>      * something).
>>>      */
>>>     if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
>>> -           fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -                           &fe->ptl);
>>> -           do_fault_around(fe, pgoff);
>>> -           if (!pte_same(*fe->pte, orig_pte))
>>> -                   goto unlock_out;
>>> -           pte_unmap_unlock(fe->pte, fe->ptl);
>>> +           ret = do_fault_around(fe, pgoff);
>>> +           if (ret)
>>> +                   return ret;
>>>     }
>>>  
>>>     ret = __do_fault(fe, pgoff, NULL, &fault_page);
>>>     if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>>             return ret;
>>>  
>>> -   fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, 
>>> &fe->ptl);
>>> -   if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +   ret |= do_set_pte(fe, NULL, fault_page);
>>> +   if (fe->pte)
>>>             pte_unmap_unlock(fe->pte, fe->ptl);
>>> -           unlock_page(fault_page);
>>> -           page_cache_release(fault_page);
>>> -           return ret;
>>> -   }
>>> -   do_set_pte(fe, fault_page);
>>>     unlock_page(fault_page);
>>> -unlock_out:
>>> -   pte_unmap_unlock(fe->pte, fe->ptl);
>>> +   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>> +           page_cache_release(fault_page);
>>>     return ret;
>>>  }
>>>  
>>> -static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t 
>>> orig_pte)
>>> +static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>     struct vm_area_struct *vma = fe->vma;
>>>     struct page *fault_page, *new_page;
>>> @@ -2994,26 +3044,9 @@ static int do_cow_fault(struct fault_env *fe, 
>>> pgoff_t pgoff, pte_t orig_pte)
>>>             copy_user_highpage(new_page, fault_page, fe->address, vma);
>>>     __SetPageUptodate(new_page);
>>>  
>>> -   fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -                   &fe->ptl);
>>> -   if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +   ret |= do_set_pte(fe, memcg, new_page);
>>> +   if (fe->pte)
>>>             pte_unmap_unlock(fe->pte, fe->ptl);
>>> -           if (fault_page) {
>>> -                   unlock_page(fault_page);
>>> -                   page_cache_release(fault_page);
>>> -           } else {
>>> -                   /*
>>> -                    * The fault handler has no page to lock, so it holds
>>> -                    * i_mmap_lock for read to protect against truncate.
>>> -                    */
>>> -                   i_mmap_unlock_read(vma->vm_file->f_mapping);
>>> -           }
>>> -           goto uncharge_out;
>>> -   }
>>> -   do_set_pte(fe, new_page);
>>> -   mem_cgroup_commit_charge(new_page, memcg, false, false);
>>> -   lru_cache_add_active_or_unevictable(new_page, vma);
>>> -   pte_unmap_unlock(fe->pte, fe->ptl);
>>>     if (fault_page) {
>>>             unlock_page(fault_page);
>>>             page_cache_release(fault_page);
>>> @@ -3024,6 +3057,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t 
>>> pgoff, pte_t orig_pte)
>>>              */
>>>             i_mmap_unlock_read(vma->vm_file->f_mapping);
>>>     }
>>> +   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>> +           goto uncharge_out;
>>>     return ret;
>>>  uncharge_out:
>>>     mem_cgroup_cancel_charge(new_page, memcg, false);
>>> @@ -3031,7 +3066,7 @@ uncharge_out:
>>>     return ret;
>>>  }
>>>  
>>> -static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t 
>>> orig_pte)
>>> +static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>     struct vm_area_struct *vma = fe->vma;
>>>     struct page *fault_page;
>>> @@ -3057,16 +3092,15 @@ static int do_shared_fault(struct fault_env *fe, 
>>> pgoff_t pgoff, pte_t orig_pte)
>>>             }
>>>     }
>>>  
>>> -   fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -                   &fe->ptl);
>>> -   if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +   ret |= do_set_pte(fe, NULL, fault_page);
>>> +   if (fe->pte)
>>>             pte_unmap_unlock(fe->pte, fe->ptl);
>>> +   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
>>> +                                   VM_FAULT_RETRY))) {
>>>             unlock_page(fault_page);
>>>             page_cache_release(fault_page);
>>>             return ret;
>>>     }
>>> -   do_set_pte(fe, fault_page);
>>> -   pte_unmap_unlock(fe->pte, fe->ptl);
>>>  
>>>     if (set_page_dirty(fault_page))
>>>             dirtied = 1;
>>> @@ -3098,21 +3132,19 @@ static int do_shared_fault(struct fault_env *fe, 
>>> pgoff_t pgoff, pte_t orig_pte)
>>>   * The mmap_sem may have been released depending on flags and our
>>>   * return value.  See filemap_fault() and __lock_page_or_retry().
>>>   */
>>> -static int do_fault(struct fault_env *fe, pte_t orig_pte)
>>> +static int do_fault(struct fault_env *fe)
>>>  {
>>>     struct vm_area_struct *vma = fe->vma;
>>> -   pgoff_t pgoff = (((fe->address & PAGE_MASK)
>>> -                   - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>>> +   pgoff_t pgoff = linear_page_index(vma, fe->address);
>>
>> Looks like another trivial cleanup.
> 
> Okay, I'll move it into separate patch.
> 
>>> -   pte_unmap(fe->pte);
>>>     /* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
>>>     if (!vma->vm_ops->fault)
>>>             return VM_FAULT_SIGBUS;
>>>     if (!(fe->flags & FAULT_FLAG_WRITE))
>>> -           return do_read_fault(fe, pgoff, orig_pte);
>>> +           return do_read_fault(fe, pgoff);
>>>     if (!(vma->vm_flags & VM_SHARED))
>>> -           return do_cow_fault(fe, pgoff, orig_pte);
>>> -   return do_shared_fault(fe, pgoff, orig_pte);
>>> +           return do_cow_fault(fe, pgoff);
>>> +   return do_shared_fault(fe, pgoff);
>>>  }
>>>  
>>>  static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>> @@ -3252,37 +3284,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t 
>>> orig_pmd)
>>>   * with external mmu caches can use to update those (ie the Sparc or
>>>   * PowerPC hashed page tables that act as extended TLBs).
>>>   *
>>> - * We enter with non-exclusive mmap_sem (to exclude vma changes,
>>> - * but allow concurrent faults), and pte mapped but not yet locked.
>>> - * We return with pte unmapped and unlocked.
>>> + * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
>>> + * concurrent faults).
>>>   *
>>> - * The mmap_sem may have been released depending on flags and our
>>> - * return value.  See filemap_fault() and __lock_page_or_retry().
>>> + * The mmap_sem may have been released depending on flags and our return 
>>> value.
>>> + * See filemap_fault() and __lock_page_or_retry().
>>>   */
>>>  static int handle_pte_fault(struct fault_env *fe)
>>>  {
>>>     pte_t entry;
>>>  
>>> +   /* If an huge pmd materialized from under us just retry later */
>>> +   if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +           return 0;
>>> +
>>> +   if (unlikely(pmd_none(*fe->pmd))) {
>>> +           /*
>>> +            * Leave __pte_alloc() until later: because vm_ops->fault may
>>> +            * want to allocate huge page, and if we expose page table
>>> +            * for an instant, it will be difficult to retract from
>>> +            * concurrent faults and from rmap lookups.
>>> +            */
>>> +   } else {
>>> +           /*
>>> +            * A regular pmd is established and it can't morph into a huge
>>> +            * pmd from under us anymore at this point because we hold the
>>> +            * mmap_sem read mode and khugepaged takes it in write mode.
>>> +            * So now it's safe to run pte_offset_map().
>>> +            */
>>> +           fe->pte = pte_offset_map(fe->pmd, fe->address);
>>> +
>>> +           entry = *fe->pte;
>>> +           barrier();
>>
>> Barrier because....?
>>
>>> +           if (pte_none(entry)) {
>>> +                   pte_unmap(fe->pte);
>>> +                   fe->pte = NULL;
>>> +           }
>>> +   }
>>> +
>>>     /*
>>>      * some architectures can have larger ptes than wordsize,
>>>      * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
>>>      * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
>>> -    * The code below just needs a consistent view for the ifs and
>>> +    * The code above just needs a consistent view for the ifs and
>>>      * we later double check anyway with the ptl lock held. So here
>>>      * a barrier will do.
>>>      */
>>
>> Looks like the barrier got moved, but not the comment.
> 
> Moved.
> 
>> Man, that's a lot of code.
> 
> Yeah. I don't see a sensible way to split it. :-/
> 

Reply via email to