On Thu, Sep 10, 2015 at 12:54:08PM +0200, Vlastimil Babka wrote:
> On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> >     CPU0                                    CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> >                                     put_page()
> >                                       tail->first_page = NULL
> >       head = tail->first_page
> >                                     alloc_pages(__GFP_COMP)
> >                                        prep_compound_page()
> >                                          tail->first_page = head
> >                                          __SetPageTail(p);
> >       !!PageTail() == true
> >     <head == NULL dereferencing>
> > 
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> > 
> > We can fix the race by changing how encode PageTail() and compound_head()
> > within struct page to be able to update them in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
> > the rest bits are pointer to head page if bit zero is set.
> > 
> > The patch moves page->pmd_huge_pte out of word, just in case if an
> > architecture defines pgtable_t into something what can have the bit 0
> > set.
> > 
> > hugetlb_cgroup uses page->lru.next in the second tail page to store
> > pointer struct hugetlb_cgroup. The patch switch it to use page->private
> > in the second tail page instead. The space is free since ->first_page is
> > removed from the union.
> > 
> > The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
> > limitation, since there's now space in first tail page to store struct
> > hugetlb_cgroup pointer. But that's out of scope of the patch.
> > 
> > That means page->compound_head shares storage space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> > 
> > That's too long list to be absolutely sure, but looks like nobody uses
> > bit 0 of the word.
> 
> Given the discussion about rcu_head, that should warrant some summary here :)

Agreed.

> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -120,7 +120,13 @@ struct page {
> >             };
> >     };
> >  
> > -   /* Third double word block */
> > +   /*
> > +    * Third double word block
> > +    *
> > +    * WARNING: bit 0 of the first word encode PageTail(). That means
> > +    * the rest users of the storage space MUST NOT use the bit to
> > +    * avoid collision and false-positive PageTail().
> > +    */
> >     union {
> >             struct list_head lru;   /* Pageout list, eg. active_list
> >                                      * protected by zone->lru_lock !
> > @@ -143,12 +149,19 @@ struct page {
> >                                              */
> >             /* First tail page of compound page */
> 
> "First tail" doesn't apply for compound_head.

I'll adjust comments.
 
> > index 097c7a4bfbd9..330377f83ac7 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page 
> > *page,
> >                                   (1L << PG_unevictable)));
> >             page_tail->flags |= (1L << PG_dirty);
> >  
> > -           /* clear PageTail before overwriting first_page */
> > -           smp_wmb();
> > +           clear_compound_head(page_tail);
> 
> I would sleep better if this was done before setting all the page->flags 
> above,
> previously, PageTail was cleared by the first operation which is
> "page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;"
> I do realize that it doesn't use WRITE_ONCE, so it might have been 
> theoretically
> broken already, if it does matter.

Right. Nothing enforces particular order. If we really need to have some
ordering on PageTail() vs. page->flags let's define it, but so far I
don't see a reason to change this part.

> > diff --git a/mm/internal.h b/mm/internal.h
> > index 36b23f1e2ca6..89e21a07080a 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
> >      * speculative page access (like in
> >      * page_cache_get_speculative()) on tail pages.
> >      */
> > -   VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> > +   VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
> >     if (get_page_head)
> > -           atomic_inc(&page->first_page->_count);
> > +           atomic_inc(&compound_head(page)->_count);
> 
> Doing another compound_head() seems like overkill when this code already 
> assumes
> PageTail.

"Overkill"? It's too strong wording for re-read hot cache line.

> All callers do it after if (PageTail()) which means they already did
> READ_ONCE(page->compound_head) and here they do another one. Potentially with
> different result in bit 0, which would be a subtle bug, that could be
> interesting to catch with some VM_BUG_ON. I don't know if a direct plain
> page->compound_head access here instead of compound_head() would also result 
> in
> a re-read, since the previous access did use READ_ONCE(). Maybe it would be 
> best
> to reorganize the code here and in the 3 call sites so that the READ_ONCE() 
> used
> to determine PageTail also obtains the compound head pointer.

All we would possbily win by the change is few bytes in code. Additional
READ_ONCE() only affect code generation. It doesn't introduce any cpu
barriers. The cache line with compound_head is in L1 anyway.

I don't see justification to change this part too. If you think we can
gain something by reworking this code, feel free to propose patch on top.

> Some of that is probably made moot by your other series, but better let's 
> think
> of this series as standalone first.
> 
> >     get_huge_page_tail(page);
> >  }
> >  
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 1f4446a90cef..4d1a5de9653d 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long 
> > pfn)
> >  #define lru                (1UL << PG_lru)
> >  #define swapbacked (1UL << PG_swapbacked)
> >  #define head               (1UL << PG_head)
> > -#define tail               (1UL << PG_tail)
> > -#define compound   (1UL << PG_compound)
> >  #define slab               (1UL << PG_slab)
> >  #define reserved   (1UL << PG_reserved)
> >  
> > @@ -811,12 +809,7 @@ static struct page_state {
> >      */
> >     { slab,         slab,           MF_MSG_SLAB,    me_kernel },
> >  
> > -#ifdef CONFIG_PAGEFLAGS_EXTENDED
> >     { head,         head,           MF_MSG_HUGE,            me_huge_page },
> > -   { tail,         tail,           MF_MSG_HUGE,            me_huge_page },
> > -#else
> > -   { compound,     compound,       MF_MSG_HUGE,            me_huge_page },
> > -#endif
> >  
> >     { sc|dirty,     sc|dirty,       MF_MSG_DIRTY_SWAPCACHE, 
> > me_swapcache_dirty },
> >     { sc|dirty,     sc,             MF_MSG_CLEAN_SWAPCACHE, 
> > me_swapcache_clean },
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c6733cc3cbce..a56ad53ff553 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -424,15 +424,15 @@ out:
> >  /*
> >   * Higher-order pages are called "compound pages".  They are structured 
> > thusly:
> >   *
> > - * The first PAGE_SIZE page is called the "head page".
> > + * The first PAGE_SIZE page is called the "head page" and have PG_head set.
> >   *
> > - * The remaining PAGE_SIZE pages are called "tail pages".
> > + * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is 
> > encoded
> > + * in bit 0 of page->compound_head. The rest of bits is pointer to head 
> > page.
> >   *
> > - * All pages have PG_compound set.  All tail pages have their ->first_page
> > - * pointing at the head page.
> > + * The first tail page's ->compound_dtor holds the offset in array of 
> > compound
> > + * page destructors. See compound_page_dtors.
> >   *
> > - * The first tail page's ->lru.next holds the address of the compound 
> > page's
> > - * put_page() function.  Its ->lru.prev holds the order of allocation.
> > + * The first tail page's ->compound_order holds the order of allocation.
> >   * This usage means that zero-order pages may not be compound.
> >   */
> >  
> > @@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned 
> > long order)
> >     for (i = 1; i < nr_pages; i++) {
> >             struct page *p = page + i;
> >             set_page_count(p, 0);
> > -           p->first_page = page;
> > -           /* Make sure p->first_page is always valid for PageTail() */
> > -           smp_wmb();
> > -           __SetPageTail(p);
> > +           set_compound_head(p, page);
> >     }
> >  }
> >  
> > @@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone,
> >  
> >  static int free_tail_pages_check(struct page *head_page, struct page *page)
> >  {
> > -   if (!IS_ENABLED(CONFIG_DEBUG_VM))
> > -           return 0;
> > +   int ret = 1;
> > +
> > +   /*
> > +    * We rely page->lru.next never has bit 0 set, unless the page
> > +    * is PageTail(). Let's make sure that's true even for poisoned ->lru.
> > +    */
> > +   BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
> > +
> > +   if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
> > +           ret = 0;
> > +           goto out;
> > +   }
> >     if (unlikely(!PageTail(page))) {
> >             bad_page(page, "PageTail not set", 0);
> > -           return 1;
> > +           goto out;
> >     }
> > -   if (unlikely(page->first_page != head_page)) {
> > -           bad_page(page, "first_page not consistent", 0);
> > -           return 1;
> > +   if (unlikely(compound_head(page) != head_page)) {
> > +           bad_page(page, "compound_head not consistent", 0);
> > +           goto out;
> 
> Same here, although for a DEBUG_VM config only it's not as important.

Ditto.

> 
> >     }
> > -   return 0;
> > +   ret = 0;
> > +out:
> > +   clear_compound_head(page);
> > +   return ret;
> >  }
> >  
> >  static void __meminit __init_single_page(struct page *page, unsigned long 
> > pfn,
> > @@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn)
> >  #else
> >  static inline void init_reserved_page(unsigned long pfn)
> >  {
> > +   /* Avoid false-positive PageTail() */
> > +   INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
> >  }
> >  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> >  
> > diff --git a/mm/swap.c b/mm/swap.c
> > index a3a0a2f1f7c3..faa9e1687dea 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -200,7 +200,7 @@ out_put_single:
> >                             __put_single_page(page);
> >                     return;
> >             }
> > -           VM_BUG_ON_PAGE(page_head != page->first_page, page);
> > +           VM_BUG_ON_PAGE(page_head != compound_head(page), page);
> >             /*
> >              * We can release the refcount taken by
> >              * get_page_unless_zero() now that
> > @@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
> >      *  Case 3 is possible, as we may race with
> >      *  __split_huge_page_refcount tearing down a THP page.
> >      */
> > -   page_head = compound_head_by_tail(page);
> > +   page_head = compound_head(page);
> 
> This is also in a path after PageTail() is true.

We can only save one branch here: other PageTail() is most likely in other
compilation unit and compiler would not be able to eliminate additional
load.
Why bother?

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to