Re: [PATCH 1/5] mm: rmap: fix cache flush on THP pages

2022-01-21 Thread Yang Shi
On Thu, Jan 20, 2022 at 11:56 PM Muchun Song  wrote:
>
> The flush_cache_page() only removes a PAGE_SIZE sized range from the cache.
> However, that does not cover all the pages of a THP, only the head page.
> Replace it with flush_cache_range() to fix this issue. At least, no
> problems have been observed due to this, maybe because architectures that
> have virtually indexed caches are rare.

Yeah, actually flush_cache_page()/flush_cache_range() are no-ops on
most architectures that support THP, e.g. x86, aarch64, powerpc, etc.

And currently just tmpfs and read-only files support PMD-mapped THP,
and neither has to do writeback. It seems DAX doesn't do writeback
either; it uses __set_page_dirty_no_writeback() for set_page_dirty.
So IIUC this code should never be called.
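
To spell out what the one-liner changes on an architecture where these
calls are not no-ops, here is an illustrative sketch (not proposed code);
flushing the PMD-sized range is what flushing every subpage by hand would
amount to:

    unsigned long addr;

    /*
     * flush_cache_page() covers a single PAGE_SIZE mapping, so covering
     * the whole THP by hand would look roughly like this:
     */
    for (addr = address; addr < address + HPAGE_PMD_SIZE; addr += PAGE_SIZE)
            flush_cache_page(vma, addr,
                             page_to_pfn(page) + (addr - address) / PAGE_SIZE);

    /* ...which is exactly the range the patched call expresses: */
    flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);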

But anyway your fix looks correct to me.

Reviewed-by: Yang Shi


>
> Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use 
> page_vma_mapped_walk()")
> Signed-off-by: Muchun Song 
> ---
>  mm/rmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fd9dc19eba..65670cb805d6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -974,7 +974,7 @@ static bool page_mkclean_one(struct page *page, struct 
> vm_area_struct *vma,
> if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
> continue;
>
> -   flush_cache_page(vma, address, page_to_pfn(page));
> +   flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> entry = pmdp_invalidate(vma, address, pmd);
> entry = pmd_wrprotect(entry);
> entry = pmd_mkclean(entry);
> --
> 2.11.0
>



Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-14 Thread Yang Shi
On Tue, Apr 13, 2021 at 8:00 PM Huang, Ying  wrote:
>
> Yang Shi  writes:
>
> > The generic migration path will check the refcount, so there is no need to
> > check it here. But the old code actually prevented migrating shared THP
> > (mapped by multiple processes), so bail out early if the mapcount is > 1 to
> > keep that behavior.
>
> What prevents us from migrating shared THP?  If nothing does, why not just
> remove the old refcount checking?

We could migrate shared THP if we don't care about the bouncing back and
forth between nodes that Zi Yan described. The other reason is, as I
mentioned in the cover letter, that I'd like to keep the behavior as
consistent as possible before and after the change for now. The old
behavior does prevent migrating shared THP, so I did the same in this
series. We definitely could optimize the behavior later on.

>
> Best Regards,
> Huang, Ying
>
> > Signed-off-by: Yang Shi 
> > ---
> >  mm/migrate.c | 16 
> >  1 file changed, 4 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index a72994c68ec6..dc7cc7f3a124 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t 
> > *pgdat, struct page *page)
> >
> >   VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
> >
> > + /* Do not migrate THP mapped by multiple processes */
> > + if (PageTransHuge(page) && page_mapcount(page) > 1)
> > + return 0;
> > +
> >   /* Avoid migrating to a node that is nearly full */
> >   if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
> >   return 0;
> > @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t 
> > *pgdat, struct page *page)
> >   if (isolate_lru_page(page))
> >   return 0;
> >
> > - /*
> > -  * migrate_misplaced_transhuge_page() skips page migration's usual
> > -  * check on page_count(), so we must do it here, now that the page
> > -  * has been isolated: a GUP pin, or any other pin, prevents migration.
> > -  * The expected page count is 3: 1 for page's mapcount and 1 for the
> > -  * caller's pin and 1 for the reference taken by isolate_lru_page().
> > -  */
> > - if (PageTransHuge(page) && page_count(page) != 3) {
> > - putback_lru_page(page);
> > - return 0;
> > - }
> > -
> >   page_lru = page_is_file_lru(page);
> >   mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
> >   thp_nr_pages(page));


Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling

2021-04-14 Thread Yang Shi
On Tue, Apr 13, 2021 at 7:44 PM Huang, Ying  wrote:
>
> Yang Shi  writes:
>
> > When THP NUMA fault support was added, THP migration was not supported yet.
> > So an ad hoc THP migration was implemented in the NUMA fault handling.  Since
> > v4.14 THP migration has been supported, so it doesn't make much sense to keep
> > another THP migration implementation rather than using the generic migration
> > code.
> >
> > This patch reworks the NUMA fault handling to use the generic migration
> > implementation to migrate misplaced pages.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
> >
> > If migration fails, just restore the old normal PMD.
> >
> > In the old code anon_vma lock was needed to serialize THP migration
> > against THP split, but since then the THP code has been reworked a lot,
> > it seems anon_vma lock is not required anymore to avoid the race.
> >
> > The page refcount elevation while holding the ptl should prevent THP
> > split.
> >
> > Use migrate_misplaced_page() for both base page and THP NUMA hinting
> > fault and remove all the dead and duplicate code.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  include/linux/migrate.h |  23 --
> >  mm/huge_memory.c| 143 ++--
> >  mm/internal.h   |  18 
> >  mm/migrate.c| 177 
> >  4 files changed, 77 insertions(+), 284 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 4bb4e519e3f5..163d6f2b03d1 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
> >  #endif
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > -extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> > struct vm_area_struct *vma, int node);
> >  #else
> > -static inline bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > - return false;
> > -}
> >  static inline int migrate_misplaced_page(struct page *page,
> >struct vm_area_struct *vma, int node)
> >  {
> > @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page 
> > *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > - struct vm_area_struct *vma,
> > - pmd_t *pmd, pmd_t entry,
> > - unsigned long address,
> > - struct page *page, int node);
> > -#else
> > -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > - struct vm_area_struct *vma,
> > - pmd_t *pmd, pmd_t entry,
> > - unsigned long address,
> > - struct page *page, int node)
> > -{
> > - return -EAGAIN;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> > -
> > -
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 35cac4aeaf68..94981907fd4c 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault 
> > *vmf)
> >  {
> >   struct vm_area_struct *vma = vmf->vma;
> >   pmd_t pmd = vmf->orig_pmd;
> > - struct anon_vma *anon_vma = NULL;
> > + pmd_t oldpmd;
>
> nit: the usage of oldpmd and pmd in the function appears not very
> consistent.  How about making oldpmd == vmf->orig_pmd always, and making
> pmd the changed one?

Thanks for the suggestion. Yes, that seems neater. Will fix it in the
next version.

>
> Best Regards,
> Huang, Ying
>
> >   struct page *page;
> >   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > - int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> > + int page_nid = NUMA_NO_NODE;
> >   int target_nid,

[v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported

2021-04-13 Thread Yang Shi
A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
NUMA balancing and THP.  But S390 doesn't support THP migration, so NUMA
balancing actually can't migrate any misplaced pages.

Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be wasted
by pointless NUMA hinting faults on S390.

Signed-off-by: Yang Shi 
---
 mm/huge_memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94981907fd4c..f63445f3a17d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1741,6 +1741,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned 
long old_addr,
  * Returns
  *  - 0 if PMD could not be locked
  *  - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ *  or if prot_numa but THP migration is not supported
  *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -1755,6 +1756,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
+   if (prot_numa && !thp_migration_supported())
+   return 1;
+
ptl = __pmd_trans_huge_lock(pmd, vma);
if (!ptl)
return 0;
-- 
2.26.2



[v2 PATCH 3/7] mm: thp: refactor NUMA fault handling

2021-04-13 Thread Yang Shi
When THP NUMA fault support was added, THP migration was not supported yet.
So an ad hoc THP migration was implemented in the NUMA fault handling.  Since v4.14
THP migration has been supported, so it doesn't make much sense to keep
another THP migration implementation rather than using the generic migration
code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate misplaced pages.  There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails, just restore the old normal PMD.
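
For illustration, a heavily simplified sketch of the refactored flow (this is
not the exact code in the patch; error handling, statistics, the NUMA_NO_NODE
case and the PMD restore path are omitted):

    vm_fault_t do_huge_pmd_numa_page_sketch(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
            struct page *page;
            int page_nid, target_nid, flags = 0;

            /* 1. Acquire ptl and revalidate the PMD. */
            vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
            if (unlikely(!pmd_same(vmf->orig_pmd, *vmf->pmd))) {
                    spin_unlock(vmf->ptl);
                    return 0;
            }

            /* 2. Prepare for migration: numa_migrate_prep() elevates the refcount. */
            page = vm_normal_page_pmd(vma, haddr, *vmf->pmd);
            page_nid = page_to_nid(page);
            target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);

            /* 3. Release ptl before the potentially sleeping migration. */
            spin_unlock(vmf->ptl);

            /* 4./5. Isolate from the LRU and migrate via the generic path. */
            if (migrate_misplaced_page(page, vma, target_nid))
                    return 0;       /* migrated */

            /* On failure, the old normal PMD is restored under ptl (omitted). */
            return 0;
    }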

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but since then the THP code has been reworked a lot and
it seems the anon_vma lock is not required anymore to avoid the race.

The page refcount elevation while holding the ptl should prevent THP
split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting
fault and remove all the dead and duplicate code.

Signed-off-by: Yang Shi 
---
 include/linux/migrate.h |  23 --
 mm/huge_memory.c| 143 ++--
 mm/internal.h   |  18 
 mm/migrate.c| 177 
 4 files changed, 77 insertions(+), 284 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4bb4e519e3f5..163d6f2b03d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node);
 #else
-static inline bool pmd_trans_migrating(pmd_t pmd)
-{
-   return false;
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node)
 {
@@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, pmd_t entry,
-   unsigned long address,
-   struct page *page, int node);
-#else
-static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, pmd_t entry,
-   unsigned long address,
-   struct page *page, int node)
-{
-   return -EAGAIN;
-}
-#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
-
 #ifdef CONFIG_MIGRATION
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35cac4aeaf68..94981907fd4c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
pmd_t pmd = vmf->orig_pmd;
-   struct anon_vma *anon_vma = NULL;
+   pmd_t oldpmd;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-   int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
+   int page_nid = NUMA_NO_NODE;
int target_nid, last_cpupid = -1;
-   bool page_locked;
bool migrated = false;
-   bool was_writable;
+   bool was_writable = pmd_savedwrite(pmd);
int flags = 0;
 
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-   if (unlikely(!pmd_same(pmd, *vmf->pmd)))
-   goto out_unlock;
-
-   /*
-* If there are potential migrations, wait for completion and retry
-* without disrupting NUMA hinting information. Do not relock and
-* check_same as the page may no longer be mapped.
-*/
-   if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
-   page = pmd_page(*vmf->pmd);
-   if (!get_page_unless_zero(page))
-   goto out_unlock;
+   if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
spin_unlock(vmf->ptl);
-   put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
goto out;
}
 
-   page = pmd_page(pmd);
-   BUG_ON(is_huge_zero_page(page));
-   page_nid = page_to_nid(page);
-   last_cpupid = page_cpupid_last(page);
-   count_vm_numa_event(NUMA_HINT_FAULTS);
-   if (page_nid == this_nid) {
-   count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-   flags |= TNF_FAULT_LOCAL;
-   }
-
-   /* See similar comment in do_numa_pa

[v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly

2021-04-13 Thread Yang Shi
Now that both base page and THP NUMA migration are done via
migrate_misplaced_page(), keep the counters correct for THP.
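
For illustration, a toy userspace calculation of the drift the old per-page
accounting caused for THP (the 512 subpages value assumes 2MB THPs on x86_64;
this is not kernel code):

    #include <stdio.h>

    int main(void)
    {
            long nr_isolated = 0;
            const long thp_nr_pages = 512;  /* 2MB THP on x86_64, for illustration */

            /* old code: isolation added thp_nr_pages, putback subtracted 1 */
            nr_isolated += thp_nr_pages;
            nr_isolated -= 1;
            printf("NR_ISOLATED_* leak with dec_node_page_state(): %ld\n", nr_isolated);

            /* fixed code: putback subtracts the same thp_nr_pages */
            nr_isolated = 0;
            nr_isolated += thp_nr_pages;
            nr_isolated -= thp_nr_pages;
            printf("NR_ISOLATED_* leak with mod_node_page_state(..., -nr_pages): %ld\n",
                   nr_isolated);
            return 0;
    }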

Signed-off-by: Yang Shi 
---
 mm/migrate.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 333448aa53f1..a473f25fbd01 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2111,6 +2111,7 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
LIST_HEAD(migratepages);
new_page_t *new;
bool compound;
+   unsigned int nr_pages = thp_nr_pages(page);
 
/*
 * PTE mapped THP or HugeTLB page can't reach here so the page could
@@ -2149,13 +2150,13 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
-   dec_node_page_state(page, NR_ISOLATED_ANON +
-   page_is_file_lru(page));
+   mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+   page_is_file_lru(page), -nr_pages);
putback_lru_page(page);
}
isolated = 0;
} else
-   count_vm_numa_event(NUMA_PAGE_MIGRATE);
+   count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);
BUG_ON(!list_empty(&migratepages));
return isolated;
 
-- 
2.26.2



[v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page

2021-04-13 Thread Yang Shi
The old behavior didn't split the THP if migration failed due to lack of
memory on the target node.  But the generic THP migration path does split the
THP in that case, so skip the split to keep the old behavior for misplaced
NUMA page migration.

Signed-off-by: Yang Shi 
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a473f25fbd01..a72994c68ec6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1417,6 +1417,7 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
int swapwrite = current->flags & PF_SWAPWRITE;
int rc, nr_subpages;
LIST_HEAD(ret_pages);
+   bool nosplit = (reason == MR_NUMA_MISPLACED);
 
trace_mm_migrate_pages_start(mode, reason);
 
@@ -1488,8 +1489,9 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
/*
 * When memory is low, don't bother to try to migrate
 * other pages, just exit.
+* THP NUMA faulting doesn't split THP to retry.
 */
-   if (is_thp) {
+   if (is_thp && !nosplit) {
if (!try_split_thp(page, &page2, from)) {
nr_thp_split++;
goto retry;
-- 
2.26.2



[v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-13 Thread Yang Shi
The generic migration path will check the refcount, so there is no need to check
it here.  But the old code actually prevented migrating shared THP (mapped by
multiple processes), so bail out early if the mapcount is > 1 to keep that behavior.

Signed-off-by: Yang Shi 
---
 mm/migrate.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a72994c68ec6..dc7cc7f3a124 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
struct page *page)
 
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
+   /* Do not migrate THP mapped by multiple processes */
+   if (PageTransHuge(page) && page_mapcount(page) > 1)
+   return 0;
+
/* Avoid migrating to a node that is nearly full */
if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
return 0;
@@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
struct page *page)
if (isolate_lru_page(page))
return 0;
 
-   /*
-* migrate_misplaced_transhuge_page() skips page migration's usual
-* check on page_count(), so we must do it here, now that the page
-* has been isolated: a GUP pin, or any other pin, prevents migration.
-* The expected page count is 3: 1 for page's mapcount and 1 for the
-* caller's pin and 1 for the reference taken by isolate_lru_page().
-*/
-   if (PageTransHuge(page) && page_count(page) != 3) {
-   putback_lru_page(page);
-   return 0;
-   }
-
page_lru = page_is_file_lru(page);
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
thp_nr_pages(page));
-- 
2.26.2



[v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-13 Thread Yang Shi


Changelog:
v1 --> v2:
* Adopted the suggestion from Gerald Schaefer to skip huge PMD for S390
  for now.
* Used PageTransHuge to distinguish base page or THP instead of a new
  parameter for migrate_misplaced_page() per Huang Ying.
* Restored PMD lazily to avoid unnecessary TLB shootdown per Huang Ying.
* Skipped shared THP.
* Updated counters correctly.
* Rebased to linux-next (next-20210412).

When THP NUMA fault support was added, THP migration was not supported yet.
So an ad hoc THP migration was implemented in the NUMA fault handling.  Since v4.14
THP migration has been supported, so it doesn't make much sense to keep
another THP migration implementation rather than using the generic migration
code.  It is definitely a maintenance burden to keep two THP migration
implementations for different code paths, and it is more error prone.  Using the
generic THP migration implementation allows us to remove the duplicate code and
some hacks needed by the old ad hoc implementation.

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both THP
and NUMA balancing.  Most of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before but it was not
accepted due to the design of the S390 PMD.  For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1 it is acceptable to skip huge
PMDs for S390 for now.

I saw there were some hacks about gup in the git history, but I didn't figure out
whether they have been removed or not, since I just found the FOLL_NUMA code in
the current gup implementation and it seems useful.

I'm trying to keep the behavior as consistent as possible between before and
after.  But there is still some minor disparity.  For example, file THP won't
get migrated at all in the old implementation due to the anon_vma check, but
the new implementation doesn't need to acquire the anon_vma lock anymore, so
file THP might get migrated.  Not sure if this behavior needs to be
kept.

Patch #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips changing huge PMDs to PROT_NONE if THP migration is not supported.

Yang Shi (7):
  mm: memory: add orig_pmd to struct vm_fault
  mm: memory: make numa_migrate_prep() non-static
  mm: thp: refactor NUMA fault handling
  mm: migrate: account THP NUMA migration counters correctly
  mm: migrate: don't split THP for misplaced NUMA page
  mm: migrate: check mapcount for THP instead of ref count
  mm: thp: skip make PMD PROT_NONE if THP migration is not supported

 include/linux/huge_mm.h |   9 ++---
 include/linux/migrate.h |  23 ---
 include/linux/mm.h  |   3 ++
 mm/huge_memory.c| 156 +---
 mm/internal.h   |  21 ++
 mm/memory.c |  31 +++
 mm/migrate.c| 204 +--
 7 files changed, 123 insertions(+), 324 deletions(-)



[v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static

2021-04-13 Thread Yang Shi
numa_migrate_prep() will be used by the huge page NUMA fault handling as well in
the following patch, so make it non-static.

Signed-off-by: Yang Shi 
---
 mm/internal.h | 3 +++
 mm/memory.c   | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f469f69309de..b6f889d9c22d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -659,4 +659,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned 
long end,
 
 void vunmap_range_noflush(unsigned long start, unsigned long end);
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid, int *flags);
+
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index e0cbffeceb0b..bbb38066021a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4119,9 +4119,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
return ret;
 }
 
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-   unsigned long addr, int page_nid,
-   int *flags)
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid, int *flags)
 {
get_page(page);
 
-- 
2.26.2



[v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault

2021-04-13 Thread Yang Shi
Add orig_pmd to struct vm_fault so that the "orig_pmd" parameter used by the huge
page fault handlers can be removed, just like its PTE counterpart.

Signed-off-by: Yang Shi 
---
 include/linux/huge_mm.h |  9 -
 include/linux/mm.h  |  3 +++
 mm/huge_memory.c|  9 ++---
 mm/memory.c | 26 +-
 4 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9626fda5efce..3b38070f4337 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
  struct vm_area_struct *vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
  struct vm_area_struct *vma);
@@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault 
*vmf, pud_t orig_pud)
 }
 #endif
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
   unsigned long addr, pmd_t *pmd,
   unsigned int flags);
@@ -283,7 +283,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 
@@ -429,8 +429,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
 }
 
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf,
-   pmd_t orig_pmd)
+static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25b9041f9925..9c5856f8cc81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -547,6 +547,9 @@ struct vm_fault {
 * the 'address'
 */
pte_t orig_pte; /* Value of PTE at the time of fault */
+   pmd_t orig_pmd; /* Value of PMD at the time of fault,
+* used by PMD fault only.
+*/
 
struct page *cow_page;  /* Page handler may use for COW fault */
struct page *page;  /* ->fault handlers should return a
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..35cac4aeaf68 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1251,11 +1251,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t 
orig_pud)
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
+void huge_pmd_set_accessed(struct vm_fault *vmf)
 {
pmd_t entry;
unsigned long haddr;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+   pmd_t orig_pmd = vmf->orig_pmd;
 
vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
@@ -1272,11 +1273,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t 
orig_pmd)
spin_unlock(vmf->ptl);
 }
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+   pmd_t orig_pmd = vmf->orig_pmd;
 
vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1412,9 +1414,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
+   pmd_t pmd = vmf->orig_pmd;
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
diff --git a/mm/memory.c b/mm/memory.c
index 4e358601c5d6..e0cbffeceb0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4232,12 +4232,12 @@ static inline vm_fault_t create_huge_pmd(struct 
vm_fault *vmf)
 }
 
 /* `inline' is required to avoid gcc 4.1.2 b

Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-09 Thread Yang Shi
On Thu, Apr 8, 2021 at 7:58 PM Huang, Ying  wrote:
>
> Yang Shi  writes:
>
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt  wrote:
> >>
> >> Hi Tim,
> >>
> >> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  
> >> wrote:
> >> >
> >> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> >> > others NUMA wise, but a byte of media has about the same cost whether it
> >> > is close or far.  But, with new memory tiers such as Persistent Memory
> >> > (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> >> > PMEM.
> >> >
> >> > The fast/expensive memory lives in the top tier of the memory hierarchy.
> >> >
> >> > Previously, the patchset
> >> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> >> > https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
> >> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >> >
> >> > And the patchset
> >> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for 
> >> > memory tiering system
> >> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
> >> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> >> > leveraging autonuma.
> >> >
> >> > The two patchsets together keep the hot pages in DRAM and colder pages
> >> > in PMEM.
> >>
> >> Thanks for working on this as this is becoming more and more important
> >> particularly in the data centers where memory is a big portion of the
> >> cost.
> >>
> >> I see you have responded to Michal and I will add my more specific
> >> response there. Here I wanted to give my high level concern regarding
> >> using v1's soft limit like semantics for top tier memory.
> >>
> >> This patch series aims to distribute/partition top tier memory between
> >> jobs of different priorities. We want high priority jobs to have
> >> preferential access to the top tier memory and we don't want low
> >> priority jobs to hog the top tier memory.
> >>
> >> Using v1's soft limit like behavior can potentially cause high
> >> priority jobs to stall to make enough space on top tier memory on
> >> their allocation path and I think this patchset is aiming to reduce
> >> that impact by making kswapd do that work. However I think the more
> >> concerning issue is the low priority job hogging the top tier memory.
> >>
> >> The possible ways the low priority job can hog the top tier memory are
> >> by allocating non-movable memory or by mlocking the memory. (Oh there
> >> is also pinning the memory but I don't know if there is a user api to
> >> pin memory?) For the mlocked memory, you need to either modify the
> >> reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
>
> To optimize the page placement of a process between DRAM and PMEM, we
> want to place the hot pages in DRAM and the cold pages in PMEM.  But the
> memory accessing pattern changes overtime, so we need to migrate pages
> between DRAM and PMEM to adapt to the changing.
>
> To avoid the hot pages be pinned in PMEM always, one way is to online
> the PMEM as movable zones.  If so, and if the low priority jobs are
> restricted by cpuset to allocate from PMEM only, we may fail to run
> quite some workloads as being discussed in the following threads,
>
> https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.t...@intel.com/

Thanks for sharing the thread. It seems the configuration of movable
zone + node bind is not supported very well, or needs to evolve to support
new use cases.

>
> >>
> >> Basically I am saying we should put the upfront control (limit) on the
> >> usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
>
> Per my understanding, memory.low/min are for the memory protection
> instead of the memory limiting.  memory.high is for the memory limiting.

Yes, it is not a limit. I just misused the term; I actually do mean
protection but typed "limit". Sorry for the confusion.

>
> Best Regards,
> Huang, Ying


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-09 Thread Yang Shi
On Fri, Apr 9, 2021 at 8:50 AM Dave Hansen  wrote:
>
> On 4/8/21 11:17 AM, Oscar Salvador wrote:
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct 
> > compact_control *cc,
> >   cc->nr_migratepages -= nr_reclaimed;
> >
> > ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> > - NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> > + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> > + NULL);
> >   }
> >   if (ret < 0) {
> >   putback_movable_pages(&cc->migratepages);
>
> I also considered passing NULL to mean "I don't care about
> nr_succeeded".  I mostly avoided it to reduce churn.  But, looking at it
> here, it does seem cleaner.
>
> Any objections to moving over to Oscar's suggestion?

No, fine to me.


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-09 Thread Yang Shi
On Thu, Apr 8, 2021 at 10:06 PM Oscar Salvador  wrote:
>
> On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote:
> > Thanks a lot for the example code. You didn't miss anything. At first
> > glance, I thought your suggestion seemed neater. Actually I
> > misunderstood what Dave said about "That could really have caused some
> > interesting problems." with multiple calls to migrate_pages(). I was
> > thinking about:
> >
> > unsigned long foo()
> > {
> > unsigned long *ret_succeeded;
> >
> > migrate_pages(..., ret_succeeded);
> >
> > migrate_pages(..., ret_succeeded);
> >
> > return *ret_succeeded;
> > }
>
> But that would not be a problem either. I mean, I am not sure what
> foo() is supposed to do.
> I assume it is supposed to return the *total* number of pages that were
> migrated?
>
> Then could do something like:
>
>  unsigned long foo()
>  {
>  unsigned long ret_succeeded;
>  unsigned long total_succeeded = 0;
>
>  migrate_pages(..., &ret_succeeded);
>  total_succeeded += ret_succeeded;
>
>  migrate_pages(..., &ret_succeeded);
>  total_succeeded += ret_succeeded;
>
>  return total_succeeded;
>  }
>
>  But AFAICS, you would have to do that with Wei Xu's version and with
>  mine, no difference there.

It is because nr_succeeded is reset for each migrate_pages() call.

You could do "*ret_succeeded += nr_succeeded" if we want an
accumulated counter, then you don't have to add total_succeeded. And
since nr_succeeded is reset for each migrate_pages() call, both the vm
counters and the trace point are happy.
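
A toy userspace illustration of that accumulating out-parameter pattern (the
names and behavior here are made up for the example; this is not the kernel
API):

    #include <stdio.h>

    /*
     * Stand-in for migrate_pages(): nr_succeeded is local and therefore
     * reset on every call, while the caller-supplied counter accumulates.
     */
    static int fake_migrate_pages(unsigned int nr_to_migrate,
                                  unsigned int *ret_succeeded)
    {
            unsigned int nr_succeeded = nr_to_migrate;      /* pretend all succeed */

            /* vm counters / trace points would consume nr_succeeded here */
            if (ret_succeeded)
                    *ret_succeeded += nr_succeeded;
            return 0;
    }

    int main(void)
    {
            unsigned int total = 0;

            fake_migrate_pages(3, &total);
            fake_migrate_pages(5, &total);
            printf("accumulated: %u\n", total);     /* prints 8 */
            return 0;
    }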

>
> IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning
> of the function, and never reset back, which means, we would carry the
> sum of previous nr_succeeded instead of the nr_succeeded in that round.
> That would be misleading for e.g: reclaim in case we were to call
> migrate_pages() several times, as instead of a delta value, nr_succeeded
> would accumulate.

I think the most straightforward concern is the vm counters and trace
point in migrate_pages(): if migrate_pages() is called multiple times
we may see messed-up counters if nr_succeeded is not reset properly.
Of course both your and Wei's suggestions solve this problem.

But if we have a use case which returns nr_succeeded and calls
migrate_pages() multiple times, I think we do want to return the
accumulated value IMHO.

>
> But that won't happen with either Wei Xu's version or mine.
>
> --
> Oscar Salvador
> SUSE L3


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Yang Shi
On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt  wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi  wrote:
> >
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt  wrote:
> > >
> > > Hi Tim,
> > >
> > > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  
> > > wrote:
> > > >
> > > > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster 
> > > > than
> > > > others NUMA wise, but a byte of media has about the same cost whether it
> > > > is close or far.  But, with new memory tiers such as Persistent Memory
> > > > (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> > > > PMEM.
> > > >
> > > > The fast/expensive memory lives in the top tier of the memory hierarchy.
> > > >
> > > > Previously, the patchset
> > > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > > https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
> > > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > > >
> > > > And the patchset
> > > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for 
> > > > memory tiering system
> > > > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
> > > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > > leveraging autonuma.
> > > >
> > > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > > in PMEM.
> > >
> > > Thanks for working on this as this is becoming more and more important
> > > particularly in the data centers where memory is a big portion of the
> > > cost.
> > >
> > > I see you have responded to Michal and I will add my more specific
> > > response there. Here I wanted to give my high level concern regarding
> > > using v1's soft limit like semantics for top tier memory.
> > >
> > > This patch series aims to distribute/partition top tier memory between
> > > jobs of different priorities. We want high priority jobs to have
> > > preferential access to the top tier memory and we don't want low
> > > priority jobs to hog the top tier memory.
> > >
> > > Using v1's soft limit like behavior can potentially cause high
> > > priority jobs to stall to make enough space on top tier memory on
> > > their allocation path and I think this patchset is aiming to reduce
> > > that impact by making kswapd do that work. However I think the more
> > > concerning issue is the low priority job hogging the top tier memory.
> > >
> > > The possible ways the low priority job can hog the top tier memory are
> > > by allocating non-movable memory or by mlocking the memory. (Oh there
> > > is also pinning the memory but I don't know if there is a user api to
> > > pin memory?) For the mlocked memory, you need to either modify the
> > > reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
> >
>
> Yes, that's an extreme way to overcome the issue, but we can do something
> less extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > >
> > > Basically I am saying we should put the upfront control (limit) on the
> > > usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
> >
>
> The low and min limits have semantics similar to the v1's soft limit

Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Yang Shi
On Thu, Apr 8, 2021 at 11:17 AM Oscar Salvador  wrote:
>
> On Thu, Apr 08, 2021 at 10:26:54AM -0700, Yang Shi wrote:
>
> > Thanks, Oscar. Yes, kind of. But we have to remember to initialize
> > "nr_succedded" pointer properly for every migrate_pages() callsite,
> > right? And it doesn't prevent from returning wrong value if
> > migrate_pages() is called multiple times by one caller although there
> > might be not such case (calls migrate_pages() multiple times and care
> > about nr_succeded) for now.
>
> Hi Yang,
>
> I might be missing something but AFAICS you only need to initialize the
> nr_succeeded pointer where it matters.
> The local nr_succeeded in migrate_pages() doesn't go away, so it gets
> initialized to 0 every time you call into it.
> And if you pass a valid pointer, *ret_succeeded == nr_succeeded.
>
> I am talking about this (not even compile-tested):
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..fd661cb2ce13 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping,
> struct page *newpage, struct page *page,
> enum migrate_mode mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t 
> free,
> -   unsigned long private, enum migrate_mode mode, int reason);
> +   unsigned long private, enum migrate_mode mode, int reason,
> +   unsigned int *ret_succeeded);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long 
> private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
>  extern void putback_movable_page(struct page *page);
> @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space 
> *mapping,
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
> free_page_t free, unsigned long private, enum migrate_mode 
> mode,
> -   int reason)
> +   int reason, unsigned int *ret_succeeded)
> { return -ENOSYS; }
>  static inline struct page *alloc_migration_target(struct page *page,
> unsigned long private)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..7238e8faff04 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
>
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> compaction_free, (unsigned long)cc, cc->mode,
> -   MR_COMPACTION);
> +   MR_COMPACTION, NULL);
>
> trace_mm_compaction_migratepages(cc->nr_migratepages, err,
> &cc->migratepages);
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..b70d463aa1fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
> put_page(pages[i]);
>
> if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
> -   (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> +   (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, NULL)) {
> /*
>  * some of the pages failed migration. Do 
> get_user_pages
>  * without migration.
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..a17e0f039076 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page)
>
> if (isolate_page(hpage, &pagelist)) {
> ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -   (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
> +   (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
> +   NULL);
> if (!ret) {
> bool release = !huge;
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..28496376de94 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
> end_pfn)
> if (nodes_empty(nmask))
> node_set(mtc.nid, nmask);
> ret = migrate_pages(&source, alloc_migration_target, NULL,
> -  

Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Yang Shi
On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt  wrote:
>
> Hi Tim,
>
> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  wrote:
> >
> > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > others NUMA wise, but a byte of media has about the same cost whether it
> > is close or far.  But, with new memory tiers such as Persistent Memory
> > (PMEM), there is a choice between fast/expensive DRAM and slow/cheap
> > PMEM.
> >
> > The fast/expensive memory lives in the top tier of the memory hierarchy.
> >
> > Previously, the patchset
> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >
> > And the patchset
> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory 
> > tiering system
> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > leveraging autonuma.
> >
> > The two patchsets together keep the hot pages in DRAM and colder pages
> > in PMEM.
>
> Thanks for working on this as this is becoming more and more important
> particularly in the data centers where memory is a big portion of the
> cost.
>
> I see you have responded to Michal and I will add my more specific
> response there. Here I wanted to give my high level concern regarding
> using v1's soft limit like semantics for top tier memory.
>
> This patch series aims to distribute/partition top tier memory between
> jobs of different priorities. We want high priority jobs to have
> preferential access to the top tier memory and we don't want low
> priority jobs to hog the top tier memory.
>
> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is the low priority job hogging the top tier memory.
>
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.

Do you mean a long term pin? RDMA should be able to simply pin the
memory for weeks. A lot of transient pins come from Direct I/O; they
should be less of a concern.

The low priority jobs should be able to be restricted by cpuset, for
example, just keeping them on second tier memory nodes. Then all the
above problems are gone.

>
> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.

This sounds similar to what I talked about at LSFMM 2019
(https://lwn.net/Articles/787418/). We used to have some potential
use cases which divided the DRAM:PMEM ratio for different jobs or memcgs
when I was with Alibaba.

In the first place I thought about a per NUMA node limit, but it was
very hard for users to configure it correctly unless you know exactly
your memory usage and hot/cold memory distribution.

I'm wondering, just off the top of my head, if we could extend the
semantics of the low and min limits. For example, just redefine low and min
to be "the limit on top tier memory". Then we could have low priority
jobs have a 0 low/min limit.

>


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Yang Shi
On Thu, Apr 8, 2021 at 3:14 AM Oscar Salvador  wrote:
>
> On Thu, Apr 01, 2021 at 11:32:23AM -0700, Dave Hansen wrote:
> >
> > From: Yang Shi 
> >
> > The migrate_pages() returns the number of pages that were not migrated,
> > or an error code.  When returning an error code, there is no way to know
> > how many pages were migrated or not migrated.
> >
> > In the following patch, migrate_pages() is used to demote pages to PMEM
> > node, we need to account for how many pages are reclaimed (demoted) since page
> > reclaim behavior depends on this.  Add *nr_succeeded parameter to make
> > migrate_pages() return how many pages are demoted successfully for all
> > cases.
> >
> > Signed-off-by: Yang Shi 
> > Signed-off-by: Dave Hansen 
> > Reviewed-by: Yang Shi 
> > Cc: Wei Xu 
> > Cc: Huang Ying 
> > Cc: Dan Williams 
> > Cc: David Hildenbrand 
> > Cc: osalvador 
> >
>
> ...
> >  int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >   free_page_t put_new_page, unsigned long private,
> > - enum migrate_mode mode, int reason)
> > + enum migrate_mode mode, int reason, unsigned int 
> > *nr_succeeded)
> >  {
> >   int retry = 1;
> >   int thp_retry = 1;
> >   int nr_failed = 0;
> > - int nr_succeeded = 0;
> >   int nr_thp_succeeded = 0;
> >   int nr_thp_failed = 0;
> >   int nr_thp_split = 0;
> > @@ -1611,10 +1611,10 @@ retry:
> >   case MIGRATEPAGE_SUCCESS:
> >   if (is_thp) {
> >   nr_thp_succeeded++;
> > - nr_succeeded += nr_subpages;
> > + *nr_succeeded += nr_subpages;
> >   break;
> >   }
> > - nr_succeeded++;
> > + (*nr_succeeded)++;
> >   break;
> >   default:
> >   /*
> > @@ -1643,12 +1643,12 @@ out:
> >*/
> >   list_splice(_pages, from);
> >
> > - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> > + count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
> >   count_vm_events(PGMIGRATE_FAIL, nr_failed);
> >   count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
> >   count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
> >   count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> > - trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> > + trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
> >  nr_thp_failed, nr_thp_split, mode, reason);
>
> It seems that reclaim is the only user who cares about how many pages
> we migrated; could we not do the following instead:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 695a594e5860..d4170b7ea2fe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1503,7 +1503,7 @@ static inline int try_split_thp(struct page *page, 
> struct page **page2,
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
> free_page_t put_new_page, unsigned long private,
> -   enum migrate_mode mode, int reason)
> +   enum migrate_mode mode, int reason, unsigned int 
> *ret_succeeded)
>  {
> int retry = 1;
> int thp_retry = 1;
> @@ -1654,6 +1654,9 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
> if (!swapwrite)
> current->flags &= ~PF_SWAPWRITE;
>
> +   if (ret_succeeded)
> +   *ret_succeeded = nr_succeeded;
> +
> return rc;
>  }
>
>  And pass only a valid pointer from demote_page_list() and NULL from all
>  the others?
>  I was just wondering about all those "unsigned int nr_succeeded" in all
>  the other functions.
>  This would also solve the "be careful to initialize nr_succeeded"
>  problem?

Thanks, Oscar. Yes, kind of. But we have to remember to initialize the
"nr_succeeded" pointer properly for every migrate_pages() callsite,
right? And it doesn't prevent returning a wrong value if
migrate_pages() is called multiple times by one caller, although there
might be no such case (calling migrate_pages() multiple times and caring
about nr_succeeded) for now.

So IMHO I do prefer Wei's suggestion to have migrate_pages()
initialize nr_succeeded. This seems simpler.
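
For comparison, a userspace-only sketch of the variant I prefer here, where
the callee zeroes the counter so call sites cannot forget to initialize it
(names are illustrative; this is not the actual patch):

    #include <stdio.h>

    /* Stand-in for migrate_pages() initializing its own output counter. */
    static int migrate_pages_sketch(unsigned int nr_to_migrate,
                                    unsigned int *nr_succeeded)
    {
            *nr_succeeded = 0;                      /* callee initializes */
            *nr_succeeded += nr_to_migrate;         /* pretend all succeed */
            return 0;
    }

    int main(void)
    {
            unsigned int nr_succeeded;              /* caller did not initialize it */

            migrate_pages_sketch(4, &nr_succeeded);
            printf("nr_succeeded: %u\n", nr_succeeded);     /* well-defined: 4 */
            return 0;
    }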


>
>
> --
> Oscar Salvador
> SUSE L3


Re: [PATCH v2 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

2021-04-07 Thread Yang Shi
On Tue, Apr 6, 2021 at 8:06 PM  wrote:
>
> From: Yanfei Xu 
>
> We could check MMF_DISABLE_THP ahead of iterating over all of the VMAs.
> Otherwise, if an mm_struct contains a large number of VMAs, a meaningless
> amount of CPU cycles will be spent.

Reviewed-by: Yang Shi 

>
> Signed-off-by: Yanfei Xu 
> ---
>  mm/khugepaged.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a6012b9259a2..f4ad25a7db55 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
>  */
> if (unlikely(!mmap_read_trylock(mm)))
> goto breakouterloop_mmap_lock;
> +   if (test_bit(MMF_DISABLE_THP, >flags))
> +   goto breakouterloop;
> if (likely(!khugepaged_test_exit(mm)))
> vma = find_vma(mm, khugepaged_scan.address);
>
> --
> 2.27.0
>


Re: [PATCH v2 1/2] mm: khugepaged: use macro to align addresses

2021-04-07 Thread Yang Shi
On Tue, Apr 6, 2021 at 8:06 PM  wrote:
>
> From: Yanfei Xu 
>
> We could use macros to deal with the addresses which need to be aligned,
> to improve the readability of the code.

Reviewed-by: Yang Shi 
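
As a quick sanity check that the macros compute the same addresses as the
open-coded mask arithmetic they replace, a small userspace test
(HPAGE_PMD_SIZE and the macro definitions below just mirror the kernel ones
for illustration):

    #include <assert.h>
    #include <stdio.h>

    #define HPAGE_PMD_SIZE   (2UL << 20)             /* assume 2MB huge pages */
    #define HPAGE_PMD_MASK   (~(HPAGE_PMD_SIZE - 1))
    #define ALIGN(x, a)      (((x) + (a) - 1) & ~((a) - 1))
    #define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

    int main(void)
    {
            unsigned long vm_start = 0x40001000UL, vm_end = 0x409ff000UL;
            unsigned long hstart = ALIGN(vm_start, HPAGE_PMD_SIZE);
            unsigned long hend = ALIGN_DOWN(vm_end, HPAGE_PMD_SIZE);

            /* same results as the open-coded expressions being replaced */
            assert(hstart == ((vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK));
            assert(hend == (vm_end & HPAGE_PMD_MASK));
            printf("hstart=%#lx hend=%#lx\n", hstart, hend);
            return 0;
    }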

>
> Signed-off-by: Yanfei Xu 
> ---
>  mm/khugepaged.c | 27 +--
>  1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a7d6cb912b05..a6012b9259a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -517,8 +517,8 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> if (!hugepage_vma_check(vma, vm_flags))
> return 0;
>
> -   hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> -   hend = vma->vm_end & HPAGE_PMD_MASK;
> +   hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
> +   hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
> if (hstart < hend)
> return khugepaged_enter(vma, vm_flags);
> return 0;
> @@ -979,8 +979,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, 
> unsigned long address,
> if (!vma)
> return SCAN_VMA_NULL;
>
> -   hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> -   hend = vma->vm_end & HPAGE_PMD_MASK;
> +   hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
> +   hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
> if (address < hstart || address + HPAGE_PMD_SIZE > hend)
> return SCAN_ADDRESS_RANGE;
> if (!hugepage_vma_check(vma, vma->vm_flags))
> @@ -1070,7 +1070,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct mmu_notifier_range range;
> gfp_t gfp;
>
> -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +   VM_BUG_ON(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>
> /* Only allocate from the target node */
> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> @@ -1235,7 +1235,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> int node = NUMA_NO_NODE, unmapped = 0;
> bool writable = false;
>
> -   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +   VM_BUG_ON(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>
> pmd = mm_find_pmd(mm, address);
> if (!pmd) {
> @@ -1414,7 +1414,7 @@ static int khugepaged_add_pte_mapped_thp(struct 
> mm_struct *mm,
>  {
> struct mm_slot *mm_slot;
>
> -   VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
> +   VM_BUG_ON(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
>
> spin_lock(&khugepaged_mm_lock);
> mm_slot = get_mm_slot(mm);
> @@ -1437,7 +1437,7 @@ static int khugepaged_add_pte_mapped_thp(struct 
> mm_struct *mm,
>   */
>  void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>  {
> -   unsigned long haddr = addr & HPAGE_PMD_MASK;
> +   unsigned long haddr = ALIGN_DOWN(addr, HPAGE_PMD_SIZE);
> struct vm_area_struct *vma = find_vma(mm, haddr);
> struct page *hpage;
> pte_t *start_pte, *pte;
> @@ -1584,7 +1584,7 @@ static void retract_page_tables(struct address_space 
> *mapping, pgoff_t pgoff)
> if (vma->anon_vma)
> continue;
> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << 
> PAGE_SHIFT);
> -   if (addr & ~HPAGE_PMD_MASK)
> +   if (!IS_ALIGNED(addr, HPAGE_PMD_SIZE))
> continue;
> if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> continue;
> @@ -2070,7 +2070,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
>  {
> struct mm_slot *mm_slot;
> struct mm_struct *mm;
> -   struct vm_area_struct *vma;
> +   struct vm_area_struct *vma = NULL;
> int progress = 0;
>
> VM_BUG_ON(!pages);
> @@ -2092,7 +2092,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
>  * Don't wait for semaphore (to avoid long wait times).  Just move to
>  * the next mm on the list.
>  */
> -   vma = NULL;
> if (unlikely(!mmap_read_trylock(mm)))
> goto breakouterloop_mmap_lock;
> if (likely(!khugepaged_test_exit(mm)))
> @@ -2112,15 +2111,15 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
> progress++;
> continue;
> }
> -   hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> -   hend = vma->vm_end & HPAGE_PMD_MASK;
> +   hstart = ALIGN(

Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-07 Thread Yang Shi
On Wed, Apr 7, 2021 at 1:32 AM Mel Gorman  wrote:
>
> On Tue, Apr 06, 2021 at 09:42:07AM -0700, Yang Shi wrote:
> > On Tue, Apr 6, 2021 at 5:03 AM Gerald Schaefer
> >  wrote:
> > >
> > > On Thu, 1 Apr 2021 13:10:49 -0700
> > > Yang Shi  wrote:
> > >
> > > [...]
> > > > > >
> > > > > > Yes, it could be. The old behavior of migration was to return 
> > > > > > -ENOMEM
> > > > > > if THP migration is not supported then split THP. That behavior was
> > > > > > not very friendly to some usecases, for example, memory policy and
> > > > > > migration lieu of reclaim (the upcoming). But I don't mean we 
> > > > > > restore
> > > > > > the old behavior. We could split THP if it returns -ENOSYS and the
> > > > > > page is THP.
> > > > >
> > > > > OK, as long as we don't get any broken PMD migration entries 
> > > > > established
> > > > > for s390, some extra THP splitting would be acceptable I guess.
> > > >
> > > > There will be no migration PMD installed. The current behavior is a
> > > > no-op if THP migration is not supported.
> > >
> > > Ok, just for completeness, since Mel also replied that the split
> > > was not done on other architectures "because the loss from splitting
> > > exceeded the gain of improved locality":
> > >
> > > I did not mean to request extra splitting functionality for s390,
> > > simply skipping / ignoring large PMDs would also be fine for s390,
> > > no need to add extra complexity.
> >
> > Thank you. It could make life easier. The current code still converts
> > huge PMD to PROTNONE even though THP migration is not supported. It is
> > easy to skip such PMDs, so cycles are saved on pointless NUMA
> > hinting page faults.
> >
> > Will do so in v2 if no objection from Mel as well.
>
> I did not get a chance to review this in time but if a v2 shows up,
> I'll at least run it through a battery of tests to measure the impact
> and hopefully find the time to do a proper review. Superficially I'm not
> opposed to using generic code for migration because even if it shows up a
> problem, it would be better to optimise the generic implementation than
> carry two similar implementations. I'm undecided on whether s390 should
> split+migrate rather than skip because I do not have a good overview of
> "typical workloads on s390 that benefit from NUMA balancing".

Thanks, Mel. I don't have an idea about S390 either. I will just skip
huge PMDs for S390 for now as Gerald suggested.

>
> --
> Mel Gorman
> SUSE Labs


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-06 Thread Yang Shi
On Tue, Apr 6, 2021 at 3:05 AM Bharata B Rao  wrote:
>
> On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> > On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao  wrote:
> > >
> > > Hi,
> > >
> > > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > > server (160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > > consumption increases quite a lot (around 172G) when the containers are
> > > running. Most of it comes from slab (149G) and within slab, the majority 
> > > of
> > > it comes from kmalloc-32 cache (102G)
> > >
> > > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > > allocations of list_lru_one list. These lists are created whenever a
> > > FS mount happens. Specially two such lists are registered by 
> > > alloc_super(),
> > > one for dentry and another for inode shrinker list. And these lists
> > > are created for all possible NUMA nodes and for all given memcgs
> > > (memcg_nr_cache_ids to be particular)
> > >
> > > If,
> > >
> > > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > > B = Nr NUMA possible nodes
> > > C = memcg_nr_cache_ids
> > > D = size of each kmalloc-32 object: 32 bytes,
> > >
> > > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > > cache for list_lru creation is A*B*C*D bytes.
> >
> > Yes, this is exactly what the current implementation does.
> >
> > >
> > > Following factors contribute to the excessive allocations:
> > >
> > > - Lists are created for possible NUMA nodes.
> >
> > Yes, because filesystem caches (dentry and inode) are NUMA aware.
>
> True, but creating lists for possible nodes as against online nodes
> can hurt platforms where possible is typically higher than online.

I suppose that is just because hotplug doesn't handle memcg list_lru
creation/deletion.

We get a much simpler and less error-prone implementation by wasting some memory.

>
> >
> > > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and 
> > > additional
> > >   list_lrus are created when it grows. Thus we end up creating 
> > > list_lru_one
> > >   list_heads even for those memcgs which are yet to be created.
> > >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reaches
> > >   a value of 12286.
> > > - When a memcg goes offline, the list elements are drained to the parent
> > >   memcg, but the list_head entry remains.
> > > - The lists are destroyed only when the FS is unmounted. So list_heads
> > >   for non-existing memcgs remain and continue to contribute to the
> > >   kmalloc-32 allocation. This is presumably done for performance
> > >   reason as they get reused when new memcgs are created, but they end up
> > >   consuming slab memory until then.
> >
> > The current implementation has list_lrus attached to the super_block. So
> > the list can't be freed until the super block is unmounted.
> >
> > I'm looking into consolidating list_lrus more closely with memcgs. It
> > means the list_lrus will have the same life cycles as memcgs rather
> > than filesystems. This may improve things somewhat. But I suppose
> > the filesystem will be unmounted once the container exits and the
> > memcgs will get offlined for your usecase.
>
> Yes, but when the containers are still running, the lists that get
> created for non-existing memcgs and non-relavent memcgs are the main
> cause of increased memory consumption.

Since the kernel doesn't know about containers, it simply can't tell
which memcgs are non-relevant.

>
> >
> > > - In case of containers, a few file systems get mounted and are specific
> > >   to the container namespace and hence to a particular memcg, but we
> > >   end up creating lists for all the memcgs.
> >
> > Yes, because the kernel is *NOT* aware of containers.
> >
> > >   As an example, if 7 FS mounts are done for every container and when
> > >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> > >   lists for each NUMA node. It appears that no elements will get added
> > >   to other than 2*7=14 of them in the case of containers.
> > >
> > > One straight forward way to prevent this excessive list_lru_one
> > > allocations is to limit the list_lru_one creation only to the
> > > relevant memcg. However I don't see an easy way to figure out
> > > that relevant memcg from F

Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

2021-04-06 Thread Yang Shi
On Mon, Apr 5, 2021 at 8:05 PM Xu, Yanfei  wrote:
>
>
>
> On 4/6/21 10:51 AM, Xu, Yanfei wrote:
> >
> >
> > On 4/6/21 2:20 AM, Yang Shi wrote:
> >>
> >> On Sun, Apr 4, 2021 at 8:33 AM  wrote:
> >>>
> >>> From: Yanfei Xu 
> >>>
> >>> We could check MMF_DISABLE_THP ahead of iterating over all of vma.
> >>> Otherwise, if some mm_struct contains a large number of vmas, a lot of
> >>> meaningless CPU cycles will be wasted.
> >>>
> >>> BTW, drop an unnecessary cond_resched(), because there is another
> >>> cond_resched() following it and no costly work between them.
> >>>
> >>> Signed-off-by: Yanfei Xu 
> >>> ---
> >>>   mm/khugepaged.c | 3 ++-
> >>>   1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 2efe1d0c92ed..c293ec4a94ea 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -2094,6 +2094,8 @@ static unsigned int
> >>> khugepaged_scan_mm_slot(unsigned int pages,
> >>>   */
> >>>  if (unlikely(!mmap_read_trylock(mm)))
> >>>  goto breakouterloop_mmap_lock;
> >>> +   if (test_bit(MMF_DISABLE_THP, &mm->flags))
> >>> +   goto breakouterloop_mmap_lock;
> >>
> >> It is fine to check this flag. But mmap_lock has been acquired so you
> >> should jump to breakouterloop.
> >
> > Oops! It's my fault. Thank you for pointing out this.
> > Will fix it in v2.
> >
> >>
> >>>  if (likely(!khugepaged_test_exit(mm)))
> >>>  vma = find_vma(mm, khugepaged_scan.address);
> >>>
> >>> @@ -2101,7 +2103,6 @@ static unsigned int
> >>> khugepaged_scan_mm_slot(unsigned int pages,
> >>>  for (; vma; vma = vma->vm_next) {
> >>>  unsigned long hstart, hend;
> >>>
> >>> -   cond_resched();
> >>
> >> I don't have a strong opinion for removing this cond_resched(). But
> >> IIUC khugepaged is a best effort job there is no harm to keep it IMHO.
> >>
> >
> > Yes, keeping it does no harm. But I think we should only add it when we need it.
> > Look at the code below: there are only some simple checks between these
> > two cond_resched().  And we still have some cond_resched() in
> > khugepaged_scan_file() and khugepaged_scan_pmd(), which do the actual
> > work of collapsing. So I think it is unnecessary.  :)
> >
>
> BTW, the original author might have added this cond_resched() out of worry
> that hugepage_vma_check() would always return false due to MMF_DISABLE_THP.
> But now we have moved that check out of the for loop iterating over vmas.

A little bit of archeology showed the cond_resched() was there in the
first place even before MMF_DISABLE_THP was introduced.

>
> um.. That is my guess..
>
> Thanks,
> Yanfei
>
> >  for (; vma; vma = vma->vm_next) {
> >  unsigned long hstart, hend;
> >
> >  cond_resched(); //here
> >  if (unlikely(khugepaged_test_exit(mm))) {
> >  progress++;
> >  break;
> >  }
> >  if (!hugepage_vma_check(vma, vma->vm_flags)) {
> > skip:
> >  progress++;
> >  continue;
> >  }
> >  hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
> >  hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
> >  if (hstart >= hend)
> >  goto skip;
> >  if (khugepaged_scan.address > hend)
> >  goto skip;
> >  if (khugepaged_scan.address < hstart)
> >  khugepaged_scan.address = hstart;
> >  VM_BUG_ON(!IS_ALIGNED(khugepaged_scan.address,
> > HPAGE_PMD_SIZE));
> >
> >  if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
> >  goto skip;
> >
> >  while (khugepaged_scan.address < hend) {
> >  int ret;
> >  cond_resched();//here
> >
> >
> >>>  if (unlikely(khugepaged_test_exit(mm))) {
> >>>  progress++;
> >>>  break;
> >>> --
> >>> 2.27.0
> >>>
> >>>


Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-06 Thread Yang Shi
On Tue, Apr 6, 2021 at 5:03 AM Gerald Schaefer
 wrote:
>
> On Thu, 1 Apr 2021 13:10:49 -0700
> Yang Shi  wrote:
>
> [...]
> > > >
> > > > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > > > if THP migration is not supported then split THP. That behavior was
> > > > not very friendly to some usecases, for example, memory policy and
> > > > migration lieu of reclaim (the upcoming). But I don't mean we restore
> > > > the old behavior. We could split THP if it returns -ENOSYS and the
> > > > page is THP.
> > >
> > > OK, as long as we don't get any broken PMD migration entries established
> > > for s390, some extra THP splitting would be acceptable I guess.
> >
> > There will be no migration PMD installed. The current behavior is a
> > no-op if THP migration is not supported.
>
> Ok, just for completeness, since Mel also replied that the split
> was not done on other architectures "because the loss from splitting
> exceeded the gain of improved locality":
>
> I did not mean to request extra splitting functionality for s390,
> simply skipping / ignoring large PMDs would also be fine for s390,
> no need to add extra complexity.

Thank you. It could make life easier. The current code still converts
huge PMD to PROTNONE even though THP migration is not supported. It is
easy to skip such PMDs, so cycles are saved on pointless NUMA
hinting page faults.

Will do so in v2 if no objection from Mel as well.


Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

2021-04-05 Thread Yang Shi
On Sun, Apr 4, 2021 at 8:33 AM  wrote:
>
> From: Yanfei Xu 
>
> We could check MMF_DISABLE_THP ahead of iterating over all of vma.
> Otherwise, if some mm_struct contains a large number of vmas, a lot of
> meaningless CPU cycles will be wasted.
>
> BTW, drop an unnecessary cond_resched(), because there is another
> cond_resched() following it and no costly work between them.
>
> Signed-off-by: Yanfei Xu 
> ---
>  mm/khugepaged.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2efe1d0c92ed..c293ec4a94ea 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
>  */
> if (unlikely(!mmap_read_trylock(mm)))
> goto breakouterloop_mmap_lock;
> +   if (test_bit(MMF_DISABLE_THP, &mm->flags))
> +   goto breakouterloop_mmap_lock;

It is fine to check this flag. But mmap_lock has been acquired so you
should jump to breakouterloop.

> if (likely(!khugepaged_test_exit(mm)))
> vma = find_vma(mm, khugepaged_scan.address);
>
> @@ -2101,7 +2103,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
> int pages,
> for (; vma; vma = vma->vm_next) {
> unsigned long hstart, hend;
>
> -   cond_resched();

I don't have a strong opinion for removing this cond_resched(). But
IIUC khugepaged is a best effort job there is no harm to keep it IMHO.

> if (unlikely(khugepaged_test_exit(mm))) {
> progress++;
> break;
> --
> 2.27.0
>
>


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-05 Thread Yang Shi
On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao  wrote:
>
> Hi,
>
> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> server (160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> consumption increases quite a lot (around 172G) when the containers are
> running. Most of it comes from slab (149G) and within slab, the majority of
> it comes from kmalloc-32 cache (102G)
>
> The major allocator of kmalloc-32 slab cache happens to be the list_head
> allocations of list_lru_one list. These lists are created whenever a
> FS mount happens. Specially two such lists are registered by alloc_super(),
> one for dentry and another for inode shrinker list. And these lists
> are created for all possible NUMA nodes and for all given memcgs
> (memcg_nr_cache_ids to be particular)
>
> If,
>
> A = Nr allocation request per mount: 2 (one for dentry and inode list)
> B = Nr NUMA possible nodes
> C = memcg_nr_cache_ids
> D = size of each kmalloc-32 object: 32 bytes,
>
> then for every mount, the amount of memory consumed by kmalloc-32 slab
> cache for list_lru creation is A*B*C*D bytes.

Yes, this is exactly what the current implementation does.
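
As a rough back-of-the-envelope check of that formula with the numbers in
this report: per mount, A*B*C*D = 2 * 2 * 12286 * 32 bytes, which is about
1.5 MiB; with 7 mounts per container that is roughly 10.5 MiB per container,
and for 10k containers about 102 GiB in total, which lines up with the ~102G
of kmalloc-32 usage quoted above.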

>
> Following factors contribute to the excessive allocations:
>
> - Lists are created for possible NUMA nodes.

Yes, because filesystem caches (dentry and inode) are NUMA aware.

> - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
>   list_lrus are created when it grows. Thus we end up creating list_lru_one
>   list_heads even for those memcgs which are yet to be created.
>   For example, when 10000 memcgs are created, memcg_nr_cache_ids reaches
>   a value of 12286.
> - When a memcg goes offline, the list elements are drained to the parent
>   memcg, but the list_head entry remains.
> - The lists are destroyed only when the FS is unmounted. So list_heads
>   for non-existing memcgs remain and continue to contribute to the
>   kmalloc-32 allocation. This is presumably done for performance
>   reason as they get reused when new memcgs are created, but they end up
>   consuming slab memory until then.

The current implementation has list_lrus attached to the super_block. So
the list can't be freed until the super block is unmounted.

I'm looking into consolidating list_lrus more closely with memcgs. It
means the list_lrus will have the same life cycles as memcgs rather
than filesystems. This may improve things somewhat. But I suppose
the filesystem will be unmounted once the container exits and the
memcgs will get offlined for your usecase.

> - In case of containers, a few file systems get mounted and are specific
>   to the container namespace and hence to a particular memcg, but we
>   end up creating lists for all the memcgs.

Yes, because the kernel is *NOT* aware of containers.

>   As an example, if 7 FS mounts are done for every container and when
>   10k containers are created, we end up creating 2*7*12286 list_lru_one
>   lists for each NUMA node. It appears that no elements will get added
>   to other than 2*7=14 of them in the case of containers.
>
> One straight forward way to prevent this excessive list_lru_one
> allocations is to limit the list_lru_one creation only to the
> relevant memcg. However I don't see an easy way to figure out
> that relevant memcg from FS mount path (alloc_super())
>
> As an alternative approach, I have this below hack that does lazy
> list_lru creation. The memcg-specific list is created and initialized
> only when there is a request to add an element to that particular
> list. Though I am not sure about the full impact of this change
> on the owners of the lists and also the performance impact of this,
> the overall savings look good.

It is fine to reduce the memory consumption for your usecase, but I'm
not sure whether this would incur any noticeable overhead for vfs
operations, since list_lru_add() is called quite often. However, it
only needs to allocate the list once (for each memcg + filesystem),
so the overhead might be fine.

And I'm wondering how much memory can be saved for real life workload.
I don't expect most containers are idle in production environments.

Added some more memcg/list_lru experts in this loop, they may have better ideas.

>
> Used memory
>             Before  During  After
> W/o patch   23G     172G    40G
> W/  patch   23G     69G     29G
>
> Slab consumption
>             Before  During  After
> W/o patch   1.5G    149G    22G
> W/  patch   1.5G    45G     10G
>
> Number of kmalloc-32 allocations
>             Before  During      After
> W/o patch   178176  3442409472  388933632
> W/  patch   190464  468992      468992
>
> Any thoughts on other approaches to address this scenario and
> any specific comments about the approach that I have taken is
> appreciated. Meanwhile the patch looks 

Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-01 Thread Yang Shi
On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner  wrote:
>
> On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song 
> > > > >  wrote:
> > > > > >
> > > > > > Since the following patchsets applied. All the kernel memory are 
> > > > > > charged
> > > > > > with the new APIs of obj_cgroup.
> > > > > >
> > > > > > [v17,00/19] The new cgroup slab memory controller
> > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > > >
> > > > > > But user memory allocations (LRU pages) pinning memcgs for a long 
> > > > > > time -
> > > > > > it exists at a larger scale and is causing recurring problems in 
> > > > > > the real
> > > > > > world: page cache doesn't get reclaimed for a long time, or is used 
> > > > > > by the
> > > > > > second, third, fourth, ... instance of the same job that was 
> > > > > > restarted into
> > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste 
> > > > > > memory,
> > > > > > and make page reclaim very inefficient.
> > > > > >
> > > > > > We can convert LRU pages and most other raw memcg pins to the objcg 
> > > > > > direction
> > > > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > > > >
> > > > > > This patchset aims to make the LRU pages to drop the reference to 
> > > > > > memory
> > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that 
> > > > > > the number
> > > > > > of the dying cgroups will not increase if we run the following test 
> > > > > > script.
> > > > > >
> > > > > > ```bash
> > > > > > #!/bin/bash
> > > > > >
> > > > > > cat /proc/cgroups | grep memory
> > > > > >
> > > > > > cd /sys/fs/cgroup/memory
> > > > > >
> > > > > > for i in range{1..500}
> > > > > > do
> > > > > > mkdir test
> > > > > > echo $$ > test/cgroup.procs
> > > > > > sleep 60 &
> > > > > > echo $$ > cgroup.procs
> > > > > > echo `cat test/cgroup.procs` > cgroup.procs
> > > > > > rmdir test
> > > > > > done
> > > > > >
> > > > > > cat /proc/cgroups | grep memory
> > > > > > ```
> > > > > >
> > > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > > Patch 2-5 are code cleanup and simplification.
> > > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > >
> > > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > > the long running systems, the root memcg will become a dumping ground.
> > > > > In addition a job running multiple times on a machine will see
> > > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > > reparented to the root memcg.
> > > >
> > > > I agree, but also the reparenting is not the perfect thing in a 
> > > > combination
> > > > with any memory protections (e.g. memory.low).
> > > >
> > > > Imagine the following configuration:
> > > > workload.slice
> > > > - workload_gen_1.service   memory.min = 30G
> > > > - workload_gen_2.service   memory.min = 30G
> > > > - workload_gen_3.service   memory.min = 30G
> > > >   ...
> > > >
> > > > Parent cgroup and several generations of the child cgroup, protected by 
> > > > a memory.low.
> > > > Once the memory is getting reparented, it's not protected anymore.
> > >
> > > That doesn't sound right.
> > >
> > > A deleted cgroup today exerts no control over its abandoned
> > > pages. css_reset() will blow out any control settings.
> >
> > I know. Currently it works in the following way: once cgroup gen_1 is 
> > deleted,
> > it's memory is not protected anymore, so eventually it's getting evicted and
> > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change 
> > this,
> > of course. But long-term we likely wanna re-charge such pages to new cgroups
> > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups 
> > doesn't
> > help and likely will complicate this change. So I'm a bit skeptical here.
>
> We should be careful with the long-term plans.

Excuse me for a dumb question. I recall we did reparent LRU pages
before (before 4.x kernel). I vaguely recall there were some tricky
race conditions during reparenting so we didn't do it anymore once
reclaimer could reclaim from offlined memcgs. My memory may be wrong;
if so, please feel free to correct me. If my memory is correct, does
that mean the race conditions are gone? Or have the new obj_cgroup APIs made
life easier?

Thanks,
Yang

>
> The zombie issue is a pretty urgent concern that has caused several
> production emergencies now. It needs a fix sooner rather than later.
>
> The long-term plans of how to handle shared/reused data better will
> require some time to work out. There are MANY open questions around
> recharging to arbitrary foreign 

Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-01 Thread Yang Shi
On Wed, Mar 31, 2021 at 6:20 AM Mel Gorman  wrote:
>
> On Tue, Mar 30, 2021 at 04:42:00PM +0200, Gerald Schaefer wrote:
> > Could there be a work-around by splitting THP pages instead of marking them
> > as migrate pmds (via pte swap entries), at least when THP migration is not
> > supported? I guess it could also be acceptable if THP pages were simply not
> > migrated for NUMA balancing on s390, but then we might need some extra 
> > config
> > option to make that behavior explicit.
> >
>
> The split is not done on other architectures simply because the loss
> from splitting exceeded the gain of improved locality in too many cases.
> However, it might be ok as an s390-specific workaround.
>
> (Note, I haven't read the rest of the series due to lack of time but this
> query caught my eye).

Will wait for your comments before I post v2. Thanks.

>
> --
> Mel Gorman
> SUSE Labs


Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-04-01 Thread Yang Shi
On Wed, Mar 31, 2021 at 4:47 AM Gerald Schaefer
 wrote:
>
> On Tue, 30 Mar 2021 09:51:46 -0700
> Yang Shi  wrote:
>
> > On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
> >  wrote:
> > >
> > > On Mon, 29 Mar 2021 11:33:06 -0700
> > > Yang Shi  wrote:
> > >
> > > >
> > > > When the THP NUMA fault support was added THP migration was not 
> > > > supported yet.
> > > > So the ad hoc THP migration was implemented in NUMA fault handling.  
> > > > Since v4.14
> > > > THP migration has been supported so it doesn't make too much sense to 
> > > > still keep
> > > > another THP migration implementation rather than using the generic 
> > > > migration
> > > > code.  It is definitely a maintenance burden to keep two THP migration
> > > > implementation for different code paths and it is more error prone.  
> > > > Using the
> > > > generic THP migration implementation allows us remove the duplicate 
> > > > code and
> > > > some hacks needed by the old ad hoc implementation.
> > > >
> > > > A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support 
> > > > both THP
> > > > and NUMA balancing.  The most of them support THP migration except for 
> > > > S390.
> > > > Zi Yan tried to add THP migration support for S390 before but it was not
> > > > accepted due to the design of S390 PMD.  For the discussion, please see:
> > > > https://lkml.org/lkml/2018/4/27/953.
> > > >
> > > > I'm not expert on S390 so not sure if it is feasible to support THP 
> > > > migration
> > > > for S390 or not.  If it is not feasible then the patchset may make THP 
> > > > NUMA
> > > > balancing not be functional on S390.  Not sure if this is a show 
> > > > stopper although
> > > > the patchset does simplify the code a lot.  Anyway it seems worth 
> > > > posting the
> > > > series to the mailing list to get some feedback.
> > >
> > > The reason why THP migration cannot work on s390 is because the migration
> > > code will establish swap ptes in a pmd. The pmd layout is very different 
> > > from
> > > the pte layout on s390, so you cannot simply write a swap pte into a pmd.
> > > There are no separate swp primitives for swap/migration pmds, IIRC. And 
> > > even
> > > if there were, we'd still need to find some space for a present bit in the
> > > s390 pmd, and/or possibly move around some other bits.
> > >
> > > A lot of things can go wrong here, even if it could be possible in theory,
> > > by introducing separate swp primitives in common code for pmd entries, 
> > > along
> > > with separate offset, type, shift, etc. I don't see that happening in the
> > > near future.
> >
> > Thanks a lot for elaboration. IIUC, implementing migration PMD entry
> > is *not* prevented by hardware, but it may be very tricky to
> > implement, right?
>
> Well, it depends. The HW is preventing proper full-blown swap + migration
> support for PMD, similar to what we have for PTE, because we simply don't
> have enough OS-defined bits in the PMD. A 5-bit swap type for example,
> similar to a PTE, plus the PFN would not be possible.
>
> The HW would not prevent a similar mechanism in principle, i.e. we could
> mark it as invalid to trigger a fault, and have some magic bits that tell
> the fault handler or migration code what it is about.
>
> For handling migration aspects only, w/o any swap device or other support, a
> single type bit could already be enough, to indicate read/write migration,
> plus a "present" bit similar to PTE. But even those 2 bits would be hard to
> find, though I would not entirely rule that out. That would be the tricky
> part.
>
> Then of course, common code would need some changes, to reflect the
> different swap/migration (type) capabilities of PTE and PMD entries.
> Not sure if such an approach would be acceptable for common code.
>
> But this is just some very abstract and optimistic view, I have not
> really properly looked into the details. So it might be even more
> tricky, or not possible at all.

Thanks a lot for the elaboration.

>
> >
> > >
> > > Not sure if this is a show stopper, but I am not familiar enough with
> > > NUMA and migration code to judge. E.g., I do not see any swp entry action
> > > in your patches, but I assume this is implicitly triggered by the switch
> > &

Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-04-01 Thread Yang Shi
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.
>
> Signed-off-by: Dave Hansen 
> Cc: Wei Xu 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> Changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
>
> Changes since 20210304:
>  * Add Documentation/ material about relaxing cpuset constraints

Reviewed-by: Yang Shi 

> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |   12 
>  b/include/linux/swap.h|3 ++-
>  b/include/uapi/linux/mempolicy.h  |1 +
>  b/mm/vmscan.c |6 --
>  4 files changed, 19 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 
> Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-31 
> 15:17:40.324000190 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-31 15:17:40.349000190 
> -0700
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1  Zone reclaim on
>  2  Zone reclaim writes dirty pages out
>  4  Zone reclaim swaps pages
> +8  Zone reclaim migrates pages
>  =  ===
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,14 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.  It may move data to a NUMA node that does not
> +fall into the cpuset of the allocating process which might be construed
> +to violate the guarantees of cpusets.  This should not be enabled on
> +systems which need strict cpuset location guarantees.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.331000190 
> -0700
> +++ b/include/linux/swap.h  2021-03-31 15:17:40.351000190 -0700
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
> /* Is any node_reclaim_mode bit set? */
> -   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +   RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 
> include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE2021-03-31 
> 15:17:40.337000190 -0700
> +++ b/include/uapi/linux/mempolicy.h2021-03-31 15:17:40.352000190 -0700
> @@ -71,5 +71,6 @@ enum {
>  #define RECLAIM_ZONE   (1
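
For illustration, here is a small user-space sketch of how the new
RECLAIM_MIGRATE bit is meant to combine with the existing mode bits. The
bit values follow the table in the vm.rst hunk above and the check mirrors
the node_reclaim_enabled() hunk, but this is only a sketch, not the kernel
implementation:

#include <stdio.h>

/* bit values as documented in the zone_reclaim_mode table above */
#define RECLAIM_ZONE    (1 << 0)        /* 1: zone reclaim on */
#define RECLAIM_WRITE   (1 << 1)        /* 2: write out dirty pages */
#define RECLAIM_UNMAP   (1 << 2)        /* 4: swap pages */
#define RECLAIM_MIGRATE (1 << 3)        /* 8: migrate (demote) pages */

static unsigned int node_reclaim_mode;

/* mirrors the node_reclaim_enabled() change in the swap.h hunk above */
static int node_reclaim_enabled(void)
{
        return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
                                    RECLAIM_UNMAP|RECLAIM_MIGRATE);
}

int main(void)
{
        /* e.g. writing 9 to vm.zone_reclaim_mode: zone reclaim + demotion */
        node_reclaim_mode = RECLAIM_ZONE | RECLAIM_MIGRATE;

        printf("mode=%u reclaim_enabled=%d demotion=%d\n",
               node_reclaim_mode,
               node_reclaim_enabled() != 0,
               !!(node_reclaim_mode & RECLAIM_MIGRATE));
        return 0;
}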

Re: [PATCH 05/10] mm/migrate: demote pages during reclaim

2021-04-01 Thread Yang Shi
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> This is mostly derived from a patch from Yang Shi:
>
> 
> https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/
>
> Add code to the reclaim path (shrink_page_list()) to "demote" data
> to another NUMA node instead of discarding the data.  This always
> avoids the cost of I/O needed to read the page back in and sometimes
> avoids the writeout cost when the pagee is dirty.

s/pagee/page

>
> A second pass through shrink_page_list() will be made if any demotions
> fail.  This essentally falls back to normal reclaim behavior in the

s/essentally/essentially

> case that demotions fail.  Previous versions of this patch may have
> simply failed to reclaim pages which were eligible for demotion but
> were unable to be demoted in practice.
>
> Note: This just adds the start of infratructure for migration. It is

s/infratructure/infrastructure

> actually disabled next to the FIXME in migrate_demote_page_ok().
>
> Signed-off-by: Dave Hansen 
> Cc: Wei Xu 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: osalvador 
>
> --
> changes from 20210122:
>  * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE (Ying)
>
> changes from 202010:
>  * add MR_NUMA_MISPLACED to trace MIGRATE_REASON define
>  * make migrate_demote_page_ok() static, remove 'sc' arg until
>later patch
>  * remove unnecessary alloc_demote_page() hugetlb warning
>  * Simplify alloc_demote_page() gfp mask.  Depend on
>__GFP_NORETRY to make it lightweight instead of fancier
>stuff like leaving out __GFP_IO/FS.
>  * Allocate migration page with alloc_migration_target()
>instead of allocating directly.
> changes from 20200730:
>  * Add another pass through shrink_page_list() when demotion
>fails.
> changes from 20210302:
>  * Use __GFP_THISNODE and revise the comment explaining the
>    GFP mask construction

Other than some typos above this patch looks good to me. Reviewed-by:
Yang Shi 

Another nit below:

> ---
>
>  b/include/linux/migrate.h|9 
>  b/include/trace/events/migrate.h |3 -
>  b/mm/vmscan.c|   82 
> +++
>  3 files changed, 93 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/migrate.h~demote-with-migrate_pages 
> include/linux/migrate.h
> --- a/include/linux/migrate.h~demote-with-migrate_pages 2021-03-31 
> 15:17:15.842000251 -0700
> +++ b/include/linux/migrate.h   2021-03-31 15:17:15.853000251 -0700
> @@ -27,6 +27,7 @@ enum migrate_reason {
> MR_MEMPOLICY_MBIND,
> MR_NUMA_MISPLACED,
> MR_CONTIG_RANGE,
> +   MR_DEMOTION,
> MR_TYPES
>  };
>
> @@ -196,6 +197,14 @@ struct migrate_vma {
>  int migrate_vma_setup(struct migrate_vma *args);
>  void migrate_vma_pages(struct migrate_vma *migrate);
>  void migrate_vma_finalize(struct migrate_vma *migrate);
> +int next_demotion_node(int node);
> +
> +#else /* CONFIG_MIGRATION disabled: */
> +
> +static inline int next_demotion_node(int node)
> +{
> +   return NUMA_NO_NODE;
> +}
>
>  #endif /* CONFIG_MIGRATION */
>
> diff -puN include/trace/events/migrate.h~demote-with-migrate_pages 
> include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~demote-with-migrate_pages  2021-03-31 
> 15:17:15.846000251 -0700
> +++ b/include/trace/events/migrate.h2021-03-31 15:17:15.853000251 -0700
> @@ -20,7 +20,8 @@
> EM( MR_SYSCALL, "syscall_or_cpuset")\
> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
> EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
> -   EMe(MR_CONTIG_RANGE,"contig_range")
> +   EM( MR_CONTIG_RANGE,"contig_range") \
> +   EMe(MR_DEMOTION,"demotion")
>
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
> --- a/mm/vmscan.c~demote-with-migrate_pages 2021-03-31 15:17:15.848000251 
> -0700
> +++ b/mm/vmscan.c   2021-03-31 15:17:15.856000251 -0700
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1035,6 +1036,23 @@ static enum page_references page_check_r
> return PAGEREF_RECLAIM;
>  }
>
> +static bool migrate_demote_page_ok(struct page *page)
> +{
> +   int next_nid = next_demotion_node(page_to_nid(page));
> +
> +   VM_BUG_ON_PAGE(!PageLocked(page)

Re: [PATCH mmotm] mm: vmscan: fix shrinker_rwsem in free_shrinker_info()

2021-03-31 Thread Yang Shi
On Wed, Mar 31, 2021 at 2:13 PM Hugh Dickins  wrote:
>
> On Wed, 31 Mar 2021, Yang Shi wrote:
> > On Wed, Mar 31, 2021 at 6:54 AM Shakeel Butt  wrote:
> > > On Tue, Mar 30, 2021 at 4:44 PM Hugh Dickins  wrote:
> > > >
> > > > Lockdep warns mm/vmscan.c: suspicious rcu_dereference_protected() usage!
> > > > when free_shrinker_info() is called from mem_cgroup_css_free(): there it
> > > > is called with no locking, whereas alloc_shrinker_info() calls it with
> > > > down_write of shrinker_rwsem - which seems appropriate.  Rearrange that
> > > > so free_shrinker_info() can manage the shrinker_rwsem for itself.
> > > >
> > > > Link: 
> > > > https://lkml.kernel.org/r/20210317140615.GB28839@xsang-OptiPlex-9020
> > > > Reported-by: kernel test robot 
> > > > Signed-off-by: Hugh Dickins 
> > > > Cc: Yang Shi 
> > > > ---
> > > > Sorry, I've made no attempt to work out precisely where in the series
> > > > the locking went missing, nor tried to fit this in as a fix on top of
> > > > mm-vmscan-add-shrinker_info_protected-helper.patch
> > > > which Oliver reported (and which you notated in mmotm's "series" file).
> > > > This patch just adds the fix to the end of the series, after
> > > > mm-vmscan-shrink-deferred-objects-proportional-to-priority.patch
> > >
> > > The patch "mm: vmscan: add shrinker_info_protected() helper" replaces
> > > rcu_dereference_protected(shrinker_info, true) with
> > > rcu_dereference_protected(shrinker_info,
> > > lockdep_is_held(&shrinker_rwsem)).
> > >
> > > I think we don't really need shrinker_rwsem in free_shrinker_info()
> > > which is called from css_free(). The bits of the map have already been
> > > 'reparented' in css_offline. I think we can remove
> > > lockdep_is_held(&shrinker_rwsem) for free_shrinker_info().
> >
> > Thanks, Hugh and Shakeel. I missed the report.
> >
> > I think Shakeel is correct, shrinker_rwsem is not required in css_free
> > path so Shakeel's proposal should be able to fix it.
>
> Yes, looking at it again, I am sure that Shakeel is right, and
> that my patch was overkill - no need for shrinker_rwsem there.
>
> Whether it's RCU-safe to free the info there, I have not reviewed at
> all: but shrinker_rwsem would not help even if there were an issue.
>
> > I prepared a patch:
>
> Unsigned, white-space damaged, so does not apply.
>
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64bf07cc20f2..7348c26d4cac 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -251,7 +251,12 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> > for_each_node(nid) {
> > pn = memcg->nodeinfo[nid];
> > -   info = shrinker_info_protected(memcg, nid);
> > +   /*
> > +* Don't use shrinker_info_protected() helper since
> > +* free_shrinker_info() could be called by css_free()
> > +* without holding shrinker_rwsem.
> > +*/
>
> Just because I mis-inferred from the use of shrinker_info_protected()
> that shrinker_rwsem was needed here, is no reason to add that comment:
> imagine how unhelpfully bigger the kernel source would be if we added
> a comment everywhere I had misunderstood something!

Yes, I agree the comment may incur more confusion. Better remove it.

>
> > +   info = rcu_dereference_protected(pn->shrinker_info, true);
> > kvfree(info);
> > rcu_assign_pointer(pn->shrinker_info, NULL);
> > }
>
> That does it, but I bikeshedded with myself in the encyclopaedic
> rcupdate.h, and decided rcu_replace_pointer(pn->shrinker_info, NULL, true)
> would be best.  But now see that patch won't fit so well into your series,
> and I can't spend more time writing up a justification for it.
>
> I think Andrew should simply delete my fix patch from his queue,
> and edit out the
> @@ -232,7 +239,7 @@ void free_shrinker_info(struct mem_cgrou
>
> for_each_node(nid) {
> pn = memcg->nodeinfo[nid];
> -   info = rcu_dereference_protected(pn->shrinker_info, true);
> +   info = shrinker_info_protected(memcg, nid);
> kvfree(info);
> rcu_assign_pointer(pn->shrinker_info, NULL);
> }
> hunk from your mm-vmscan-add-shrinker_info_protected-helper.patch
> which will then restore free_shrinker_info() to what you propose above.

Yes. I saw Andrew already had this fix in -mm tree.

>
> Thanks,
> Hugh


Re: [PATCH mmotm] mm: vmscan: fix shrinker_rwsem in free_shrinker_info()

2021-03-31 Thread Yang Shi
On Wed, Mar 31, 2021 at 6:54 AM Shakeel Butt  wrote:
>
> On Tue, Mar 30, 2021 at 4:44 PM Hugh Dickins  wrote:
> >
> > Lockdep warns mm/vmscan.c: suspicious rcu_dereference_protected() usage!
> > when free_shrinker_info() is called from mem_cgroup_css_free(): there it
> > is called with no locking, whereas alloc_shrinker_info() calls it with
> > down_write of shrinker_rwsem - which seems appropriate.  Rearrange that
> > so free_shrinker_info() can manage the shrinker_rwsem for itself.
> >
> > Link: https://lkml.kernel.org/r/20210317140615.GB28839@xsang-OptiPlex-9020
> > Reported-by: kernel test robot 
> > Signed-off-by: Hugh Dickins 
> > Cc: Yang Shi 
> > ---
> > Sorry, I've made no attempt to work out precisely where in the series
> > the locking went missing, nor tried to fit this in as a fix on top of
> > mm-vmscan-add-shrinker_info_protected-helper.patch
> > which Oliver reported (and which you notated in mmotm's "series" file).
> > This patch just adds the fix to the end of the series, after
> > mm-vmscan-shrink-deferred-objects-proportional-to-priority.patch
>
> The patch "mm: vmscan: add shrinker_info_protected() helper" replaces
> rcu_dereference_protected(shrinker_info, true) with
> rcu_dereference_protected(shrinker_info,
> lockdep_is_held(&shrinker_rwsem)).
>
> I think we don't really need shrinker_rwsem in free_shrinker_info()
> which is called from css_free(). The bits of the map have already been
> 'reparented' in css_offline. I think we can remove
> lockdep_is_held(&shrinker_rwsem) for free_shrinker_info().

Thanks, Hugh and Shakeel. I missed the report.

I think Shakeel is correct, shrinker_rwsem is not required in css_free
path so Shakeel's proposal should be able to fix it. I prepared a
patch:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64bf07cc20f2..7348c26d4cac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -251,7 +251,12 @@ void free_shrinker_info(struct mem_cgroup *memcg)
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   info = shrinker_info_protected(memcg, nid);
+   /*
+* Don't use shrinker_info_protected() helper since
+* free_shrinker_info() could be called by css_free()
+* without holding shrinker_rwsem.
+*/
+   info = rcu_dereference_protected(pn->shrinker_info, true);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
>
> >
> >  mm/vmscan.c |   10 ++
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > --- mmotm/mm/vmscan.c   2021-03-28 17:26:54.935553064 -0700
> > +++ linux/mm/vmscan.c   2021-03-30 15:55:13.374459559 -0700
> > @@ -249,18 +249,20 @@ void free_shrinker_info(struct mem_cgrou
> > struct shrinker_info *info;
> > int nid;
> >
> > +   down_write(&shrinker_rwsem);
> > for_each_node(nid) {
> > pn = memcg->nodeinfo[nid];
> > info = shrinker_info_protected(memcg, nid);
> > kvfree(info);
> > rcu_assign_pointer(pn->shrinker_info, NULL);
> > }
> > +   up_write(&shrinker_rwsem);
> >  }
> >
> >  int alloc_shrinker_info(struct mem_cgroup *memcg)
> >  {
> > struct shrinker_info *info;
> > -   int nid, size, ret = 0;
> > +   int nid, size;
> > int map_size, defer_size = 0;
> >
> > down_write(&shrinker_rwsem);
> > @@ -270,9 +272,9 @@ int alloc_shrinker_info(struct mem_cgrou
> > for_each_node(nid) {
> > info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> > if (!info) {
> > +   up_write(&shrinker_rwsem);
> > free_shrinker_info(memcg);
> > -   ret = -ENOMEM;
> > -   break;
> > +   return -ENOMEM;
> > }
> > info->nr_deferred = (atomic_long_t *)(info + 1);
> > info->map = (void *)info->nr_deferred + defer_size;
> > @@ -280,7 +282,7 @@ int alloc_shrinker_info(struct mem_cgrou
> > }
> > up_write(&shrinker_rwsem);
> >
> > -   return ret;
> > +   return 0;
> >  }
> >
> >  static inline bool need_expand(int nr_max)


[v2 PATCH] mm: gup: remove FOLL_SPLIT

2021-03-30 Thread Yang Shi
Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
and commit ba925fa35057 ("s390/gmap: improve THP splitting") FOLL_SPLIT
has not been used anymore.  Remove the dead code.

Reviewed-by: John Hubbard 
Signed-off-by: Yang Shi 
---
v2: Remove the reference in documentation.

 Documentation/vm/transhuge.rst |  5 -
 include/linux/mm.h |  1 -
 mm/gup.c   | 28 ++--
 3 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst
index 0ed23e59abe5..216db1d67d04 100644
--- a/Documentation/vm/transhuge.rst
+++ b/Documentation/vm/transhuge.rst
@@ -53,11 +53,6 @@ prevent the page from being split by anyone.
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.
 
-In case you can't handle compound pages if they're returned by
-follow_page, the FOLL_SPLIT bit can be specified as a parameter to
-follow_page, so that it will split the hugepages before returning
-them.
-
 Graceful fallback
 =
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..3568836841f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2780,7 +2780,6 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_NOWAIT0x20/* if a disk transfer is needed, start the IO
 * and return without waiting upon it */
 #define FOLL_POPULATE  0x40/* fault in page */
-#define FOLL_SPLIT 0x80/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
 #define FOLL_NUMA  0x200   /* force NUMA hinting page fault */
 #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..f3d45a8f18ae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -435,18 +435,6 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
}
}
 
-   if (flags & FOLL_SPLIT && PageTransCompound(page)) {
-   get_page(page);
-   pte_unmap_unlock(ptep, ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (ret)
-   return ERR_PTR(ret);
-   goto retry;
-   }
-
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -591,7 +579,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
-   if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
+   if (flags & FOLL_SPLIT_PMD) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
@@ -600,19 +588,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
-   } else if (flags & FOLL_SPLIT) {
-   if (unlikely(!try_get_page(page))) {
-   spin_unlock(ptl);
-   return ERR_PTR(-ENOMEM);
-   }
-   spin_unlock(ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (pmd_none(*pmd))
-   return no_page_table(vma, flags);
-   } else {  /* flags & FOLL_SPLIT_PMD */
+   } else {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
-- 
2.26.2



Re: [PATCH 4/6] mm: thp: refactor NUMA fault handling

2021-03-30 Thread Yang Shi
On Mon, Mar 29, 2021 at 5:41 PM Huang, Ying  wrote:
>
> Yang Shi  writes:
>
> > When the THP NUMA fault support was added THP migration was not supported 
> > yet.
> > So the ad hoc THP migration was implemented in NUMA fault handling.  Since 
> > v4.14
> > THP migration has been supported so it doesn't make too much sense to still 
> > keep
> > another THP migration implementation rather than using the generic migration
> > code.
> >
> > This patch reworked the NUMA fault handling to use generic migration 
> > implementation
> > to migrate misplaced page.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Restore PMD
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
>
> There's some behavior change between the original implementation and
> your new implementation.  Originally, PMD is restored after trying to
> migrate the misplaced THP.  I think this can reduce the TLB
> shooting-down IPI.

In theory, yes. However, I'm not sure if it is really measurable for a
real-life workload. I would like to make the huge PMD NUMA hinting behave
like its PTE counterpart for now. If your PTE NUMA hinting patch
proves the TLB shooting down optimization is really worth it we could
easily do it for the PMD side too.

>
> Best Regards,
> Huang, Ying
>
> > In the old code anon_vma lock was needed to serialize THP migration
> > against THP split, but since then the THP code has been reworked a lot,
> > it seems anon_vma lock is not required anymore to avoid the race.
> >
> > The page refcount elevation when holding ptl should prevent from THP
> > split.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  include/linux/migrate.h |  23 --
> >  mm/huge_memory.c| 132 --
> >  mm/migrate.c| 173 ++--
> >  3 files changed, 57 insertions(+), 271 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 6abd34986cc5..6c8640e9af4f 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -100,15 +100,10 @@ static inline void __ClearPageMovable(struct page 
> > *page)
> >  #endif
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > -extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> > struct vm_area_struct *vma, int node,
> > bool compound);
> >  #else
> > -static inline bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > - return false;
> > -}
> >  static inline int migrate_misplaced_page(struct page *page,
> >struct vm_area_struct *vma, int node,
> >bool compound)
> > @@ -117,24 +112,6 @@ static inline int migrate_misplaced_page(struct page 
> > *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > - struct vm_area_struct *vma,
> > - pmd_t *pmd, pmd_t entry,
> > - unsigned long address,
> > - struct page *page, int node);
> > -#else
> > -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > - struct vm_area_struct *vma,
> > - pmd_t *pmd, pmd_t entry,
> > - unsigned long address,
> > - struct page *page, int node)
> > -{
> > - return -EAGAIN;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> > -
> > -
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 53f3843ce72a..157c63b0fd95 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1419,94 +1419,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault 
> > *vmf)
> >  {
> >   struct vm_area_struct *vma = vmf->vma;
> >   pmd_t pmd = vmf->orig_pmd;
> > - struct anon_vma *anon_vma = NULL;
> >   struct page *page;
> >   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > - int page_

Re: [PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP

2021-03-30 Thread Yang Shi
On Mon, Mar 29, 2021 at 5:21 PM Huang, Ying  wrote:
>
> Yang Shi  writes:
>
> > In the following patch the migrate_misplaced_page() will be used to migrate 
> > THP
> > for NUMA faul too.  Prepare to deal with THP.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  include/linux/migrate.h | 6 --
> >  mm/memory.c | 2 +-
> >  mm/migrate.c| 2 +-
> >  3 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 3a389633b68f..6abd34986cc5 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -102,14 +102,16 @@ static inline void __ClearPageMovable(struct page 
> > *page)
> >  #ifdef CONFIG_NUMA_BALANCING
> >  extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> > -   struct vm_area_struct *vma, int node);
> > +   struct vm_area_struct *vma, int node,
> > +   bool compound);
> >  #else
> >  static inline bool pmd_trans_migrating(pmd_t pmd)
> >  {
> >   return false;
> >  }
> >  static inline int migrate_misplaced_page(struct page *page,
> > -  struct vm_area_struct *vma, int node)
> > +  struct vm_area_struct *vma, int node,
> > +  bool compound)
> >  {
> >   return -EAGAIN; /* can't migrate now */
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 003bbf3187d4..7fed578bdc31 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4169,7 +4169,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> >   }
> >
> >   /* Migrate to the requested node */
> > - migrated = migrate_misplaced_page(page, vma, target_nid);
> > + migrated = migrate_misplaced_page(page, vma, target_nid, false);
> >   if (migrated) {
> >   page_nid = target_nid;
> >   flags |= TNF_MIGRATED;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 62b81d5257aa..9c4ae5132919 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2127,7 +2127,7 @@ static inline bool is_shared_exec_page(struct 
> > vm_area_struct *vma,
> >   * the page that will be dropped by this function before returning.
> >   */
> >  int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> > -int node)
> > +int node, bool compound)
>
> Can we just use PageCompound(page) instead?

Yes, I think so. The current base page NUMA hinting does bail out
early if a PTE-mapped THP is met. So the THP could only be PMD-mapped
by the time it reaches here.

>
> Best Regards,
> Huang, Ying
>
> >  {
> >   pg_data_t *pgdat = NODE_DATA(node);
> >   int isolated;
>


Re: [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page

2021-03-30 Thread Yang Shi
On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
 wrote:
>
> On Mon, 29 Mar 2021 11:33:11 -0700
> Yang Shi  wrote:
>
> > The old behavior didn't split THP if migration is failed due to lack of
> > memory on the target node.  But the THP migration does split THP, so keep
> > the old behavior for misplaced NUMA page migration.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  mm/migrate.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 86325c750c14..1c0c873375ab 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t 
> > get_new_page,
> >   int swapwrite = current->flags & PF_SWAPWRITE;
> >   int rc, nr_subpages;
> >   LIST_HEAD(ret_pages);
> > + bool nosplit = (reason == MR_NUMA_MISPLACED);
> >
> >   if (!swapwrite)
> >   current->flags |= PF_SWAPWRITE;
> > @@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t 
> > get_new_page,
> >*/
> >   case -ENOSYS:
> >   /* THP migration is unsupported */
> > - if (is_thp) {
> > + if (is_thp && !nosplit) {
>
> This is the "THP migration is unsupported" case, but according to your
> description you rather want to change the -ENOMEM case?
>
> Could this be the correct place to trigger THP split for NUMA balancing,
> for architectures not supporting THP migration, like s390?

Yes, I think it could be as I mentioned in the previous email.

>
> Do I understand it correctly that this change (for -ENOSYS) would
> result in always failed THP migrations during NUMA balancing, if THP
> migration was not supported?

Yes.


Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-03-30 Thread Yang Shi
On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
 wrote:
>
> On Mon, 29 Mar 2021 11:33:06 -0700
> Yang Shi  wrote:
>
> >
> > When the THP NUMA fault support was added THP migration was not supported 
> > yet.
> > So the ad hoc THP migration was implemented in NUMA fault handling.  Since 
> > v4.14
> > THP migration has been supported so it doesn't make too much sense to still 
> > keep
> > another THP migration implementation rather than using the generic migration
> > code.  It is definitely a maintenance burden to keep two THP migration
> > implementation for different code paths and it is more error prone.  Using 
> > the
> > generic THP migration implementation allows us remove the duplicate code and
> > some hacks needed by the old ad hoc implementation.
> >
> > A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP
> > and NUMA balancing.  The most of them support THP migration except for S390.
> > Zi Yan tried to add THP migration support for S390 before but it was not
> > accepted due to the design of S390 PMD.  For the discussion, please see:
> > https://lkml.org/lkml/2018/4/27/953.
> >
> > I'm not expert on S390 so not sure if it is feasible to support THP 
> > migration
> > for S390 or not.  If it is not feasible then the patchset may make THP NUMA
> > balancing not be functional on S390.  Not sure if this is a show stopper 
> > although
> > the patchset does simplify the code a lot.  Anyway it seems worth posting 
> > the
> > series to the mailing list to get some feedback.
>
> The reason why THP migration cannot work on s390 is because the migration
> code will establish swap ptes in a pmd. The pmd layout is very different from
> the pte layout on s390, so you cannot simply write a swap pte into a pmd.
> There are no separate swp primitives for swap/migration pmds, IIRC. And even
> if there were, we'd still need to find some space for a present bit in the
> s390 pmd, and/or possibly move around some other bits.
>
> A lot of things can go wrong here, even if it could be possible in theory,
> by introducing separate swp primitives in common code for pmd entries, along
> with separate offset, type, shift, etc. I don't see that happening in the
> near future.

Thanks a lot for the elaboration. IIUC, implementing migration PMD
entries is *not* prevented by the hardware; it may just be very tricky
to implement, right?

>
> Not sure if this is a show stopper, but I am not familiar enough with
> NUMA and migration code to judge. E.g., I do not see any swp entry action
> in your patches, but I assume this is implicitly triggered by the switch
> to generic THP migration code.

Yes, exactly. The migrate_pages() called by migrate_misplaced_page()
takes care of everything.
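
For reference, a rough sketch of the call chain being described (simplified;
error handling and statistics are omitted, and the details are my reading of
the generic path rather than a quote of the patch):

	LIST_HEAD(migratepages);
	int nr_remaining;

	if (isolate_lru_page(page))
		return 0;	/* could not isolate, give up */

	list_add(&page->lru, &migratepages);
	/* migrate_pages() installs the migration entries, copies the page
	 * and removes the migration entries again. */
	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
				     NULL, node, MIGRATE_ASYNC,
				     MR_NUMA_MISPLACED);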

>
> Could there be a work-around by splitting THP pages instead of marking them
> as migrate pmds (via pte swap entries), at least when THP migration is not
> supported? I guess it could also be acceptable if THP pages were simply not
> migrated for NUMA balancing on s390, but then we might need some extra config
> option to make that behavior explicit.

Yes, it could be. The old migration behavior was to return -ENOMEM if
THP migration is not supported, then split the THP. That behavior was
not very friendly to some use cases, for example, memory policy and
the upcoming migration in lieu of reclaim. But I don't mean we should
restore the old behavior. We could split the THP if migration returns
-ENOSYS and the page is a THP.
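
A minimal sketch of that idea inside migrate_pages(), reusing the existing
try_split_thp() helper and the counters around it (an illustration of the
proposal, not a posted patch):

			case -ENOSYS:
				/* THP migration is unsupported (e.g. s390):
				 * fall back to splitting the THP and retrying
				 * the resulting base pages.
				 */
				if (is_thp) {
					if (!try_split_thp(page, &page2, from)) {
						nr_thp_split++;
						goto retry;
					}
					nr_thp_failed++;
					nr_failed += nr_subpages;
					break;
				}
				nr_failed++;
				break;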

>
> See also my comment on patch #5 of this series.
>
> Regards,
> Gerald


Re: [PATCH] mm: gup: remove FOLL_SPLIT

2021-03-30 Thread Yang Shi
On Tue, Mar 30, 2021 at 12:08 AM John Hubbard  wrote:
>
> On 3/29/21 12:38 PM, Yang Shi wrote:
> > Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of 
> > FOLL_SPLIT")
> > and commit ba925fa35057 ("s390/gmap: improve THP splitting") FOLL_SPLIT
> > has not been used anymore.  Remove the dead code.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >   include/linux/mm.h |  1 -
> >   mm/gup.c   | 28 ++--
> >   2 files changed, 2 insertions(+), 27 deletions(-)
> >
>
> Looks nice.
>
> As long as I'm running git grep here, there is one more search hit that 
> should also
> be fixed up, as part of a "remove FOLL_SPLIT" patch:
>
> git grep -nw FOLL_SPLIT
> Documentation/vm/transhuge.rst:57:follow_page, the FOLL_SPLIT bit can be 
> specified as a parameter to
>
> Reviewed-by: John Hubbard 

Thanks. Removed the reference to FOLL_SPLIT in documentation for v2.

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8ba434287387..3568836841f9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2780,7 +2780,6 @@ struct page *follow_page(struct vm_area_struct *vma, 
> > unsigned long address,
> >   #define FOLL_NOWAIT 0x20/* if a disk transfer is needed, start the IO
> >* and return without waiting upon it */
> >   #define FOLL_POPULATE   0x40/* fault in page */
> > -#define FOLL_SPLIT   0x80/* don't return transhuge pages, split them */
> >   #define FOLL_HWPOISON   0x100   /* check page is hwpoisoned */
> >   #define FOLL_NUMA   0x200   /* force NUMA hinting page fault */
> >   #define FOLL_MIGRATION  0x400   /* wait for page to replace migration 
> > entry */
> > diff --git a/mm/gup.c b/mm/gup.c
> > index e40579624f10..f3d45a8f18ae 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -435,18 +435,6 @@ static struct page *follow_page_pte(struct 
> > vm_area_struct *vma,
> >   }
> >   }
> >
> > - if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> > - get_page(page);
> > - pte_unmap_unlock(ptep, ptl);
> > - lock_page(page);
> > - ret = split_huge_page(page);
> > - unlock_page(page);
> > - put_page(page);
> > - if (ret)
> > - return ERR_PTR(ret);
> > - goto retry;
> > - }
> > -
> >   /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
> >   if (unlikely(!try_grab_page(page, flags))) {
> >   page = ERR_PTR(-ENOMEM);
> > @@ -591,7 +579,7 @@ static struct page *follow_pmd_mask(struct 
> > vm_area_struct *vma,
> >   spin_unlock(ptl);
> >   return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >   }
> > - if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> > + if (flags & FOLL_SPLIT_PMD) {
> >   int ret;
> >   page = pmd_page(*pmd);
> >   if (is_huge_zero_page(page)) {
> > @@ -600,19 +588,7 @@ static struct page *follow_pmd_mask(struct 
> > vm_area_struct *vma,
> >   split_huge_pmd(vma, pmd, address);
> >   if (pmd_trans_unstable(pmd))
> >   ret = -EBUSY;
> > - } else if (flags & FOLL_SPLIT) {
> > - if (unlikely(!try_get_page(page))) {
> > - spin_unlock(ptl);
> > - return ERR_PTR(-ENOMEM);
> > - }
> > - spin_unlock(ptl);
> > - lock_page(page);
> > - ret = split_huge_page(page);
> > - unlock_page(page);
> > - put_page(page);
> > - if (pmd_none(*pmd))
> > - return no_page_table(vma, flags);
> > - } else {  /* flags & FOLL_SPLIT_PMD */
> > + } else {
> >   spin_unlock(ptl);
> >   split_huge_pmd(vma, pmd, address);
> >   ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> >
>


[PATCH] mm: gup: remove FOLL_SPLIT

2021-03-29 Thread Yang Shi
Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
and commit ba925fa35057 ("s390/gmap: improve THP splitting") FOLL_SPLIT
has not been used anymore.  Remove the dead code.

Signed-off-by: Yang Shi 
---
 include/linux/mm.h |  1 -
 mm/gup.c   | 28 ++--
 2 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..3568836841f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2780,7 +2780,6 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_NOWAIT0x20/* if a disk transfer is needed, start the IO
 * and return without waiting upon it */
 #define FOLL_POPULATE  0x40/* fault in page */
-#define FOLL_SPLIT 0x80/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
 #define FOLL_NUMA  0x200   /* force NUMA hinting page fault */
 #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..f3d45a8f18ae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -435,18 +435,6 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
}
}
 
-   if (flags & FOLL_SPLIT && PageTransCompound(page)) {
-   get_page(page);
-   pte_unmap_unlock(ptep, ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (ret)
-   return ERR_PTR(ret);
-   goto retry;
-   }
-
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -591,7 +579,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
spin_unlock(ptl);
	return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
-   if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
+   if (flags & FOLL_SPLIT_PMD) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
@@ -600,19 +588,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
-   } else if (flags & FOLL_SPLIT) {
-   if (unlikely(!try_get_page(page))) {
-   spin_unlock(ptl);
-   return ERR_PTR(-ENOMEM);
-   }
-   spin_unlock(ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (pmd_none(*pmd))
-   return no_page_table(vma, flags);
-   } else {  /* flags & FOLL_SPLIT_PMD */
+   } else {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
-- 
2.26.2



[PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page

2021-03-29 Thread Yang Shi
The old behavior didn't split the THP if migration failed due to lack of
memory on the target node.  But the generic THP migration path does split
the THP, so keep the old behavior for misplaced NUMA page migration.

Signed-off-by: Yang Shi 
---
 mm/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 86325c750c14..1c0c873375ab 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
int swapwrite = current->flags & PF_SWAPWRITE;
int rc, nr_subpages;
LIST_HEAD(ret_pages);
+   bool nosplit = (reason == MR_NUMA_MISPLACED);
 
if (!swapwrite)
current->flags |= PF_SWAPWRITE;
@@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
 */
case -ENOSYS:
/* THP migration is unsupported */
-   if (is_thp) {
+   if (is_thp && !nosplit) {
                        if (!try_split_thp(page, &page2, from)) {
nr_thp_split++;
goto retry;
-- 
2.26.2



[PATCH 4/6] mm: thp: refactor NUMA fault handling

2021-03-29 Thread Yang Shi
When the THP NUMA fault support was added THP migration was not supported yet.
So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
THP migration has been supported so it doesn't make too much sense to still keep
another THP migration implementation rather than using the generic migration
code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate the misplaced page.  There is no functional
change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Restore PMD
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but since then the THP code has been reworked a lot
and it seems the anon_vma lock is not required anymore to avoid the race.

The page refcount elevation while holding the ptl should prevent the THP
from being split.
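
For reference, a simplified sketch of the flow listed above, using the
helpers introduced earlier in this series (locking details, statistics and
error paths are trimmed, so this may differ from the actual patch):

	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
		spin_unlock(vmf->ptl);
		return 0;
	}

	page = pmd_page(pmd);
	page_nid = page_to_nid(page);
	/* numa_migrate_prep() takes a page reference, which also pins the
	 * THP against split once the ptl is dropped. */
	target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);
	if (target_nid == NUMA_NO_NODE) {
		put_page(page);
		goto out_map;	/* placeholder label: restore the PMD under the still-held ptl */
	}
	spin_unlock(vmf->ptl);

	/* isolates the page from the LRU and calls migrate_pages() */
	if (migrate_misplaced_page(page, vma, target_nid, true))
		flags |= TNF_MIGRATED;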

Signed-off-by: Yang Shi 
---
 include/linux/migrate.h |  23 --
 mm/huge_memory.c| 132 --
 mm/migrate.c| 173 ++--
 3 files changed, 57 insertions(+), 271 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6abd34986cc5..6c8640e9af4f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -100,15 +100,10 @@ static inline void __ClearPageMovable(struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
  struct vm_area_struct *vma, int node,
  bool compound);
 #else
-static inline bool pmd_trans_migrating(pmd_t pmd)
-{
-   return false;
-}
 static inline int migrate_misplaced_page(struct page *page,
 struct vm_area_struct *vma, int node,
 bool compound)
@@ -117,24 +112,6 @@ static inline int migrate_misplaced_page(struct page *page,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, pmd_t entry,
-   unsigned long address,
-   struct page *page, int node);
-#else
-static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, pmd_t entry,
-   unsigned long address,
-   struct page *page, int node)
-{
-   return -EAGAIN;
-}
-#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
-
 #ifdef CONFIG_MIGRATION
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 53f3843ce72a..157c63b0fd95 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1419,94 +1419,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
pmd_t pmd = vmf->orig_pmd;
-   struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-   int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
+   int page_nid = NUMA_NO_NODE;
int target_nid, last_cpupid = -1;
-   bool page_locked;
bool migrated = false;
-   bool was_writable;
+   bool was_writable = pmd_savedwrite(pmd);
int flags = 0;
 
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-   if (unlikely(!pmd_same(pmd, *vmf->pmd)))
-   goto out_unlock;
-
-   /*
-* If there are potential migrations, wait for completion and retry
-* without disrupting NUMA hinting information. Do not relock and
-* check_same as the page may no longer be mapped.
-*/
-   if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
-   page = pmd_page(*vmf->pmd);
-   if (!get_page_unless_zero(page))
-   goto out_unlock;
-   spin_unlock(vmf->ptl);
-   put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
-   goto out;
-   }
-
-   page = pmd_page(pmd);
-   BUG_ON(is_huge_zero_page(page));
-   page_nid = page_to_nid(page);
-   last_cpupid = page_cpupid_last(page);
-   count_vm_numa_event(NUMA_HINT_FAULTS);
-   if (page_nid == this_nid) {
-   count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-   flags |= TNF_FAULT_LOCAL;
-   }
-
-   /* See similar comment in do_numa_page for explanation */
-   if (!pmd_savedwrite(pmd))
-   flags |= TNF_NO_GROUP;
-
-   /*
-* Acquire the page lock to serialise THP migrations but avoid drop

[PATCH 6/6] mm: migrate: remove redundant page count check for THP

2021-03-29 Thread Yang Shi
We don't have to keep the redundant page count check for THP anymore after
switching to the generic migration code.

Signed-off-by: Yang Shi 
---
 mm/migrate.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1c0c873375ab..328f76848d6c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2097,18 +2097,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
struct page *page)
if (isolate_lru_page(page))
return 0;
 
-   /*
-* migrate_misplaced_transhuge_page() skips page migration's usual
-* check on page_count(), so we must do it here, now that the page
-* has been isolated: a GUP pin, or any other pin, prevents migration.
-* The expected page count is 3: 1 for page's mapcount and 1 for the
-* caller's pin and 1 for the reference taken by isolate_lru_page().
-*/
-   if (PageTransHuge(page) && page_count(page) != 3) {
-   putback_lru_page(page);
-   return 0;
-   }
-
page_lru = page_is_file_lru(page);
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
thp_nr_pages(page));
-- 
2.26.2



[PATCH 2/6] mm: memory: make numa_migrate_prep() non-static

2021-03-29 Thread Yang Shi
numa_migrate_prep() will be used by the huge NUMA fault path as well in the
following patch, so make it non-static.

Signed-off-by: Yang Shi 
---
 mm/internal.h | 3 +++
 mm/memory.c   | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..5ac525f364e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,7 @@ struct migration_target_control {
gfp_t gfp_mask;
 };
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid, int *flags);
+
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 33be5811ac65..003bbf3187d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4078,9 +4078,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
return ret;
 }
 
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-   unsigned long addr, int page_nid,
-   int *flags)
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid, int *flags)
 {
get_page(page);
 
-- 
2.26.2



[RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault

2021-03-29 Thread Yang Shi


When the THP NUMA fault support was added THP migration was not supported yet.
So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
THP migration has been supported so it doesn't make too much sense to still keep
another THP migration implementation rather than using the generic migration
code.  It is definitely a maintenance burden to keep two THP migration
implementations for different code paths and it is more error prone.  Using the
generic THP migration implementation allows us to remove the duplicate code and
some hacks needed by the old ad hoc implementation.

A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both THP
and NUMA balancing.  Most of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before but it was not
accepted due to the design of S390 PMD.  For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

I'm not an expert on S390 so I'm not sure whether it is feasible to support THP
migration for S390 or not.  If it is not feasible then the patchset may leave
THP NUMA balancing non-functional on S390.  Not sure if this is a show stopper
although the patchset does simplify the code a lot.  Anyway it seems worth
posting the series to the mailing list to get some feedback.

Patch #1 ~ #3 are preparation and clean up patches.
Patch #4 is the real meat.
Patch #5 keep THP not split if migration is failed for NUMA hinting.
Patch #6 removes a hack about page refcount.

I saw there were some hacks around gup in the git history, but I didn't figure
out whether they have been removed or not since I just found the FOLL_NUMA code
in the current gup implementation and it seems useful.


Yang Shi (6):
  mm: memory: add orig_pmd to struct vm_fault
  mm: memory: make numa_migrate_prep() non-static
  mm: migrate: teach migrate_misplaced_page() about THP
  mm: thp: refactor NUMA fault handling
  mm: migrate: don't split THP for misplaced NUMA page
  mm: migrate: remove redundant page count check for THP

 include/linux/huge_mm.h |   9 ++---
 include/linux/migrate.h |  29 ++-
 include/linux/mm.h  |   1 +
 mm/huge_memory.c| 141 
+++---
 mm/internal.h   |   3 ++
 mm/memory.c |  33 -
 mm/migrate.c| 190 
++-
 7 files changed, 94 insertions(+), 312 deletions(-)



[PATCH 1/6] mm: memory: add orig_pmd to struct vm_fault

2021-03-29 Thread Yang Shi
Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge page
fault could be removed, just like its PTE counterpart does.

Signed-off-by: Yang Shi 
---
 include/linux/huge_mm.h |  9 -
 include/linux/mm.h  |  1 +
 mm/huge_memory.c|  9 ++---
 mm/memory.c | 26 +-
 4 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba973efcd369..5650db25a49d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
  struct vm_area_struct *vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
  struct vm_area_struct *vma);
@@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault 
*vmf, pud_t orig_pud)
 }
 #endif
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
   unsigned long addr, pmd_t *pmd,
   unsigned int flags);
@@ -286,7 +286,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 
@@ -432,8 +432,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
 }
 
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf,
-   pmd_t orig_pmd)
+static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..899f55d46fba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -528,6 +528,7 @@ struct vm_fault {
 * the 'address'
 */
pte_t orig_pte; /* Value of PTE at the time of fault */
+   pmd_t orig_pmd; /* Value of PMD at the time of fault */
 
struct page *cow_page;  /* Page handler may use for COW fault */
struct page *page;  /* ->fault handlers should return a
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..53f3843ce72a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1252,11 +1252,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t 
orig_pud)
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
+void huge_pmd_set_accessed(struct vm_fault *vmf)
 {
pmd_t entry;
unsigned long haddr;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+   pmd_t orig_pmd = vmf->orig_pmd;
 
vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
@@ -1273,11 +1274,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t 
orig_pmd)
spin_unlock(vmf->ptl);
 }
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+   pmd_t orig_pmd = vmf->orig_pmd;
 
vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1413,9 +1415,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
+   pmd_t pmd = vmf->orig_pmd;
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..33be5811ac65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4193,12 +4193,12 @@ static inline vm_fault_t create_huge_pmd(struct 
vm_fault *vmf)
 }
 
 /* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
+static inline vm_fau

[PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP

2021-03-29 Thread Yang Shi
In the following patch migrate_misplaced_page() will be used to migrate THPs
for NUMA faults too.  Prepare it to deal with THPs.

Signed-off-by: Yang Shi 
---
 include/linux/migrate.h | 6 --
 mm/memory.c | 2 +-
 mm/migrate.c| 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..6abd34986cc5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -102,14 +102,16 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_area_struct *vma, int node,
+ bool compound);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_area_struct *vma, int node,
+bool compound)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 003bbf3187d4..7fed578bdc31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4169,7 +4169,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vma, target_nid, false);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..9c4ae5132919 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2127,7 +2127,7 @@ static inline bool is_shared_exec_page(struct 
vm_area_struct *vma,
  * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
-  int node)
+  int node, bool compound)
 {
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
-- 
2.26.2



Re: [PATCH v3 1/5] mm/migrate.c: make putback_movable_page() static

2021-03-25 Thread Yang Shi
On Thu, Mar 25, 2021 at 6:16 AM Miaohe Lin  wrote:
>
> The putback_movable_page() is just called by putback_movable_pages() and
> we know the page is locked and both PageMovable() and PageIsolated() is
> checked right before calling putback_movable_page(). So we make it static
> and remove all the 3 VM_BUG_ON_PAGE().

Reviewed-by: Yang Shi 

>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/migrate.h | 1 -
>  mm/migrate.c| 7 +--
>  2 files changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index fdf65f23acec..1d8095069b1c 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -44,7 +44,6 @@ extern int migrate_pages(struct list_head *l, new_page_t 
> new, free_page_t free,
> unsigned long private, enum migrate_mode mode, int reason);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long 
> private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
> -extern void putback_movable_page(struct page *page);
>
>  extern void migrate_prep(void);
>  extern void migrate_prep_local(void);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 47df0df8f21a..61e7f848b554 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -140,15 +140,10 @@ int isolate_movable_page(struct page *page, 
> isolate_mode_t mode)
> return -EBUSY;
>  }
>
> -/* It should be called on page which is PG_movable */
> -void putback_movable_page(struct page *page)
> +static void putback_movable_page(struct page *page)
>  {
> struct address_space *mapping;
>
> -   VM_BUG_ON_PAGE(!PageLocked(page), page);
> -   VM_BUG_ON_PAGE(!PageMovable(page), page);
> -   VM_BUG_ON_PAGE(!PageIsolated(page), page);
> -
> mapping = page_mapping(page);
> mapping->a_ops->putback_page(page);
> __ClearPageIsolated(page);
> --
> 2.19.1
>


Re: [PATCH v2 5/5] mm/migrate.c: fix potential deadlock in NUMA balancing shared exec THP case

2021-03-23 Thread Yang Shi
On Tue, Mar 23, 2021 at 10:17 AM Yang Shi  wrote:
>
> On Tue, Mar 23, 2021 at 6:55 AM Miaohe Lin  wrote:
> >
> > Since commit c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA
> > balancing"), the NUMA balancing would skip shared exec transhuge page.
> > But this enhancement is not suitable for transhuge page. Because it's
> > required that page_mapcount() must be 1 due to no migration pte dance
> > is done here. On the other hand, the shared exec transhuge page will
> > leave the migrate_misplaced_page() with pte entry untouched and page
> > locked. Thus pagefault for NUMA will be triggered again and deadlock
> > occurs when we start waiting for the page lock held by ourselves.
>
> Thanks for catching this. By relooking the code I think the other
> important reason for removing this is
> migrate_misplaced_transhuge_page() actually can't see shared exec file
> THP at all since page_lock_anon_vma_read() is called before and if
> page is not anonymous page it will just restore the PMD without
> migrating anything.
>
> The pages for private mapped file vma may be anonymous pages due to
> COW but they can't be THP so it won't trigger THP numa fault at all. I
> think this is why no bug was reported. I overlooked this in the first
> place.
>
> Your fix is correct, and please add the above justification to your commit 
> log.

BTW, I think you can just undo or revert commit c77c5cbafe54 ("mm:
migrate: skip shared exec THP for NUMA balancing").

Thanks,
Yang

>
> Reviewed-by: Yang Shi 
>
> >
> > Fixes: c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA balancing")
> > Signed-off-by: Miaohe Lin 
> > ---
> >  mm/migrate.c | 4 
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5357a8527ca2..68bfa1625898 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2192,9 +2192,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> > *mm,
> > int page_lru = page_is_file_lru(page);
> > unsigned long start = address & HPAGE_PMD_MASK;
> >
> > -   if (is_shared_exec_page(vma, page))
> > -   goto out;
> > -
> > new_page = alloc_pages_node(node,
> > (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> > HPAGE_PMD_ORDER);
> > @@ -2306,7 +2303,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> > *mm,
> >
> >  out_unlock:
> > unlock_page(page);
> > -out:
> > put_page(page);
> > return 0;
> >  }
> > --
> > 2.19.1
> >


Re: [PATCH v2 1/5] mm/migrate.c: remove unnecessary VM_BUG_ON_PAGE on putback_movable_page()

2021-03-23 Thread Yang Shi
On Tue, Mar 23, 2021 at 6:54 AM Miaohe Lin  wrote:
>
> The !PageLocked() check is implicitly done in PageMovable(). Remove this
> explicit one.

TBH, I'm a little bit reluctant to have this kind of change. If the
"locked" check is necessary we'd better make it explicit, otherwise just
remove it.

And why not just remove all 3 VM_BUG_ON_PAGE()s, since
putback_movable_page() is only called by putback_movable_pages() and we
know the page is locked and both PageMovable() and PageIsolated() are
checked right before calling putback_movable_page()?

And you could also make putback_movable_page() static, as sketched below.
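
A minimal sketch of the suggested end state (assuming the three checks really
are redundant for the only remaining caller):

static void putback_movable_page(struct page *page)
{
	struct address_space *mapping;

	mapping = page_mapping(page);
	mapping->a_ops->putback_page(page);
	__ClearPageIsolated(page);
}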

> Signed-off-by: Miaohe Lin 
> ---
>  mm/migrate.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 47df0df8f21a..facec65c7374 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -145,7 +145,6 @@ void putback_movable_page(struct page *page)
>  {
> struct address_space *mapping;
>
> -   VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(!PageMovable(page), page);
> VM_BUG_ON_PAGE(!PageIsolated(page), page);
>
> --
> 2.19.1
>


Re: [PATCH v2 2/5] mm/migrate.c: remove unnecessary rc != MIGRATEPAGE_SUCCESS check in 'else' case

2021-03-23 Thread Yang Shi
On Tue, Mar 23, 2021 at 6:54 AM Miaohe Lin  wrote:
>
> It's guaranteed that in the 'else' case of the rc == MIGRATEPAGE_SUCCESS
> check, rc does not equal to MIGRATEPAGE_SUCCESS. Remove this unnecessary
> check.

Reviewed-by: Yang Shi 

>
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/migrate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index facec65c7374..97da1fabdf72 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1374,7 +1374,7 @@ static int unmap_and_move_huge_page(new_page_t 
> get_new_page,
>  out:
> if (rc == MIGRATEPAGE_SUCCESS)
> putback_active_hugepage(hpage);
> -   else if (rc != -EAGAIN && rc != MIGRATEPAGE_SUCCESS)
> +   else if (rc != -EAGAIN)
> list_move_tail(&hpage->lru, ret);
>
> /*
> --
> 2.19.1
>


Re: [PATCH v2 5/5] mm/migrate.c: fix potential deadlock in NUMA balancing shared exec THP case

2021-03-23 Thread Yang Shi
On Tue, Mar 23, 2021 at 6:55 AM Miaohe Lin  wrote:
>
> Since commit c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA
> balancing"), the NUMA balancing would skip shared exec transhuge page.
> But this enhancement is not suitable for transhuge page. Because it's
> required that page_mapcount() must be 1 due to no migration pte dance
> is done here. On the other hand, the shared exec transhuge page will
> leave the migrate_misplaced_page() with pte entry untouched and page
> locked. Thus pagefault for NUMA will be triggered again and deadlock
> occurs when we start waiting for the page lock held by ourselves.

Thanks for catching this. By relooking at the code, I think the other
important reason for removing this is that
migrate_misplaced_transhuge_page() actually can't see a shared exec
file THP at all: page_lock_anon_vma_read() is called first, and if the
page is not an anonymous page the fault path just restores the PMD
without migrating anything.

The pages of a privately mapped file VMA may be anonymous pages due to
COW, but they can't be THPs, so they won't trigger a THP NUMA fault at
all. I think this is why no bug was reported. I overlooked this in the
first place.
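
Roughly, the old do_huge_pmd_numa_page() behavior being described looks like
this (simplified from memory, not an exact quote of the current code):

	anon_vma = page_lock_anon_vma_read(page);
	/* ... pmd_same() recheck omitted ... */

	/* Bail if we fail to protect against THP splits for any reason */
	if (unlikely(!anon_vma)) {
		put_page(page);
		goto clear_pmdnuma;	/* make the PMD present again, no migration */
	}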

Your fix is correct, and please add the above justification to your commit log.

Reviewed-by: Yang Shi 

>
> Fixes: c77c5cbafe54 ("mm: migrate: skip shared exec THP for NUMA balancing")
> Signed-off-by: Miaohe Lin 
> ---
>  mm/migrate.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5357a8527ca2..68bfa1625898 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2192,9 +2192,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> *mm,
> int page_lru = page_is_file_lru(page);
> unsigned long start = address & HPAGE_PMD_MASK;
>
> -   if (is_shared_exec_page(vma, page))
> -   goto out;
> -
> new_page = alloc_pages_node(node,
> (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> HPAGE_PMD_ORDER);
> @@ -2306,7 +2303,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> *mm,
>
>  out_unlock:
> unlock_page(page);
> -out:
> put_page(page);
> return 0;
>  }
> --
> 2.19.1
>


Re: [PATCH v5 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-22 Thread Yang Shi
On Sun, Mar 21, 2021 at 7:11 PM Zi Yan  wrote:
>
> On 19 Mar 2021, at 19:37, Yang Shi wrote:
>
> > On Thu, Mar 18, 2021 at 5:52 PM Zi Yan  wrote:
> >>
> >> From: Zi Yan 
> >>
> >> We did not have a direct user interface of splitting the compound page
> >> backing a THP and there is no need unless we want to expose the THP
> >> implementation details to users. Make /split_huge_pages accept
> >> a new command to do that.
> >>
> >> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
> >> /split_huge_pages, THPs within the given virtual address range
> >> from the process with the given pid are split. It is used to test
> >> split_huge_page function. In addition, a selftest program is added to
> >> tools/testing/selftests/vm to utilize the interface by splitting
> >> PMD THPs and PTE-mapped THPs.
> >>
> >> This does not change the old behavior, i.e., writing 1 to the interface
> >> to split all THPs in the system.
> >>
> >> Changelog:
> >>
> >> From v5:
> >> 1. Skipped special VMAs and other fixes. (suggested by Yang Shi)
> >
> > Looks good to me. Reviewed-by: Yang Shi 
> >
> > Some nits below:
> >
> >>
> >> From v4:
> >> 1. Fixed the error code return issue, spotted by kernel test robot
> >>.
> >>
> >> From v3:
> >> 1. Factored out split huge pages in the given pid code to a separate
> >>function.
> >> 2. Added the missing put_page for not split pages.
> >> 3. pr_debug -> pr_info, make reading results simpler.
> >>
> >> From v2:
> >> 1. Reused existing /split_huge_pages interface. (suggested by
> >>Yang Shi)
> >>
> >> From v1:
> >> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
> >>robot .
> >> 2. Dropped the use of find_mm_struct and code it directly, since there
> >>is no need for the permission check in that function and the function
> >>is only available when migration is on.
> >> 3. Added some comments in the selftest program to clarify how PTE-mapped
> >>THPs are formed.
> >>
> >> Signed-off-by: Zi Yan 
> >> ---
> >>  mm/huge_memory.c  | 143 +++-
> >>  tools/testing/selftests/vm/.gitignore |   1 +
> >>  tools/testing/selftests/vm/Makefile   |   1 +
> >>  .../selftests/vm/split_huge_page_test.c   | 318 ++
> >>  4 files changed, 456 insertions(+), 7 deletions(-)
> >>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index bff92dea5ab3..9bf9bc489228 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -7,6 +7,7 @@
> >>
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
> >>  };
> >>
> >>  #ifdef CONFIG_DEBUG_FS
> >> -static int split_huge_pages_set(void *data, u64 val)
> >> +static void split_huge_pages_all(void)
> >>  {
> >> struct zone *zone;
> >> struct page *page;
> >> unsigned long pfn, max_zone_pfn;
> >> unsigned long total = 0, split = 0;
> >>
> >> -   if (val != 1)
> >> -   return -EINVAL;
> >> -
> >> +   pr_info("Split all THPs\n");
> >> for_each_populated_zone(zone) {
> >> max_zone_pfn = zone_end_pfn(zone);
> >> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; 
> >> pfn++) {
> >> @@ -2959,11 +2958,141 @@ static int split_huge_pages_set(void *data, u64 
> >> val)
> >> }
> >>
> >> pr_info("%lu of %lu THP split\n", split, total);
> >> +}
> >>
> >> -   return 0;
> >> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> >> +   unsigned long vaddr_end)
> >> +{
> >> +   int ret = 0;
> >> +   struct task_struct *task;
> >> +   struct mm_struct *mm;
> >> +   unsigned long total = 0, split = 0;
> >> +   unsigned long addr;
> >> +
> >> +   vaddr_start &= PAGE_MASK;
> >> +   vaddr_e

Re: [PATCH v5 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-19 Thread Yang Shi
On Thu, Mar 18, 2021 at 5:52 PM Zi Yan  wrote:
>
> From: Zi Yan 
>
> Further extend /split_huge_pages to accept
> "<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
> tmpfs may have file backed by THP that mapped nowhere.
>
> Update selftest program to test file-backed THP split too.
>
> Suggested-by: Kirill A. Shutemov 
> Signed-off-by: Zi Yan 
> ---
>  mm/huge_memory.c  | 97 ++-
>  .../selftests/vm/split_huge_page_test.c   | 79 ++-
>  2 files changed, 168 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9bf9bc489228..6d6537cc8c56 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3049,12 +3049,74 @@ static int split_huge_pages_pid(int pid, unsigned 
> long vaddr_start,
> return ret;
>  }
>
> +static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
> +   pgoff_t off_end)
> +{
> +   struct filename *file;
> +   struct file *candidate;
> +   struct address_space *mapping;
> +   int ret = -EINVAL;
> +   pgoff_t off_cur;
> +   unsigned long total = 0, split = 0;
> +
> +   file = getname_kernel(file_path);
> +   if (IS_ERR(file))
> +   return ret;
> +
> +   candidate = file_open_name(file, O_RDONLY, 0);
> +   if (IS_ERR(candidate))
> +   goto out;
> +
> +   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 
> 0x%lx]\n",
> +file_path, off_start, off_end);
> +
> +   mapping = candidate->f_mapping;
> +
> +   for (off_cur = off_start; off_cur < off_end;) {
> +   struct page *fpage = pagecache_get_page(mapping, off_cur,
> +   FGP_ENTRY | FGP_HEAD, 0);
> +
> +   if (xa_is_value(fpage) || !fpage) {
> +   off_cur += PAGE_SIZE;
> +   continue;
> +   }
> +
> +   if (!is_transparent_hugepage(fpage)) {
> +   off_cur += PAGE_SIZE;
> +   goto next;
> +   }
> +   total++;
> +   off_cur = fpage->index + thp_size(fpage);
> +
> +   if (!trylock_page(fpage))
> +   goto next;
> +
> +   if (!split_huge_page(fpage))
> +   split++;
> +
> +   unlock_page(fpage);
> +next:
> +   put_page(fpage);
> +   }
> +
> +   filp_close(candidate, NULL);
> +   ret = 0;
> +
> +   pr_info("%lu of %lu file-backed THP split\n", split, total);
> +out:
> +   putname(file);
> +   return ret;
> +}
> +
> +#define MAX_INPUT_BUF_SZ 255

As I mentioned on the first patch, you may want to move this define to
the first patch. I don't think it is necessary to add some code and then
remove it right away in the following patch.

Otherwise the patch looks good to me. Reviewed-by: Yang Shi
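
For context, a hypothetical user-space snippet showing how this extended
interface would be driven (the tmpfs file path and offsets are made up, and
debugfs is assumed to be mounted at /sys/kernel/debug):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

	if (!f)
		return 1;
	/* ask the kernel to split the file-backed THPs in [0x0, 0x200000) */
	fprintf(f, "/mnt/tmpfs/testfile,0x0,0x200000");
	fclose(f);
	return 0;
}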


> +
>  static ssize_t split_huge_pages_write(struct file *file, const char __user 
> *buf,
> size_t count, loff_t *ppops)
>  {
> static DEFINE_MUTEX(split_debug_mutex);
> ssize_t ret;
> -   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
> +   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end 
> */
> +   char input_buf[MAX_INPUT_BUF_SZ];
> int pid;
> unsigned long vaddr_start, vaddr_end;
>
> @@ -3064,11 +3126,40 @@ static ssize_t split_huge_pages_write(struct file 
> *file, const char __user *buf,
>
> ret = -EFAULT;
>
> -   memset(input_buf, 0, 80);
> +   memset(input_buf, 0, MAX_INPUT_BUF_SZ);
> if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
> goto out;
>
> -   input_buf[79] = '\0';
> +   input_buf[MAX_INPUT_BUF_SZ - 1] = '\0';
> +
> +   if (input_buf[0] == '/') {
> +   char *tok;
> +   char *buf = input_buf;
> +   char file_path[MAX_INPUT_BUF_SZ];
> +   pgoff_t off_start = 0, off_end = 0;
> +   size_t input_len = strlen(input_buf);
> +
> +   tok = strsep(&buf, ",");
> +   if (tok) {
> +   strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
> +   } else {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +
> +   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
> +   if (ret != 2) {
> +   pr_info("ret: %ld\n",

Re: [PATCH v5 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-19 Thread Yang Shi
On Thu, Mar 18, 2021 at 5:52 PM Zi Yan  wrote:
>
> From: Zi Yan 
>
> We did not have a direct user interface of splitting the compound page
> backing a THP and there is no need unless we want to expose the THP
> implementation details to users. Make /split_huge_pages accept
> a new command to do that.
>
> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
> /split_huge_pages, THPs within the given virtual address range
> from the process with the given pid are split. It is used to test
> split_huge_page function. In addition, a selftest program is added to
> tools/testing/selftests/vm to utilize the interface by splitting
> PMD THPs and PTE-mapped THPs.
>
> This does not change the old behavior, i.e., writing 1 to the interface
> to split all THPs in the system.
>
> Changelog:
>
> From v5:
> 1. Skipped special VMAs and other fixes. (suggested by Yang Shi)

Looks good to me. Reviewed-by: Yang Shi 

Some nits below:

>
> From v4:
> 1. Fixed the error code return issue, spotted by kernel test robot
>.
>
> From v3:
> 1. Factored out split huge pages in the given pid code to a separate
>function.
> 2. Added the missing put_page for not split pages.
> 3. pr_debug -> pr_info, make reading results simpler.
>
> From v2:
> 1. Reused existing /split_huge_pages interface. (suggested by
>Yang Shi)
>
> From v1:
> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>robot .
> 2. Dropped the use of find_mm_struct and code it directly, since there
>is no need for the permission check in that function and the function
>is only available when migration is on.
> 3. Added some comments in the selftest program to clarify how PTE-mapped
>THPs are formed.
>
> Signed-off-by: Zi Yan 
> ---
>  mm/huge_memory.c  | 143 +++-
>  tools/testing/selftests/vm/.gitignore |   1 +
>  tools/testing/selftests/vm/Makefile   |   1 +
>  .../selftests/vm/split_huge_page_test.c   | 318 ++
>  4 files changed, 456 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bff92dea5ab3..9bf9bc489228 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -7,6 +7,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
>  };
>
>  #ifdef CONFIG_DEBUG_FS
> -static int split_huge_pages_set(void *data, u64 val)
> +static void split_huge_pages_all(void)
>  {
> struct zone *zone;
> struct page *page;
> unsigned long pfn, max_zone_pfn;
> unsigned long total = 0, split = 0;
>
> -   if (val != 1)
> -   return -EINVAL;
> -
> +   pr_info("Split all THPs\n");
> for_each_populated_zone(zone) {
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> @@ -2959,11 +2958,141 @@ static int split_huge_pages_set(void *data, u64 val)
> }
>
> pr_info("%lu of %lu THP split\n", split, total);
> +}
>
> -   return 0;
> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> +   unsigned long vaddr_end)
> +{
> +   int ret = 0;
> +   struct task_struct *task;
> +   struct mm_struct *mm;
> +   unsigned long total = 0, split = 0;
> +   unsigned long addr;
> +
> +   vaddr_start &= PAGE_MASK;
> +   vaddr_end &= PAGE_MASK;
> +
> +   /* Find the task_struct from pid */
> +   rcu_read_lock();
> +   task = find_task_by_vpid(pid);
> +   if (!task) {
> +   rcu_read_unlock();
> +   ret = -ESRCH;
> +   goto out;
> +   }
> +   get_task_struct(task);
> +   rcu_read_unlock();
> +
> +   /* Find the mm_struct */
> +   mm = get_task_mm(task);
> +   put_task_struct(task);
> +
> +   if (!mm) {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +
> +   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
> +pid, vaddr_start, vaddr_end);
> +
> +   mmap_read_lock(mm);
> +   /*
> +* always increase addr by PAGE_SIZE, since we could have a PTE page
> +* table filled with PTE-mapped THPs, each of which is distinct.
> +*/
> +   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
> +   struct vm_area_struct *vma = find_vma(mm, addr);

Re: [PATCH v4 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-17 Thread Yang Shi
On Wed, Mar 17, 2021 at 8:00 AM Zi Yan  wrote:
>
> On 16 Mar 2021, at 19:18, Yang Shi wrote:
>
> > On Mon, Mar 15, 2021 at 1:34 PM Zi Yan  wrote:
> >>
> >> From: Zi Yan 
> >>
> >> Further extend /split_huge_pages to accept
> >> "<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
> >> tmpfs may have file backed by THP that mapped nowhere.
> >>
> >> Update selftest program to test file-backed THP split too.
> >>
> >> Suggested-by: Kirill A. Shutemov 
> >> Signed-off-by: Zi Yan 
> >> ---
> >>  mm/huge_memory.c  | 95 ++-
> >>  .../selftests/vm/split_huge_page_test.c   | 79 ++-
> >>  2 files changed, 166 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 3bfee54e2cd0..da91ee97d944 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3043,12 +3043,72 @@ static int split_huge_pages_pid(int pid, unsigned 
> >> long vaddr_start,
> >> return ret;
> >>  }
> >>
> >> +static int split_huge_pages_in_file(const char *file_path, pgoff_t 
> >> off_start,
> >> +   pgoff_t off_end)
> >> +{
> >> +   struct filename *file;
> >> +   struct file *candidate;
> >> +   struct address_space *mapping;
> >> +   int ret = -EINVAL;
> >> +   pgoff_t off_cur;
> >> +   unsigned long total = 0, split = 0;
> >> +
> >> +   file = getname_kernel(file_path);
> >> +   if (IS_ERR(file))
> >> +   return ret;
> >> +
> >> +   candidate = file_open_name(file, O_RDONLY, 0);
> >> +   if (IS_ERR(candidate))
> >> +   goto out;
> >> +
> >> +   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 
> >> 0x%lx]\n",
> >> +file_path, off_start, off_end);
> >> +
> >> +   mapping = candidate->f_mapping;
> >> +
> >> +   for (off_cur = off_start; off_cur < off_end;) {
> >> +   struct page *fpage = pagecache_get_page(mapping, off_cur,
> >> +   FGP_ENTRY | FGP_HEAD, 0);
> >> +
> >> +   if (xa_is_value(fpage) || !fpage) {
> >
> > Why do you have FGP_ENTRY? It seems it would return page instead of
> > NULL if page is value. So I think you could remove FGP_ENTRY and
> > xa_is_value() check as well.
>
> The comment on FGP_ENTRY says “If there is a shadow/swap/DAX entry, return
> it instead of allocating a new page to replace it”. I do not think we
> want to allocate new pages here. I mostly follow the use of 
> pagecache_get_page()
> in shmem_getpage_gfp without swapin or allocating new pages.

Yes, you are correct. I overlooked that.
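
For anyone following along, the loop in the patch boils down to this pattern:
with FGP_ENTRY, pagecache_get_page() can hand back a shadow/swap/DAX value
entry rather than NULL, so the caller has to filter those out explicitly:

	struct page *fpage = pagecache_get_page(mapping, off_cur,
						FGP_ENTRY | FGP_HEAD, 0);

	if (!fpage || xa_is_value(fpage)) {
		/* hole or value entry: nothing to split here */
		off_cur += PAGE_SIZE;
		continue;
	}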

>
> >
> >> +   off_cur += PAGE_SIZE;
> >> +   continue;
> >> +   }
> >> +
> >> +   if (!is_transparent_hugepage(fpage)) {
> >> +   off_cur += PAGE_SIZE;
> >> +   goto next;
> >> +   }
> >> +   total++;
> >> +   off_cur = fpage->index + thp_size(fpage);
> >> +
> >> +   if (!trylock_page(fpage))
> >> +   goto next;
> >> +
> >> +   if (!split_huge_page(fpage))
> >> +   split++;
> >> +
> >> +   unlock_page(fpage);
> >> +next:
> >> +   put_page(fpage);
> >> +   }
> >> +
> >> +   filp_close(candidate, NULL);
> >> +   ret = 0;
> >> +
> >> +   pr_info("%lu of %lu file-backed THP split\n", split, total);
> >> +out:
> >> +   putname(file);
> >> +   return ret;
> >> +}
> >> +
> >>  static ssize_t split_huge_pages_write(struct file *file, const char 
> >> __user *buf,
> >> size_t count, loff_t *ppops)
> >>  {
> >> static DEFINE_MUTEX(mutex);
> >> ssize_t ret;
> >> -   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
> >> +   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, 
> >> off_end */
> >> +   char input_buf[MAX_INPUT];
> >
> > I d

Re: [PATCH v4 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-16 Thread Yang Shi
On Mon, Mar 15, 2021 at 1:34 PM Zi Yan  wrote:
>
> From: Zi Yan 
>
> Further extend /split_huge_pages to accept
> "<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
> tmpfs may have file backed by THP that mapped nowhere.
>
> Update selftest program to test file-backed THP split too.
>
> Suggested-by: Kirill A. Shutemov 
> Signed-off-by: Zi Yan 
> ---
>  mm/huge_memory.c  | 95 ++-
>  .../selftests/vm/split_huge_page_test.c   | 79 ++-
>  2 files changed, 166 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3bfee54e2cd0..da91ee97d944 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3043,12 +3043,72 @@ static int split_huge_pages_pid(int pid, unsigned 
> long vaddr_start,
> return ret;
>  }
>
> +static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
> +   pgoff_t off_end)
> +{
> +   struct filename *file;
> +   struct file *candidate;
> +   struct address_space *mapping;
> +   int ret = -EINVAL;
> +   pgoff_t off_cur;
> +   unsigned long total = 0, split = 0;
> +
> +   file = getname_kernel(file_path);
> +   if (IS_ERR(file))
> +   return ret;
> +
> +   candidate = file_open_name(file, O_RDONLY, 0);
> +   if (IS_ERR(candidate))
> +   goto out;
> +
> +   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 
> 0x%lx]\n",
> +file_path, off_start, off_end);
> +
> +   mapping = candidate->f_mapping;
> +
> +   for (off_cur = off_start; off_cur < off_end;) {
> +   struct page *fpage = pagecache_get_page(mapping, off_cur,
> +   FGP_ENTRY | FGP_HEAD, 0);
> +
> +   if (xa_is_value(fpage) || !fpage) {

Why do you have FGP_ENTRY? It seems it would return page instead of
NULL if page is value. So I think you could remove FGP_ENTRY and
xa_is_value() check as well.


> +   off_cur += PAGE_SIZE;
> +   continue;
> +   }
> +
> +   if (!is_transparent_hugepage(fpage)) {
> +   off_cur += PAGE_SIZE;
> +   goto next;
> +   }
> +   total++;
> +   off_cur = fpage->index + thp_size(fpage);
> +
> +   if (!trylock_page(fpage))
> +   goto next;
> +
> +   if (!split_huge_page(fpage))
> +   split++;
> +
> +   unlock_page(fpage);
> +next:
> +   put_page(fpage);
> +   }
> +
> +   filp_close(candidate, NULL);
> +   ret = 0;
> +
> +   pr_info("%lu of %lu file-backed THP split\n", split, total);
> +out:
> +   putname(file);
> +   return ret;
> +}
> +
>  static ssize_t split_huge_pages_write(struct file *file, const char __user 
> *buf,
> size_t count, loff_t *ppops)
>  {
> static DEFINE_MUTEX(mutex);
> ssize_t ret;
> -   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
> +   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end 
> */
> +   char input_buf[MAX_INPUT];

I didn't find where MAX_INPUT is defined in your patch. I just saw that
include/uapi/linux/limits.h has it defined. Is that the one you really
refer to?

> int pid;
> unsigned long vaddr_start, vaddr_end;
>
> @@ -3058,11 +3118,40 @@ static ssize_t split_huge_pages_write(struct file 
> *file, const char __user *buf,
>
> ret = -EFAULT;
>
> -   memset(input_buf, 0, 80);
> +   memset(input_buf, 0, MAX_INPUT);
> if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
> goto out;
>
> -   input_buf[79] = '\0';
> +   input_buf[MAX_INPUT - 1] = '\0';
> +
> +   if (input_buf[0] == '/') {
> +   char *tok;
> +   char *buf = input_buf;
> +   char file_path[MAX_INPUT];
> +   pgoff_t off_start = 0, off_end = 0;
> +   size_t input_len = strlen(input_buf);
> +
> +   tok = strsep(&buf, ",");
> +   if (tok) {
> +   strncpy(file_path, tok, MAX_INPUT);
> +   } else {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +
> +   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
> +   if (ret != 2) {
> +   pr_info("ret: %ld\n", ret);
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +   ret = split_huge_pages_in_file(file_path, off_start, off_end);
> +   if (!ret)
> +   ret = input_len;
> +
> +   goto out;
> +   }
> +
> ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
> if (ret == 1 && pid == 1) {
> split_huge_pages_all();
> diff 

Re: [PATCH v4 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-16 Thread Yang Shi
On Mon, Mar 15, 2021 at 1:34 PM Zi Yan  wrote:
>
> From: Zi Yan 
>
> We did not have a direct user interface of splitting the compound page
> backing a THP and there is no need unless we want to expose the THP
> implementation details to users. Make /split_huge_pages accept
> a new command to do that.
>
> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
> /split_huge_pages, THPs within the given virtual address range
> from the process with the given pid are split. It is used to test
> split_huge_page function. In addition, a selftest program is added to
> tools/testing/selftests/vm to utilize the interface by splitting
> PMD THPs and PTE-mapped THPs.
>
> This does not change the old behavior, i.e., writing 1 to the interface
> to split all THPs in the system.
>
> Changelog:
>
> From v3:
> 1. Factored out split huge pages in the given pid code to a separate
>function.
> 2. Added the missing put_page for not split pages.
> 3. pr_debug -> pr_info, make reading results simpler.
>
> From v2:
>
> 1. Reused existing /split_huge_pages interface. (suggested by
>Yang Shi)
>
> From v1:
>
> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>robot .
> 2. Dropped the use of find_mm_struct and code it directly, since there
>is no need for the permission check in that function and the function
>is only available when migration is on.
> 3. Added some comments in the selftest program to clarify how PTE-mapped
>THPs are formed.
>
> Signed-off-by: Zi Yan 
> ---
>  mm/huge_memory.c  | 136 +++-
>  tools/testing/selftests/vm/.gitignore |   1 +
>  tools/testing/selftests/vm/Makefile   |   1 +
>  .../selftests/vm/split_huge_page_test.c   | 313 ++
>  4 files changed, 444 insertions(+), 7 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bff92dea5ab3..3bfee54e2cd0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -7,6 +7,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
>  };
>
>  #ifdef CONFIG_DEBUG_FS
> -static int split_huge_pages_set(void *data, u64 val)
> +static void split_huge_pages_all(void)
>  {
> struct zone *zone;
> struct page *page;
> unsigned long pfn, max_zone_pfn;
> unsigned long total = 0, split = 0;
>
> -   if (val != 1)
> -   return -EINVAL;
> -
> +   pr_info("Split all THPs\n");
> for_each_populated_zone(zone) {
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> @@ -2959,11 +2958,134 @@ static int split_huge_pages_set(void *data, u64 val)
> }
>
> pr_info("%lu of %lu THP split\n", split, total);
> +}
>
> -   return 0;
> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> +   unsigned long vaddr_end)
> +{
> +   int ret = 0;
> +   struct task_struct *task;
> +   struct mm_struct *mm;
> +   unsigned long total = 0, split = 0;
> +   unsigned long addr;
> +
> +   vaddr_start &= PAGE_MASK;
> +   vaddr_end &= PAGE_MASK;
> +
> +   /* Find the task_struct from pid */
> +   rcu_read_lock();
> +   task = find_task_by_vpid(pid);
> +   if (!task) {
> +   rcu_read_unlock();
> +   ret = -ESRCH;
> +   goto out;
> +   }
> +   get_task_struct(task);
> +   rcu_read_unlock();
> +
> +   /* Find the mm_struct */
> +   mm = get_task_mm(task);
> +   put_task_struct(task);
> +
> +   if (!mm) {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +
> +   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
> +pid, vaddr_start, vaddr_end);
> +
> +   mmap_read_lock(mm);
> +   /*
> +* always increase addr by PAGE_SIZE, since we could have a PTE page
> +* table filled with PTE-mapped THPs, each of which is distinct.
> +*/
> +   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
> +   struct vm_area_struct *vma = find_vma(mm, addr);
> +   unsigned int follflags;
> +   struct page *page;
> +
> +   if (!vma || addr < vma->vm_start)
> +   break;

I think you could skip some special VMAs, e.g. VM_HUGETLB ones.
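
Something like the below (an untested sketch, to go right after the "!vma"
check above; the "- PAGE_SIZE" is only there because the loop adds PAGE_SIZE
back, and split_huge_page() doesn't handle hugetlb pages anyway):

    /* skip hugetlbfs and other mappings we can't split */
    if (unlikely(vma->vm_flags & VM_HUGETLB)) {
            addr = vma->vm_end - PAGE_SIZE;
            continue;
    }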

Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-15 Thread Yang Shi
On Mon, Mar 15, 2021 at 12:15 PM Roman Gushchin  wrote:
>
>
> On Mon, Mar 15, 2021 at 07:49:57PM +0100, Vlastimil Babka wrote:
> > On 3/9/21 4:25 PM, Xunlei Pang wrote:
> > > count_partial() can hold n->list_lock spinlock for quite long, which
> > > makes much trouble to the system. This series eliminate this problem.
> >
> > Before I check the details, I have two high-level comments:
> >
> > - patch 1 introduces some counting scheme that patch 4 then changes, could 
> > we do
> > this in one step to avoid the churn?
> >
> > - the series addresses the concern that spinlock is being held, but doesn't
> > address the fact that counting partial per-node slabs is not nearly enough 
> > if we
> > want accurate  in /proc/slabinfo because there are also percpu
> > slabs and per-cpu partial slabs, where we don't track the free objects at 
> > all.
> > So after this series while the readers of /proc/slabinfo won't block the
> > spinlock, they will get the same garbage data as before. So Christoph is not
> > wrong to say that we can just report active_objs == num_objs and it won't
> > actually break any ABI.
> > At the same time somebody might actually want accurate object statistics at 
> > the
> > expense of peak performance, and it would be nice to give them such option 
> > in
> > SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> > although that option provides many additional tuning stats, with additional
> > overhead.
> > So my proposal would be a new config for "accurate active objects" (or just 
> > tie
> > it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters 
> > in
> > patch 4 to all alloc/free, so that it includes percpu slabs. Without this 
> > config
> > enabled, let's just report active_objs == num_objs.
>
> It sounds really good to me! The only thing, I'd avoid introducing a new 
> option
> and use CONFIG_SLUB_STATS instead.
>
> It seems like CONFIG_SLUB_DEBUG is a more popular option than 
> CONFIG_SLUB_STATS.
> CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is 
> off.
> I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.

I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
it is not supposed to incur too much overhead unless a specific debug
option (e.g. red_zone) is turned on on demand.

>
> Thanks!
>
> >
> > Vlastimil
> >
> > > v1->v2:
> > > - Improved changelog and variable naming for PATCH 1~2.
> > > - PATCH3 adds per-cpu counter to avoid performance regression
> > >   in concurrent __slab_free().
> > >
> > > v2->v3:
> > > - Changed "page->inuse" to the safe "new.inuse", etc.
> > > - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> > > - atomic_long_t -> unsigned long
> > >
> > > [Testing]
> > > There seems might be a little performance impact under extreme
> > > __slab_free() concurrent calls according to my tests.
> > >
> > > On my 32-cpu 2-socket physical machine:
> > > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > >
> > > 1) perf stat --null --repeat 10 -- hackbench 20 thread 2
> > >
> > > == original, no patched
> > > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > >
> > >   24.536050899 seconds time elapsed   
> > >( +-  0.24% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > >
> > >   24.588049142 seconds time elapsed   
> > >( +-  0.35% )
> > >
> > >
> > > == patched with patch1~4
> > > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > >
> > >   24.670892273 seconds time elapsed   
> > >( +-  0.29% )
> > >
> > >
> > > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > >
> > >   24.746755689 seconds time elapsed   
> > >( +-  0.21% )
> > >
> > >
> > > 2) perf stat --null --repeat 10 -- hackbench 32 thread 2
> > >
> > > == original, no patched
> > >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > >
> > >   39.784911855 seconds time elapsed   
> > >( +-  0.14% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > >
> > >   39.868687608 seconds time elapsed   
> > >( +-  0.19% )
> > >
> > > == patched with patch1~4
> > >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > >
> > >   39.681273015 seconds time elapsed   
> > >( +-  0.21% )
> > >
> > >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > >
> > >   39.681238459 seconds time elapsed   
> > >( +-  0.09% )
> > >
> > >
> > > Xunlei Pang (4):
> > >   mm/slub: Introduce two counters for partial objects
> > >   mm/slub: Get rid of count_partial()

Re: [PATCH v3] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-15 Thread Yang Shi
On Mon, Mar 15, 2021 at 11:37 AM Zi Yan  wrote:
>
> On 15 Mar 2021, at 8:07, Kirill A. Shutemov wrote:
>
> > On Thu, Mar 11, 2021 at 07:57:12PM -0500, Zi Yan wrote:
> >> From: Zi Yan 
> >>
> >> We do not have a direct user interface of splitting the compound page
> >> backing a THP
> >
> > But we do. You expand it.
> >
> >> and there is no need unless we want to expose the THP
> >> implementation details to users. Make /split_huge_pages accept
> >> a new command to do that.
> >>
> >> By writing ",," to
> >> /split_huge_pages, THPs within the given virtual address range
> >> from the process with the given pid are split. It is used to test
> >> split_huge_page function. In addition, a selftest program is added to
> >> tools/testing/selftests/vm to utilize the interface by splitting
> >> PMD THPs and PTE-mapped THPs.
> >>
> >
> > Okay, makes sense.
> >
> > But it doesn't cover non-mapped THPs. tmpfs may have file backed by THP
> > that mapped nowhere. Do we want to cover this case too?
>
> Sure. It would be useful when large page in page cache too. I will send
> v4 with tmpfs THP split. I will definitely need a review for it, since
> I am not familiar with getting a page from a file path.

We do have some APIs to return pages for a file range, i.e.

find_get_page
find_get_pages
find_get_entries
find_get_pages_range

They all need address_space, so you need to convert file path to
address_space before using them.

The hole punch of tmpfs uses find_get_entries(), just check what
shmem_undo_range() does.
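
In case it helps, a rough sketch of the file case (untested, error handling
and corner cases omitted; note split_huge_page() needs the page locked, and
file_path/off_start/off_end are whatever you parse from the input):

    struct file *file = filp_open(file_path, O_RDONLY, 0);
    struct address_space *mapping;
    struct page *page;
    pgoff_t index;

    if (IS_ERR(file))
            return PTR_ERR(file);
    mapping = file->f_mapping;

    for (index = off_start; index < off_end; index++) {
            page = find_get_page(mapping, index);
            if (!page)
                    continue;
            page = compound_head(page);
            if (PageTransHuge(page) && trylock_page(page)) {
                    split_huge_page(page);
                    unlock_page(page);
            }
            put_page(page);
    }

    filp_close(file, NULL);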

>
> > Maybe have PID:,, and
> > FILE:,, ?
>
> Or just check input[0] == ‘/‘ for file path input.
>
>
> —
> Best Regards,
> Yan Zi


Re: [PATCH v1 00/14] Multigenerational LRU

2021-03-15 Thread Yang Shi
On Fri, Mar 12, 2021 at 11:57 PM Yu Zhao  wrote:
>
> TLDR
> 
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augment.
>
> Repo
> 
> git fetch https://linux-mm.googlesource.com/page-reclaim 
> refs/changes/01/1101/1
>
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
>
> Background
> ==
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
>
> Problems
> 
> Notion of the active/inactive
> -
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. And false active/inactive rates are relatively high. In
> addition, scans of largely varying numbers of pages are unpredictable
> because inactive_is_low() is based on magic numbers.
>
> For phones and laptops, the eviction is biased toward file pages
> because the selection has to resort to heuristics as direct
> comparisons between anon and file types are infeasible. On Android and
> Chrome OS, executable pages are frequently evicted despite the fact
> that there are many less recently used anon pages. This causes "janks"
> (slow UI rendering) and negatively impacts user experience.
>
> For systems with multiple nodes and/or memcgs, it is impossible to
> compare lruvecs based on the notion of the active/inactive.
>
> Incremental scans via the rmap
> --
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.

So, you mean the reclaim efficiency is just 1/7? That seems quite low.
Just out of curiosity, do you have more insight into why it is that
low? I think it heavily depends on the workload. We have page cache heavy
workloads where the efficiency is quite high.

>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>4.63%  do_raw_spin_lock
>3.89%  vma_interval_tree_iter_next
>3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>7.61%  do_raw_spin_lock
>5.69%  vma_interval_tree_iter_next
>4.91%  vma_interval_tree_subtree_search
>3.71%  page_referenced_one

I guess it is because your workloads have a lot of shared anon pages?

>
> Solutions
> =
> Notion of generation numbers
> 
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
>
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.

Does it mean you replace the active/inactive lists with multiple lists, from
most active to least active?

>
> Differential scans via page tables
> --
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.

How about unmapped page cache? I think it is still quite common
for a lot of workloads.

>
> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 

Re: [PATCH v2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-11 Thread Yang Shi
On Thu, Mar 11, 2021 at 7:52 AM Zi Yan  wrote:
>
> On 10 Mar 2021, at 20:12, Yang Shi wrote:
>
> > On Wed, Mar 10, 2021 at 7:36 AM Zi Yan  wrote:
> >>
> >> From: Zi Yan 
> >>
> >> We do not have a direct user interface of splitting the compound page
> >> backing a THP and there is no need unless we want to expose the THP
> >> implementation details to users. Adding an interface for debugging.
> >>
> >> By writing ",," to
> >> /split_huge_pages_in_range_pid, THPs within the given virtual
> >
> > Can we reuse the existing split_huge_page knob instead of creating a new 
> > one?
> >
> > Two knobs for splitting huge pages for debugging purposes seem like
> > overkill to me IMHO. I'm wondering if we could check if a special
> > value (e.g. 1 or -1) is written then split all THPs as split_huge_page
> > knob does?
> >
> > I don't think this interface is used widely so the risk should be very
> > low for breaking userspace.
>
> Thanks for the suggestion.
>
> I prefer a separate interface to keep input handling simpler. I am also
> planning to enhance this interface later to enable splitting huge pages
> to any lower order when Matthew Wilcox’s large page in page cache gets in,
> so it is better to keep it separate from existing split_huge_pages.

The input handling doesn't seem that hard. You might be able to do something
like the below (sscanf() returns the number of converted fields, so ret tells
the cases apart):

ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start,
             &vaddr_end, &order);
switch (ret) {
case 1:
        split_all_thps();
        break;
case 3:
        split_thp_for_pid(pid, vaddr_start, vaddr_end);
        break;
case 4:
        split_thp_for_pid_to_order(pid, vaddr_start, vaddr_end, order);
        break;
default:
        return -EINVAL;
}

Will it work for you?

>
> —
> Best Regards,
> Yan Zi


[v10 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-11 Thread Yang Shi
The number of deferred objects might wind up to an absurd number, and it
results in clamping of slab objects.  It is undesirable for sustaining the
working set.

So shrink deferred objects proportionally to priority and cap nr_deferred to
twice the number of cache items.
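
A quick example with made-up numbers: with freeable = 100,000, nr (the old
deferred count) = 1,000,000 and priority = DEF_PRIORITY (12), the deferred
contribution to total_scan becomes 1,000,000 >> 12 = 244 instead of the full
1,000,000, total_scan is clamped to at most 2 * freeable = 200,000, and
next_deferred is likewise capped at 200,000 regardless of how much new work
gets deferred.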

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/

Tested with kernel build and vfs metadata heavy workload in our production
environment, no regression is spotted so far.

Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 46 +++---
 1 file changed, 11 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d0791ebd6761..163616e78a4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -664,7 +664,6 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
 */
nr = xchg_nr_deferred(shrinker, shrinkctl);
 
-   total_scan = nr;
if (shrinker->seeks) {
delta = freeable >> priority;
delta *= 4;
@@ -678,37 +677,9 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
delta = freeable / 2;
}
 
+   total_scan = nr >> priority;
total_scan += delta;
-   if (total_scan < 0) {
-   pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-  shrinker->scan_objects, total_scan);
-   total_scan = freeable;
-   next_deferred = nr;
-   } else
-   next_deferred = total_scan;
-
-   /*
-* We need to avoid excessive windup on filesystem shrinkers
-* due to large numbers of GFP_NOFS allocations causing the
-* shrinkers to return -1 all the time. This results in a large
-* nr being built up so when a shrink that can do some work
-* comes along it empties the entire cache due to nr >>>
-* freeable. This is bad for sustaining a working set in
-* memory.
-*
-* Hence only allow the shrinker to scan the entire cache when
-* a large delta change is calculated directly.
-*/
-   if (delta < freeable / 4)
-   total_scan = min(total_scan, freeable / 2);
-
-   /*
-* Avoid risking looping forever due to too large nr value:
-* never try to free more than twice the estimate number of
-* freeable entries.
-*/
-   if (total_scan > freeable * 2)
-   total_scan = freeable * 2;
+   total_scan = min(total_scan, (2 * freeable));
 
trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
   freeable, delta, total_scan, priority);
@@ -747,10 +718,15 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
cond_resched();
}
 
-   if (next_deferred >= scanned)
-   next_deferred -= scanned;
-   else
-   next_deferred = 0;
+   /*
+* The deferred work is increased by any new work (delta) that wasn't
+* done, decreased by old deferred work that was done now.
+*
+* And it is capped to two times of the freeable items.
+*/
+   next_deferred = max_t(long, (nr + delta - scanned), 0);
+   next_deferred = min(next_deferred, (2 * freeable));
+
/*
 * move the unused scan count back into the shrinker in a
 * manner that handles concurrent updates.
-- 
2.26.2



[v10 PATCH 12/13] mm: memcontrol: reparent nr_deferred when memcg offline

2021-03-11 Thread Yang Shi
Now the shrinker's nr_deferred is per memcg for memcg aware shrinkers, so add
it to the parent's corresponding nr_deferred when the memcg goes offline.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c|  1 +
 mm/vmscan.c| 24 
 3 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 24e735434a46..4064c9dda534 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1537,6 +1537,7 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
+void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #else
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35d44afdd9fc..a945dfc85156 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5167,6 +5167,7 @@ static void mem_cgroup_css_offline(struct 
cgroup_subsys_state *css)
page_counter_set_low(>memory, 0);
 
memcg_offline_kmem(memcg);
+   reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
 
drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 324c34c6e5cf..d0791ebd6761 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -397,6 +397,30 @@ static long add_nr_deferred_memcg(long nr, int nid, struct 
shrinker *shrinker,
return atomic_long_add_return(nr, >nr_deferred[shrinker->id]);
 }
 
+void reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+   int i, nid;
+   long nr;
+   struct mem_cgroup *parent;
+   struct shrinker_info *child_info, *parent_info;
+
+   parent = parent_mem_cgroup(memcg);
+   if (!parent)
+   parent = root_mem_cgroup;
+
+   /* Prevent from concurrent shrinker_info expand */
+   down_read(_rwsem);
+   for_each_node(nid) {
+   child_info = shrinker_info_protected(memcg, nid);
+   parent_info = shrinker_info_protected(parent, nid);
+   for (i = 0; i < shrinker_nr_max; i++) {
+   nr = atomic_long_read(_info->nr_deferred[i]);
+   atomic_long_add(nr, _info->nr_deferred[i]);
+   }
+   }
+   up_read(_rwsem);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return sc->target_mem_cgroup;
-- 
2.26.2



[v10 PATCH 11/13] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

2021-03-11 Thread Yang Shi
Now nr_deferred is available at the per memcg level for memcg aware shrinkers,
so we don't need to allocate shrinker->nr_deferred for such shrinkers anymore.

The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or memcg is
disabled by the kernel command line, then the shrinker's SHRINKER_MEMCG_AWARE
flag would be cleared.  This makes the implementation of this patch simpler.

Acked-by: Vlastimil Babka 
Reviewed-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5bc6975cb635..324c34c6e5cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -346,6 +346,9 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
 {
int id, ret = -ENOMEM;
 
+   if (mem_cgroup_disabled())
+   return -ENOSYS;
+
down_write(_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(_idr, shrinker, 0, 0, GFP_KERNEL);
@@ -425,7 +428,7 @@ static bool writeback_throttling_sane(struct scan_control 
*sc)
 #else
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
-   return 0;
+   return -ENOSYS;
 }
 
 static void unregister_memcg_shrinker(struct shrinker *shrinker)
@@ -537,8 +540,18 @@ static unsigned long lruvec_lru_size(struct lruvec 
*lruvec, enum lru_list lru,
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-   unsigned int size = sizeof(*shrinker->nr_deferred);
+   unsigned int size;
+   int err;
+
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+   err = prealloc_memcg_shrinker(shrinker);
+   if (err != -ENOSYS)
+   return err;
 
+   shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+   }
+
+   size = sizeof(*shrinker->nr_deferred);
if (shrinker->flags & SHRINKER_NUMA_AWARE)
size *= nr_node_ids;
 
@@ -546,28 +559,16 @@ int prealloc_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
 
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-   if (prealloc_memcg_shrinker(shrinker))
-   goto free_deferred;
-   }
-
return 0;
-
-free_deferred:
-   kfree(shrinker->nr_deferred);
-   shrinker->nr_deferred = NULL;
-   return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-   if (!shrinker->nr_deferred)
-   return;
-
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
down_write(_rwsem);
unregister_memcg_shrinker(shrinker);
up_write(_rwsem);
+   return;
}
 
kfree(shrinker->nr_deferred);
-- 
2.26.2



[v10 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

2021-03-11 Thread Yang Shi
Currently the number of deferred objects is per shrinker, but some slabs, for
example the vfs inode/dentry caches, are per memcg; this results in poor
isolation among memcgs.

The deferred objects are typically generated by __GFP_NOFS allocations; one
memcg with excessive __GFP_NOFS allocations may blow up deferred objects, and
then other innocent memcgs may suffer from over-shrinking, excessive reclaim
latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the
workload in B is a vfs heavy workload.  The workload in A generates excessive
deferred objects, then B's vfs cache might be hit heavily (dropping half of its
caches) by B's limit reclaim or global reclaim.

We observed this in our production environment, which was running a vfs heavy
workload, as shown in the below tracing log:

<...>-409454 [016]  28286961.747146: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags 
GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022]  28287105.928018: mm_shrink_slab_end: 
super_cache_scan+0x0/0x1a0 9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 
602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the
caches were dropped.  This also resulted in a significant amount of page cache
being dropped due to inode eviction.

Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness
and brings better isolation.

The following patch will add nr_deferred to parent memcg when memcg offline.
To preserve nr_deferred when reparenting memcgs to root, root memcg needs
shrinker_info allocated too.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's
nr_deferred would be used.  And non memcg aware shrinkers use shrinker's
nr_deferred all the time.
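
For reference, my reading of the resulting per-node layout (illustrative only,
see expand_one_shrinker_info() below):

    /*
     * One allocation per memcg per node:
     *
     *   | struct shrinker_info | nr_deferred[] (atomic_long_t) | map bitmap |
     *
     * info->nr_deferred = (atomic_long_t *)(info + 1);
     * info->map = (void *)info->nr_deferred + defer_size;
     */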

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  7 +++--
 mm/vmscan.c| 60 ++
 2 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc7d0e2cb3ad..24e735434a46 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -114,12 +114,13 @@ struct batched_lruvec_stat {
 };
 
 /*
- * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
- * which have elements charged to this memcg.
+ * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
+ * shrinkers, which have elements charged to this memcg.
  */
 struct shrinker_info {
struct rcu_head rcu;
-   unsigned long map[];
+   atomic_long_t *nr_deferred;
+   unsigned long *map;
 };
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34cf3d84309c..397f3b67bad8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
 
+/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
 static inline int shrinker_map_size(int nr_items)
 {
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
+static inline int shrinker_defer_size(int nr_items)
+{
+   return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
+}
+
 static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 int nid)
 {
@@ -200,11 +206,13 @@ static struct shrinker_info 
*shrinker_info_protected(struct mem_cgroup *memcg,
 }
 
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
-   int size, int old_size)
+   int map_size, int defer_size,
+   int old_map_size, int old_defer_size)
 {
struct shrinker_info *new, *old;
struct mem_cgroup_per_node *pn;
int nid;
+   int size = map_size + defer_size;
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
@@ -217,9 +225,16 @@ static int expand_one_shrinker_info(struct mem_cgroup 
*memcg,
if (!new)
return -ENOMEM;
 
-   /* Set all old bits, clear all new bits */
-   memset(new->map, (int)0xff, old_size);
-   memset((void *)new->map + old_size, 0, size - old_size);
+   new->nr_deferred = (atomic_long_t *)(new + 1);
+   new->map = (void *)new->nr_deferred + defer_size;
+
+   /* map: set all old bits, clear all new bits */
+   memset(new->map, (int)0xff, old_map_size);
+   memset((void *)new->map + old_map_size, 0, map_size - 
old_map_size);
+   /* nr_deferred: copy old values, clear all new values */
+   memcpy(new->nr_def

[v10 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

2021-03-11 Thread Yang Shi
Use the per-memcg nr_deferred for memcg aware shrinkers.  The shrinker's own
nr_deferred will be used in the following cases:
1. Non memcg aware shrinkers
2. !CONFIG_MEMCG
3. memcg is disabled by boot parameter

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 78 -
 1 file changed, 66 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 397f3b67bad8..5bc6975cb635 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -376,6 +376,24 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
idr_remove(_idr, id);
 }
 
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+  struct mem_cgroup *memcg)
+{
+   struct shrinker_info *info;
+
+   info = shrinker_info_protected(memcg, nid);
+   return atomic_long_xchg(>nr_deferred[shrinker->id], 0);
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+   struct shrinker_info *info;
+
+   info = shrinker_info_protected(memcg, nid);
+   return atomic_long_add_return(nr, >nr_deferred[shrinker->id]);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return sc->target_mem_cgroup;
@@ -414,6 +432,18 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
 {
 }
 
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+  struct mem_cgroup *memcg)
+{
+   return 0;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+   return 0;
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return false;
@@ -425,6 +455,39 @@ static bool writeback_throttling_sane(struct scan_control 
*sc)
 }
 #endif
 
+static long xchg_nr_deferred(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   int nid = sc->nid;
+
+   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+   nid = 0;
+
+   if (sc->memcg &&
+   (shrinker->flags & SHRINKER_MEMCG_AWARE))
+   return xchg_nr_deferred_memcg(nid, shrinker,
+ sc->memcg);
+
+   return atomic_long_xchg(>nr_deferred[nid], 0);
+}
+
+
+static long add_nr_deferred(long nr, struct shrinker *shrinker,
+   struct shrink_control *sc)
+{
+   int nid = sc->nid;
+
+   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+   nid = 0;
+
+   if (sc->memcg &&
+   (shrinker->flags & SHRINKER_MEMCG_AWARE))
+   return add_nr_deferred_memcg(nr, nid, shrinker,
+sc->memcg);
+
+   return atomic_long_add_return(nr, >nr_deferred[nid]);
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -561,14 +624,10 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
long freeable;
long nr;
long new_nr;
-   int nid = shrinkctl->nid;
long batch_size = shrinker->batch ? shrinker->batch
  : SHRINK_BATCH;
long scanned = 0, next_deferred;
 
-   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-   nid = 0;
-
freeable = shrinker->count_objects(shrinker, shrinkctl);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
@@ -578,7 +637,7 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
 * and zero it so that other concurrent shrinker invocations
 * don't also do this scanning work.
 */
-   nr = atomic_long_xchg(>nr_deferred[nid], 0);
+   nr = xchg_nr_deferred(shrinker, shrinkctl);
 
total_scan = nr;
if (shrinker->seeks) {
@@ -669,14 +728,9 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
next_deferred = 0;
/*
 * move the unused scan count back into the shrinker in a
-* manner that handles concurrent updates. If we exhausted the
-* scan, there is no need to do an update.
+* manner that handles concurrent updates.
 */
-   if (next_deferred > 0)
-   new_nr = atomic_long_add_return(next_deferred,
-   >nr_deferred[nid]);
-   else
-   new_nr = atomic_long_read(>nr_deferred[nid]);
+   new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
 
trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, 
total_scan);
return freed;
-- 
2.26.2



[v10 PATCH 07/13] mm: vmscan: add shrinker_info_protected() helper

2021-03-11 Thread Yang Shi
The shrinker_info is dereferenced in a couple of places via 
rcu_dereference_protected
with different calling conventions, for example, using mem_cgroup_nodeinfo 
helper
or dereferencing memcg->nodeinfo[nid]->shrinker_info.  And a later patch
will add more dereference sites.

So extract the dereference into a helper to make the code more readable.  No
functional change.

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Acked-by: Vlastimil Babka 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7fdfdacf9a1f..ef9f1531a6ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,6 +192,13 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
+static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
+int nid)
+{
+   return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+lockdep_is_held(_rwsem));
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int size, int old_size)
 {
@@ -201,7 +208,7 @@ static int expand_one_shrinker_info(struct mem_cgroup 
*memcg,
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   old = rcu_dereference_protected(pn->shrinker_info, true);
+   old = shrinker_info_protected(memcg, nid);
/* Not yet online memcg */
if (!old)
return 0;
@@ -232,7 +239,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   info = rcu_dereference_protected(pn->shrinker_info, true);
+   info = shrinker_info_protected(memcg, nid);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
@@ -675,8 +682,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int 
nid,
if (!down_read_trylock(_rwsem))
return 0;
 
-   info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-true);
+   info = shrinker_info_protected(memcg, nid);
if (unlikely(!info))
goto unlock;
 
-- 
2.26.2



[v10 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

2021-03-11 Thread Yang Shi
Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
from unregistering correctly.

Remove SHRINKER_REGISTERING since we could check if shrinker is registered
successfully by the new flag.

Acked-by: Kirill Tkhai 
Acked-by: Vlastimil Babka 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/shrinker.h |  7 ---
 mm/vmscan.c  | 40 +++-
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE(1 << 0)
-#define SHRINKER_MEMCG_AWARE   (1 << 1)
+#define SHRINKER_REGISTERED(1 << 0)
+#define SHRINKER_NUMA_AWARE(1 << 1)
+#define SHRINKER_MEMCG_AWARE   (1 << 2)
 /*
  * It just makes sense when the shrinker is also MEMCG_AWARE for now,
  * non-MEMCG_AWARE shrinker should not have this flag set.
  */
-#define SHRINKER_NONSLAB   (1 << 2)
+#define SHRINKER_NONSLAB   (1 << 3)
 
 extern int prealloc_shrinker(struct shrinker *shrinker);
 extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef9f1531a6ee..34cf3d84309c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -316,19 +316,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
int shrinker_id)
}
 }
 
-/*
- * We allow subsystems to populate their shrinker-related
- * LRU lists before register_shrinker_prepared() is called
- * for the shrinker, since we don't want to impose
- * restrictions on their internal registration order.
- * In this case shrink_slab_memcg() may find corresponding
- * bit is set in the shrinkers map.
- *
- * This value is used by the function to detect registering
- * shrinkers and to skip do_shrink_slab() calls for them.
- */
-#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
-
 static DEFINE_IDR(shrinker_idr);
 
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
@@ -337,7 +324,7 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
 
down_write(_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
-   id = idr_alloc(_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
+   id = idr_alloc(_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
 
@@ -360,9 +347,9 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
 
BUG_ON(id < 0);
 
-   down_write(_rwsem);
+   lockdep_assert_held(_rwsem);
+
idr_remove(_idr, id);
-   up_write(_rwsem);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
@@ -490,8 +477,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return;
 
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+   down_write(_rwsem);
unregister_memcg_shrinker(shrinker);
+   up_write(_rwsem);
+   }
 
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
@@ -501,10 +491,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 {
down_write(_rwsem);
list_add_tail(>list, _list);
-#ifdef CONFIG_MEMCG
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-   idr_replace(_idr, shrinker, shrinker->id);
-#endif
+   shrinker->flags |= SHRINKER_REGISTERED;
up_write(_rwsem);
 }
 
@@ -524,13 +511,16 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-   if (!shrinker->nr_deferred)
+   if (!(shrinker->flags & SHRINKER_REGISTERED))
return;
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-   unregister_memcg_shrinker(shrinker);
+
down_write(_rwsem);
list_del(>list);
+   shrinker->flags &= ~SHRINKER_REGISTERED;
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+   unregister_memcg_shrinker(shrinker);
up_write(_rwsem);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
 }
@@ -695,7 +685,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int 
nid,
struct shrinker *shrinker;
 
shrinker = idr_find(_idr, i);
-   if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
+   if (unlikely(!shrinke

[v10 PATCH 06/13] mm: memcontrol: rename shrinker_map to shrinker_info

2021-03-11 Thread Yang Shi
The following patch is going to add nr_deferred into shrinker_map; the change
will make shrinker_map include more than just the map, so rename it to
"memcg_shrinker_info".  This should make the patch adding nr_deferred cleaner
and more readable and make review easier.  Also remove the "memcg_" prefix.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  8 +++---
 mm/memcontrol.c|  6 ++--
 mm/vmscan.c| 58 +++---
 3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b07dc2d5014d..dc7d0e2cb3ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -117,7 +117,7 @@ struct batched_lruvec_stat {
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
  */
-struct memcg_shrinker_map {
+struct shrinker_info {
struct rcu_head rcu;
unsigned long map[];
 };
@@ -145,7 +145,7 @@ struct mem_cgroup_per_node {
 
struct mem_cgroup_reclaim_iter  iter;
 
-   struct memcg_shrinker_map __rcu *shrinker_map;
+   struct shrinker_info __rcu  *shrinker_info;
 
struct rb_node  tree_node;  /* RB tree node */
unsigned long   usage_in_excess;/* Set to the value by which */
@@ -1533,8 +1533,8 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-int alloc_shrinker_maps(struct mem_cgroup *memcg);
-void free_shrinker_maps(struct mem_cgroup *memcg);
+int alloc_shrinker_info(struct mem_cgroup *memcg);
+void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aeb1847f159c..35d44afdd9fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5131,11 +5131,11 @@ static int mem_cgroup_css_online(struct 
cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
/*
-* A memcg must be visible for expand_shrinker_maps()
+* A memcg must be visible for expand_shrinker_info()
 * by the time the maps are allocated. So, we allocate maps
 * here, when for_each_mem_cgroup() can't skip it.
 */
-   if (alloc_shrinker_maps(memcg)) {
+   if (alloc_shrinker_info(memcg)) {
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -5199,7 +5199,7 @@ static void mem_cgroup_css_free(struct 
cgroup_subsys_state *css)
vmpressure_cleanup(>vmpressure);
cancel_work_sync(>high_work);
mem_cgroup_remove_from_trees(memcg);
-   free_shrinker_maps(memcg);
+   free_shrinker_info(memcg);
memcg_free_kmem(memcg);
mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bbe13985ae05..7fdfdacf9a1f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,16 +192,16 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
-static int expand_one_shrinker_map(struct mem_cgroup *memcg,
-  int size, int old_size)
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
+   int size, int old_size)
 {
-   struct memcg_shrinker_map *new, *old;
+   struct shrinker_info *new, *old;
struct mem_cgroup_per_node *pn;
int nid;
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   old = rcu_dereference_protected(pn->shrinker_map, true);
+   old = rcu_dereference_protected(pn->shrinker_info, true);
/* Not yet online memcg */
if (!old)
return 0;
@@ -214,17 +214,17 @@ static int expand_one_shrinker_map(struct mem_cgroup 
*memcg,
memset(new->map, (int)0xff, old_size);
memset((void *)new->map + old_size, 0, size - old_size);
 
-   rcu_assign_pointer(pn->shrinker_map, new);
+   rcu_assign_pointer(pn->shrinker_info, new);
kvfree_rcu(old, rcu);
}
 
return 0;
 }
 
-void free_shrinker_maps(struct mem_cgroup *memcg)
+void free_shrinker_info(struct mem_cgroup *memcg)
 {
struct mem_cgroup_per_node *pn;
-   struct memcg_shrinker_map *map;
+   struct shrinker_info *info;
int nid;
 
if (mem_cgroup_is_root(memcg))
@@ -232,15 +232,15 @@ void free_shrinker_maps(struct mem_cgroup *memcg)
 
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
-   map = rcu_dereference_protected(pn->shrinker_map, true);
-   kvfree(map);
-   rcu_assign_

[v10 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

2021-03-11 Thread Yang Shi
Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
We don't have to define a dedicated callback for call_rcu() anymore.

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 641a0b8b4ea9..bbe13985ae05 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
-static void free_shrinker_map_rcu(struct rcu_head *head)
-{
-   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
 static int expand_one_shrinker_map(struct mem_cgroup *memcg,
   int size, int old_size)
 {
@@ -220,7 +215,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
memset((void *)new->map + old_size, 0, size - old_size);
 
rcu_assign_pointer(pn->shrinker_map, new);
-   call_rcu(>rcu, free_shrinker_map_rcu);
+   kvfree_rcu(old, rcu);
}
 
return 0;
-- 
2.26.2



[v10 PATCH 04/13] mm: vmscan: remove memcg_shrinker_map_size

2021-03-11 Thread Yang Shi
Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but actually
the map size can be calculated via shrinker_nr_max, so it seems unnecessary to
keep both.  Remove memcg_shrinker_map_size since shrinker_nr_max is also used
for iterating the bitmap.

Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Acked-by: Vlastimil Babka 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b08c8d9055ae..641a0b8b4ea9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,8 +185,12 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_MEMCG
+static int shrinker_nr_max;
 
-static int memcg_shrinker_map_size;
+static inline int shrinker_map_size(int nr_items)
+{
+   return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
+}
 
 static void free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -248,7 +252,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
return 0;
 
down_write(_rwsem);
-   size = memcg_shrinker_map_size;
+   size = shrinker_map_size(shrinker_nr_max);
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
if (!map) {
@@ -266,12 +270,13 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
 static int expand_shrinker_maps(int new_id)
 {
int size, old_size, ret = 0;
+   int new_nr_max = new_id + 1;
struct mem_cgroup *memcg;
 
-   size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-   old_size = memcg_shrinker_map_size;
+   size = shrinker_map_size(new_nr_max);
+   old_size = shrinker_map_size(shrinker_nr_max);
if (size <= old_size)
-   return 0;
+   goto out;
 
if (!root_mem_cgroup)
goto out;
@@ -290,7 +295,7 @@ static int expand_shrinker_maps(int new_id)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 out:
if (!ret)
-   memcg_shrinker_map_size = size;
+   shrinker_nr_max = new_nr_max;
 
return ret;
 }
@@ -323,7 +328,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
int shrinker_id)
 #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
 
 static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;
 
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
@@ -340,8 +344,6 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
idr_remove(_idr, id);
goto unlock;
}
-
-   shrinker_nr_max = id + 1;
}
shrinker->id = id;
ret = 0;
-- 
2.26.2



[v10 PATCH 03/13] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation

2021-03-11 Thread Yang Shi
Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
exclusively, the read side can be protected by holding the read lock, so it
sounds superfluous to have a dedicated mutex.

Kirill Tkhai suggested using the write lock since:

  * We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
  * The rcu_dereference_protected() dereferencing in shrink_slab_memcg() is not
    actually protected in case we use the READ lock in alloc_shrinker_maps().
  * The READ lock makes alloc_shrinker_info() racy against memory allocation
    failure. alloc_shrinker_info()->free_shrinker_info() may free memory right
    after shrink_slab_memcg() dereferenced it. You may say
    shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
    but this is not the thing we want to remember in the future, since this
    spreads modularity.

And a test with a heavy paging workload didn't show that the write lock makes
things worse.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 98a44fb81f8a..b08c8d9055ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 
 static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
 
 static void free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -201,8 +200,6 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
struct mem_cgroup_per_node *pn;
int nid;
 
-   lockdep_assert_held(_shrinker_map_mutex);
-
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
old = rcu_dereference_protected(pn->shrinker_map, true);
@@ -250,7 +247,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
if (mem_cgroup_is_root(memcg))
return 0;
 
-   mutex_lock(_shrinker_map_mutex);
+   down_write(_rwsem);
size = memcg_shrinker_map_size;
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
@@ -261,7 +258,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
}
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
}
-   mutex_unlock(_shrinker_map_mutex);
+   up_write(_rwsem);
 
return ret;
 }
@@ -276,9 +273,10 @@ static int expand_shrinker_maps(int new_id)
if (size <= old_size)
return 0;
 
-   mutex_lock(_shrinker_map_mutex);
if (!root_mem_cgroup)
-   goto unlock;
+   goto out;
+
+   lockdep_assert_held(_rwsem);
 
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
@@ -287,13 +285,13 @@ static int expand_shrinker_maps(int new_id)
ret = expand_one_shrinker_map(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
-   goto unlock;
+   goto out;
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-unlock:
+out:
if (!ret)
memcg_shrinker_map_size = size;
-   mutex_unlock(_shrinker_map_mutex);
+
return ret;
 }
 
-- 
2.26.2



[v10 PATCH 02/13] mm: vmscan: consolidate shrinker_maps handling code

2021-03-11 Thread Yang Shi
The shrinker map management is not purely memcg specific, it is at the
intersection between memory cgroups and shrinkers.  It is the allocation and
assignment of a structure, and the only memcg-specific bit is that the map is
stored in a memcg structure.  So move the shrinker_maps handling code into
vmscan.c for tighter integration with the shrinker code, and remove the
"memcg_" prefix.  There is no functional change.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  11 ++--
 mm/huge_memory.c   |   4 +-
 mm/list_lru.c  |   6 +-
 mm/memcontrol.c| 130 +---
 mm/vmscan.c| 132 -
 5 files changed, 142 insertions(+), 141 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e946c96daa32..b07dc2d5014d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1533,10 +1533,9 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-extern int memcg_expand_shrinker_maps(int new_id);
-
-extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
-  int nid, int shrinker_id);
+int alloc_shrinker_maps(struct mem_cgroup *memcg);
+void free_shrinker_maps(struct mem_cgroup *memcg);
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 #else
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
@@ -1546,8 +1545,8 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
- int nid, int shrinker_id)
+static inline void set_shrinker_bit(struct mem_cgroup *memcg,
+   int nid, int shrinker_id)
 {
 }
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..92a1227be029 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2836,8 +2836,8 @@ void deferred_split_huge_page(struct page *page)
ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
if (memcg)
-   memcg_set_shrinker_bit(memcg, page_to_nid(page),
-  deferred_split_shrinker.id);
+   set_shrinker_bit(memcg, page_to_nid(page),
+deferred_split_shrinker.id);
 #endif
}
spin_unlock_irqrestore(_queue->split_queue_lock, flags);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 6f067b6b935f..cd58790d0fb3 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -125,8 +125,8 @@ bool list_lru_add(struct list_lru *lru, struct list_head 
*item)
list_add_tail(item, >list);
/* Set shrinker bit if the first element was added */
if (!l->nr_items++)
-   memcg_set_shrinker_bit(memcg, nid,
-  lru_shrinker_id(lru));
+   set_shrinker_bit(memcg, nid,
+lru_shrinker_id(lru));
nlru->nr_items++;
spin_unlock(>lock);
return true;
@@ -540,7 +540,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, 
int nid,
 
if (src->nr_items) {
dst->nr_items += src->nr_items;
-   memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
+   set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
src->nr_items = 0;
}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da6fded852df..aeb1847f159c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -402,130 +402,6 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
-static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
-
-static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
-{
-   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
-static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
-int size, int old_size)
-{
-   struct memcg_shrinker_map *new, *old;
-   struct mem_cgroup_per_node *pn;
-   int nid;
-
-   lockdep_assert_held(_shrinker_map_mutex);
-
-   for_each_node(nid) {
-   pn = memcg->nodeinfo[nid];
-   old = rcu_dereference_protected(pn->shrinker_map, true);
-   /* Not yet online memcg */
-   if (!old)
-   return 0;
-
-   new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
-   if (!new)
-   return -ENOMEM;
-
-

[v10 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-03-11 Thread Yang Shi
d /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
cat $FILE 2>/dev/null
done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the
below tracing result shows:

kswapd0-475   [028]  305968.252561: mm_shrink_slab_start: 
super_cache_scan+0x0/0x190 24acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 
45746 total_scan 46844936 priority 12
kswapd0-475   [021]  306013.099399: mm_shrink_slab_end: 
super_cache_scan+0x0/0x190 24acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker 
return val 46844928
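
(For what it's worth, total_scan 46844936 is roughly 93689873 / 2, i.e. half
of the cache items reported by the shrinker in a single invocation, even
though the delta for this priority was only 45746.)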

There were a huge number of deferred objects before the shrinker was called.
The behavior does match the code, but it might not be desirable from the
user's standpoint.

The excessive amount of nr_deferred might be accumulated due to various
reasons, for example:
* GFP_NOFS allocations
* Many rounds of small scans (< scan_batch, 1024 for vfs metadata)

However the LRUs of slabs are per memcg (memcg-aware shrinkers) but the
deferred objects are per shrinker; this may have some bad effects:
* Poor isolation among memcgs. Some memcgs which happen to have frequent
  limit reclaim may get nr_deferred accumulated to a huge number, then other
  innocent memcgs may take the fall. In our case the main workload was hit.
* Unbounded deferred objects. There is no cap on deferred objects; they can
  grow ridiculously large, as the tracing result showed.
* Easy to get out of control. Although shrinkers take deferred objects into
  account, it can still go out of control easily. One misconfigured memcg
  could incur an absurd amount of deferred objects in a period of time.
* All sorts of reclaim problems, i.e. over reclaim, long reclaim latency, etc.
  There may be hundreds of GB of slab caches for a vfs metadata heavy
  workload; shrinking half of them may take minutes. We observed latency
  spikes due to the prolonged reclaim.

These issues also have been discussed in 
https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828...@gmail.com/.
The patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
* Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
  does. It is an atomic_long_t array and each element represents one shrinker,
  even if the shrinker is not memcg aware; this simplifies the implementation.
  For memcg aware shrinkers, the deferred objects are just accumulated to their
  own memcg. The shrinkers just see nr_deferred from their own memcg. Non memcg
  aware shrinkers still use the global nr_deferred from struct shrinker.
* Once the memcg is offlined, its nr_deferred will be reparented to its parent
  along with the LRUs.
* The root memcg has the memcg_shrinker_deferred array too. It simplifies the
  handling of reparenting to the root memcg.
* Cap nr_deferred to 2x of the length of the lru. The idea is borrowed from
  Dave Chinner's series
  (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-da...@fromorbit.com/)

The downside is that each memcg has to allocate extra memory to store the nr_deferred
array. In our production environment there are typically around 40 shrinkers, so each
memcg needs ~320 bytes. 10K memcgs would need ~3.2MB of memory. It seems fine.
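
To make that cost concrete, here is a minimal userspace sketch of the per-memcg,
per-node layout this series ends up with (it matches struct shrinker_info in patch
09/13; atomic_long_t and rcu_head are stubbed so the snippet compiles outside the
kernel, and the 40-shrinker count is just the typical figure quoted above):

#include <stdio.h>

typedef long atomic_long_t;                       /* userspace stub */
struct rcu_head { void *next; void (*func)(struct rcu_head *); }; /* stub */

/* One of these per memcg per node; the deferred counters and the shrinker
 * bitmap live in a single allocation right behind the struct.
 */
struct shrinker_info {
	struct rcu_head rcu;
	atomic_long_t *nr_deferred;   /* one slot per registered shrinker */
	unsigned long *map;           /* "has objects" bitmap             */
};

int main(void)
{
	int nr_shrinkers = 40;        /* typical count from above */

	printf("nr_deferred payload per memcg per node: ~%zu bytes\n",
	       nr_shrinkers * sizeof(atomic_long_t));    /* ~320 bytes */
	return 0;
}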

We have been running the patched kernel on some hosts of our fleet (test and
production) for months, and it works very well. The monitoring data shows the working
set is sustained as expected.

Yang Shi (13):
  mm: vmscan: use nid from shrink_control for tracepoint
  mm: vmscan: consolidate shrinker_maps handling code
  mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  mm: vmscan: remove memcg_shrinker_map_size
  mm: vmscan: use kvfree_rcu instead of call_rcu
  mm: memcontrol: rename shrinker_map to shrinker_info
  mm: vmscan: add shrinker_info_protected() helper
  mm: vmscan: use a new flag to indicate shrinker is registered
  mm: vmscan: add per memcg shrinker nr_deferred
  mm: vmscan: use per memcg nr_deferred of shrinker
  mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware 
shrinkers
  mm: memcontrol: reparent nr_deferred when memcg offline
  mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |  23 +++---
 include/linux/shrinker.h   |   7 +-
 mm/huge_memory.c   |   4 +-
 mm/list_lru.c  |   6 +-
 mm/memcontrol.c| 130 +--
 mm/vmscan.c| 394 

 6 files changed, 319 insertions(+), 245 deletions(-)



[v10 PATCH 01/13] mm: vmscan: use nid from shrink_control for tracepoint

2021-03-11 Thread Yang Shi
The tracepoint's nid should show which node the shrink happens on. The start
tracepoint uses the nid from shrinkctl, but the nid might be set to 0 before the end
tracepoint if the shrinker is not NUMA aware, so the tracing log may show the shrink
happening on one node but ending up on another node, which is confusing. The
following patch will also remove the direct use of nid in do_shrink_slab(), so this
patch helps clean up the code as well.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 407051ebe869..bdc32c803c66 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,7 +536,7 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
else
new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
 
-   trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+   trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, 
total_scan);
return freed;
 }
 
-- 
2.26.2



Re: [PATCH v2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-10 Thread Yang Shi
On Wed, Mar 10, 2021 at 7:36 AM Zi Yan  wrote:
>
> From: Zi Yan 
>
> We do not have a direct user interface of splitting the compound page
> backing a THP and there is no need unless we want to expose the THP
> implementation details to users. Adding an interface for debugging.
>
> By writing ",," to
> /split_huge_pages_in_range_pid, THPs within the given virtual

Can we reuse the existing split_huge_page knob instead of creating a new one?

Two knobs for splitting huge pages for debugging purposes seem like overkill
to me, IMHO. I'm wondering if we could check whether a special value
(e.g. 1 or -1) is written and, if so, split all THPs as the split_huge_page
knob does?

I don't think this interface is widely used, so the risk of breaking
userspace should be very low.
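
A rough userspace sketch of the parsing I have in mind (illustration only; the
single-value form and the magic value "1" are assumptions on my side, while the
range form matches the patch below):

#include <stdio.h>

int main(void)
{
	const char *inputs[] = { "1", "1234,0x700000000000,0x700000200000" };

	for (int i = 0; i < 2; i++) {
		int pid;
		unsigned long start, end;
		int n = sscanf(inputs[i], "%d,0x%lx,0x%lx", &pid, &start, &end);

		if (n == 1 && pid == 1)
			printf("'%s' -> split all THPs (old knob behavior)\n",
			       inputs[i]);
		else if (n == 3)
			printf("'%s' -> split pid %d range [0x%lx, 0x%lx)\n",
			       inputs[i], pid, start, end);
		else
			printf("'%s' -> invalid input\n", inputs[i]);
	}
	return 0;
}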

> address range from the process with the given pid are split. It is used
> to test split_huge_page function. In addition, a selftest program is
> added to tools/testing/selftests/vm to utilize the interface by
> splitting PMD THPs and PTE-mapped THPs.
>
> Changelog:
>
> From v1:
>
> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>robot .
> 2. Dropped the use of find_mm_struct and code it directly, since there
>is no need for the permission check in that function and the function
>is only available when migration is on.
> 3. Added some comments in the selftest program to clarify how PTE-mapped
>THPs are formed.
>
> Signed-off-by: Zi Yan 
> ---
>  mm/huge_memory.c  | 112 ++
>  tools/testing/selftests/vm/.gitignore |   1 +
>  tools/testing/selftests/vm/Makefile   |   1 +
>  .../selftests/vm/split_huge_page_test.c   | 320 ++
>  4 files changed, 434 insertions(+)
>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bff92dea5ab3..7797e8b2aba0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -7,6 +7,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2965,10 +2966,121 @@ static int split_huge_pages_set(void *data, u64 val)
>  DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
> "%llu\n");
>
> +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
> +   const char __user *buf, size_t count, loff_t *ppops)
> +{
> +   static DEFINE_MUTEX(mutex);
> +   ssize_t ret;
> +   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
> +   int pid;
> +   unsigned long vaddr_start, vaddr_end, addr;
> +   struct task_struct *task;
> +   struct mm_struct *mm;
> +   unsigned long total = 0, split = 0;
> +
> +   ret = mutex_lock_interruptible(&mutex);
> +   if (ret)
> +   return ret;
> +
> +   ret = -EFAULT;
> +
> +   memset(input_buf, 0, 80);
> +   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
> +   goto out;
> +
> +   input_buf[79] = '\0';
> +   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, 
> &vaddr_end);
> +   if (ret != 3) {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +   vaddr_start &= PAGE_MASK;
> +   vaddr_end &= PAGE_MASK;
> +
> +   ret = strlen(input_buf);
> +   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
> +pid, vaddr_start, vaddr_end);
> +
> +   /* Find the task_struct from pid */
> +   rcu_read_lock();
> +   task = find_task_by_vpid(pid);
> +   if (!task) {
> +   rcu_read_unlock();
> +   ret = -ESRCH;
> +   goto out;
> +   }
> +   get_task_struct(task);
> +   rcu_read_unlock();
> +
> +   /* Find the mm_struct */
> +   mm = get_task_mm(task);
> +   put_task_struct(task);
> +
> +   if (!mm) {
> +   ret = -EINVAL;
> +   goto out;
> +   }
> +
> +   mmap_read_lock(mm);
> +   /*
> +* always increase addr by PAGE_SIZE, since we could have a PTE page
> +* table filled with PTE-mapped THPs, each of which is distinct.
> +*/
> +   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
> +   struct vm_area_struct *vma = find_vma(mm, addr);
> +   unsigned int follflags;
> +   struct page *page;
> +
> +   if (!vma || addr < vma->vm_start)
> +   break;
> +
> +   /* FOLL_DUMP to ignore special (like zero) pages */
> +   follflags = FOLL_GET | FOLL_DUMP;
> +   page = follow_page(vma, addr, follflags);
> +
> +   if (IS_ERR(page))
> +   break;
> +   if (!page)
> +   break;
> +
> +   if (!is_transparent_hugepage(page))
> +   continue;
> +
> +   total++;
> +   if (!can_split_huge_page(compound_head(page), NULL))
> +   

Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Yang Shi
On Wed, Mar 10, 2021 at 2:41 PM Shakeel Butt  wrote:
>
> On Wed, Mar 10, 2021 at 1:41 PM Yang Shi  wrote:
> >
> > On Wed, Mar 10, 2021 at 1:08 PM Shakeel Butt  wrote:
> > >
> > > On Wed, Mar 10, 2021 at 10:54 AM Yang Shi  wrote:
> > > >
> > > > On Wed, Mar 10, 2021 at 10:24 AM Shakeel Butt  
> > > > wrote:
> > > > >
> > > > > On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
> > > > > >
> > > > > > The number of deferred objects might get windup to an absurd 
> > > > > > number, and it
> > > > > > results in clamp of slab objects.  It is undesirable for sustaining 
> > > > > > workingset.
> > > > > >
> > > > > > So shrink deferred objects proportional to priority and cap 
> > > > > > nr_deferred to twice
> > > > > > of cache items.
> > > > > >
> > > > > > The idea is borrowed from Dave Chinner's patch:
> > > > > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
> > > > > >
> > > > > > Tested with kernel build and vfs metadata heavy workload in our 
> > > > > > production
> > > > > > environment, no regression is spotted so far.
> > > > >
> > > > > Did you run both of these workloads in the same cgroup or separate 
> > > > > cgroups?
> > > >
> > > > Both are covered.
> > > >
> > >
> > > Have you tried just this patch i.e. without the first 12 patches?
> >
> > No. It could be applied without the first 12 patches, but I didn't
> > test this combination specifically since I don't think it would have
> > any difference from with the first 12 patches. I tested running the
> > test case under root memcg, it seems equal to w/o the first 12 patches
> > and the only difference is where to get nr_deferred.
>
> I am trying to measure the impact of this patch independently. One
> point I can think of is the global reclaim. The first 12 patches do
> not aim to improve the global reclaim but this patch will. I am just
> wondering what would be negative if any of this patch.

Feel free to do so. More tests from more workloads are definitely
appreciated. That could give us more confidence about this patch or
catch regressions sooner.


Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Yang Shi
On Wed, Mar 10, 2021 at 1:08 PM Shakeel Butt  wrote:
>
> On Wed, Mar 10, 2021 at 10:54 AM Yang Shi  wrote:
> >
> > On Wed, Mar 10, 2021 at 10:24 AM Shakeel Butt  wrote:
> > >
> > > On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
> > > >
> > > > The number of deferred objects might get windup to an absurd number, 
> > > > and it
> > > > results in clamp of slab objects.  It is undesirable for sustaining 
> > > > workingset.
> > > >
> > > > So shrink deferred objects proportional to priority and cap nr_deferred 
> > > > to twice
> > > > of cache items.
> > > >
> > > > The idea is borrowed from Dave Chinner's patch:
> > > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
> > > >
> > > > Tested with kernel build and vfs metadata heavy workload in our 
> > > > production
> > > > environment, no regression is spotted so far.
> > >
> > > Did you run both of these workloads in the same cgroup or separate 
> > > cgroups?
> >
> > Both are covered.
> >
>
> Have you tried just this patch i.e. without the first 12 patches?

No. It could be applied without the first 12 patches, but I didn't
test that combination specifically since I don't think it would make
any difference compared to having the first 12 patches. I tested running the
test case under the root memcg; it seems equivalent to not having the first 12
patches, and the only difference is where nr_deferred comes from.


Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Yang Shi
On Wed, Mar 10, 2021 at 10:24 AM Shakeel Butt  wrote:
>
> On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
> >
> > The number of deferred objects might get windup to an absurd number, and it
> > results in clamp of slab objects.  It is undesirable for sustaining 
> > workingset.
> >
> > So shrink deferred objects proportional to priority and cap nr_deferred to 
> > twice
> > of cache items.
> >
> > The idea is borrowed from Dave Chinner's patch:
> > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
> >
> > Tested with kernel build and vfs metadata heavy workload in our production
> > environment, no regression is spotted so far.
>
> Did you run both of these workloads in the same cgroup or separate cgroups?

Both are covered.

>
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  mm/vmscan.c | 46 +++---
> >  1 file changed, 11 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9a2dfeaa79f4..6a0a91b23597 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -662,7 +662,6 @@ static unsigned long do_shrink_slab(struct 
> > shrink_control *shrinkctl,
> >  */
> > nr = xchg_nr_deferred(shrinker, shrinkctl);
> >
> > -   total_scan = nr;
> > if (shrinker->seeks) {
> > delta = freeable >> priority;
> > delta *= 4;
> > @@ -676,37 +675,9 @@ static unsigned long do_shrink_slab(struct 
> > shrink_control *shrinkctl,
> > delta = freeable / 2;
> > }
> >
> > +   total_scan = nr >> priority;
> > total_scan += delta;
> > -   if (total_scan < 0) {
> > -   pr_err("shrink_slab: %pS negative objects to delete 
> > nr=%ld\n",
> > -  shrinker->scan_objects, total_scan);
> > -   total_scan = freeable;
> > -   next_deferred = nr;
> > -   } else
> > -   next_deferred = total_scan;
> > -
> > -   /*
> > -* We need to avoid excessive windup on filesystem shrinkers
> > -* due to large numbers of GFP_NOFS allocations causing the
> > -* shrinkers to return -1 all the time. This results in a large
> > -* nr being built up so when a shrink that can do some work
> > -* comes along it empties the entire cache due to nr >>>
> > -* freeable. This is bad for sustaining a working set in
> > -* memory.
> > -*
> > -* Hence only allow the shrinker to scan the entire cache when
> > -* a large delta change is calculated directly.
> > -*/
> > -   if (delta < freeable / 4)
> > -   total_scan = min(total_scan, freeable / 2);
> > -
> > -   /*
> > -* Avoid risking looping forever due to too large nr value:
> > -* never try to free more than twice the estimate number of
> > -* freeable entries.
> > -*/
> > -   if (total_scan > freeable * 2)
> > -   total_scan = freeable * 2;
> > +   total_scan = min(total_scan, (2 * freeable));
> >
> > trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >freeable, delta, total_scan, priority);
> > @@ -745,10 +716,15 @@ static unsigned long do_shrink_slab(struct 
> > shrink_control *shrinkctl,
> > cond_resched();
> > }
> >
> > -   if (next_deferred >= scanned)
> > -   next_deferred -= scanned;
> > -   else
> > -   next_deferred = 0;
> > +   /*
> > +* The deferred work is increased by any new work (delta) that 
> > wasn't
> > +* done, decreased by old deferred work that was done now.
> > +*
> > +* And it is capped to two times of the freeable items.
> > +*/
> > +   next_deferred = max_t(long, (nr + delta - scanned), 0);
> > +   next_deferred = min(next_deferred, (2 * freeable));
> > +
> > /*
> >  * move the unused scan count back into the shrinker in a
> >  * manner that handles concurrent updates.
> > --
> > 2.26.2
> >


[v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Yang Shi
The number of deferred objects can wind up to an absurd number, which results in
clamping of slab objects.  That is undesirable for sustaining the working set.

So shrink deferred objects proportionally to priority and cap nr_deferred to twice
the number of cache items.

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/

Tested with a kernel build and a vfs-metadata-heavy workload in our production
environment; no regression has been spotted so far.
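
For illustration only (not part of the patch), plugging the numbers from the
mm_shrink_slab_start line in the cover letter into the new arithmetic, and assuming
DEFAULT_SEEKS (2) for the superblock shrinker:

#include <stdio.h>

int main(void)
{
	long long nr = 4994376020LL;     /* accumulated deferred objects */
	long long freeable = 93689873LL; /* cache items                  */
	int priority = 12;               /* priority from the trace      */
	int seeks = 2;                   /* DEFAULT_SEEKS, assumed       */

	long long delta = (freeable >> priority) * 4 / seeks;  /* 45746 */
	long long total_scan = (nr >> priority) + delta;

	if (total_scan > 2 * freeable)
		total_scan = 2 * freeable;

	/* The old code started from nr + delta and ended up scanning
	 * freeable / 2 (~46.8M objects, matching the trace); the new code
	 * starts from nr >> priority, roughly 1.27M objects here.
	 */
	printf("delta=%lld total_scan=%lld old_clamp=%lld\n",
	       delta, total_scan, freeable / 2);
	return 0;
}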

Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 46 +++---
 1 file changed, 11 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a2dfeaa79f4..6a0a91b23597 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -662,7 +662,6 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
 */
nr = xchg_nr_deferred(shrinker, shrinkctl);
 
-   total_scan = nr;
if (shrinker->seeks) {
delta = freeable >> priority;
delta *= 4;
@@ -676,37 +675,9 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
delta = freeable / 2;
}
 
+   total_scan = nr >> priority;
total_scan += delta;
-   if (total_scan < 0) {
-   pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-  shrinker->scan_objects, total_scan);
-   total_scan = freeable;
-   next_deferred = nr;
-   } else
-   next_deferred = total_scan;
-
-   /*
-* We need to avoid excessive windup on filesystem shrinkers
-* due to large numbers of GFP_NOFS allocations causing the
-* shrinkers to return -1 all the time. This results in a large
-* nr being built up so when a shrink that can do some work
-* comes along it empties the entire cache due to nr >>>
-* freeable. This is bad for sustaining a working set in
-* memory.
-*
-* Hence only allow the shrinker to scan the entire cache when
-* a large delta change is calculated directly.
-*/
-   if (delta < freeable / 4)
-   total_scan = min(total_scan, freeable / 2);
-
-   /*
-* Avoid risking looping forever due to too large nr value:
-* never try to free more than twice the estimate number of
-* freeable entries.
-*/
-   if (total_scan > freeable * 2)
-   total_scan = freeable * 2;
+   total_scan = min(total_scan, (2 * freeable));
 
trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
   freeable, delta, total_scan, priority);
@@ -745,10 +716,15 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
cond_resched();
}
 
-   if (next_deferred >= scanned)
-   next_deferred -= scanned;
-   else
-   next_deferred = 0;
+   /*
+* The deferred work is increased by any new work (delta) that wasn't
+* done, decreased by old deferred work that was done now.
+*
+* And it is capped to two times of the freeable items.
+*/
+   next_deferred = max_t(long, (nr + delta - scanned), 0);
+   next_deferred = min(next_deferred, (2 * freeable));
+
/*
 * move the unused scan count back into the shrinker in a
 * manner that handles concurrent updates.
-- 
2.26.2



[v9 PATCH 11/13] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

2021-03-10 Thread Yang Shi
Now that nr_deferred is available at the per-memcg level for memcg-aware shrinkers,
we no longer need to allocate shrinker->nr_deferred for such shrinkers.

prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg is disabled
on the kernel command line, in which case the shrinker's SHRINKER_MEMCG_AWARE flag
is cleared.  This keeps the implementation of this patch simple.

Acked-by: Vlastimil Babka 
Reviewed-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 326f0e0c4356..cf25c78661d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,6 +344,9 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
 {
int id, ret = -ENOMEM;
 
+   if (mem_cgroup_disabled())
+   return -ENOSYS;
+
down_write(&shrinker_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
@@ -423,7 +426,7 @@ static bool writeback_throttling_sane(struct scan_control 
*sc)
 #else
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
-   return 0;
+   return -ENOSYS;
 }
 
 static void unregister_memcg_shrinker(struct shrinker *shrinker)
@@ -535,8 +538,18 @@ static unsigned long lruvec_lru_size(struct lruvec 
*lruvec, enum lru_list lru,
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-   unsigned int size = sizeof(*shrinker->nr_deferred);
+   unsigned int size;
+   int err;
+
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+   err = prealloc_memcg_shrinker(shrinker);
+   if (err != -ENOSYS)
+   return err;
 
+   shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+   }
+
+   size = sizeof(*shrinker->nr_deferred);
if (shrinker->flags & SHRINKER_NUMA_AWARE)
size *= nr_node_ids;
 
@@ -544,28 +557,16 @@ int prealloc_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
 
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-   if (prealloc_memcg_shrinker(shrinker))
-   goto free_deferred;
-   }
-
return 0;
-
-free_deferred:
-   kfree(shrinker->nr_deferred);
-   shrinker->nr_deferred = NULL;
-   return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-   if (!shrinker->nr_deferred)
-   return;
-
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
down_write(&shrinker_rwsem);
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);
+   return;
}
 
kfree(shrinker->nr_deferred);
-- 
2.26.2



[v9 PATCH 12/13] mm: memcontrol: reparent nr_deferred when memcg offline

2021-03-10 Thread Yang Shi
Now that the shrinker's nr_deferred is per memcg for memcg-aware shrinkers, add it
to the parent's corresponding nr_deferred when the memcg is offlined.
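
A toy userspace illustration of the reparenting step (not kernel code; it just
mirrors what reparent_shrinker_deferred() below does, with made-up numbers):

#include <stdio.h>

#define NR_SHRINKERS 3

/* When a memcg goes offline, its per-shrinker deferred counts are simply
 * added into the parent's counts so no deferred work is lost.
 */
static void reparent(long long parent[], const long long child[])
{
	for (int i = 0; i < NR_SHRINKERS; i++)
		parent[i] += child[i];
}

int main(void)
{
	long long parent[NR_SHRINKERS] = { 10, 0, 5 };
	long long child[NR_SHRINKERS]  = { 90, 7, 0 };

	reparent(parent, child);
	for (int i = 0; i < NR_SHRINKERS; i++)
		printf("shrinker %d: parent nr_deferred = %lld\n", i, parent[i]);
	return 0;
}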

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c|  1 +
 mm/vmscan.c| 24 
 3 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 42a4facb5b7c..2c76fe53fb6d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1569,6 +1569,7 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
+void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #else
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index edd8a06c751f..dacb1c6087ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5262,6 +5262,7 @@ static void mem_cgroup_css_offline(struct 
cgroup_subsys_state *css)
page_counter_set_low(&memcg->memory, 0);
 
memcg_offline_kmem(memcg);
+   reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
 
drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf25c78661d1..9a2dfeaa79f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -395,6 +395,30 @@ static long add_nr_deferred_memcg(long nr, int nid, struct 
shrinker *shrinker,
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
 }
 
+void reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+   int i, nid;
+   long nr;
+   struct mem_cgroup *parent;
+   struct shrinker_info *child_info, *parent_info;
+
+   parent = parent_mem_cgroup(memcg);
+   if (!parent)
+   parent = root_mem_cgroup;
+
+   /* Prevent from concurrent shrinker_info expand */
down_read(&shrinker_rwsem);
+   for_each_node(nid) {
+   child_info = shrinker_info_protected(memcg, nid);
+   parent_info = shrinker_info_protected(parent, nid);
+   for (i = 0; i < shrinker_nr_max; i++) {
nr = atomic_long_read(&child_info->nr_deferred[i]);
atomic_long_add(nr, &parent_info->nr_deferred[i]);
+   }
+   }
up_read(&shrinker_rwsem);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return sc->target_mem_cgroup;
-- 
2.26.2



[v9 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

2021-03-10 Thread Yang Shi
Use the per-memcg nr_deferred for memcg-aware shrinkers.  The shrinker's own
nr_deferred is still used in the following cases (a minimal sketch of the selection
rule follows the list):
1. Non-memcg-aware shrinkers
2. !CONFIG_MEMCG
3. memcg is disabled by boot parameter
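
Minimal userspace sketch of that selection rule (illustration only; the helper name
is hypothetical, and the flag value is the one defined in patch 08/13 of this
series).  It mirrors the checks in xchg_nr_deferred()/add_nr_deferred() below; note
that case 3 is covered because prealloc clears SHRINKER_MEMCG_AWARE when memcg is
disabled:

#include <stdbool.h>
#include <stdio.h>

#define SHRINKER_MEMCG_AWARE	(1 << 2)

/* Per-memcg nr_deferred is used only for memcg-targeted reclaim of a
 * MEMCG_AWARE shrinker; everything else falls back to the shrinker's own
 * nr_deferred.
 */
static bool uses_memcg_nr_deferred(bool memcg_reclaim, unsigned int flags)
{
	return memcg_reclaim && (flags & SHRINKER_MEMCG_AWARE);
}

int main(void)
{
	printf("memcg reclaim + MEMCG_AWARE shrinker -> %d\n",
	       uses_memcg_nr_deferred(true, SHRINKER_MEMCG_AWARE));   /* 1 */
	printf("global reclaim, any shrinker        -> %d\n",
	       uses_memcg_nr_deferred(false, SHRINKER_MEMCG_AWARE));  /* 0 */
	printf("non-MEMCG_AWARE shrinker            -> %d\n",
	       uses_memcg_nr_deferred(true, 0));                      /* 0 */
	return 0;
}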

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 78 -
 1 file changed, 66 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ae82afe6cec6..326f0e0c4356 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -374,6 +374,24 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
idr_remove(_idr, id);
 }
 
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+  struct mem_cgroup *memcg)
+{
+   struct shrinker_info *info;
+
+   info = shrinker_info_protected(memcg, nid);
return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+   struct shrinker_info *info;
+
+   info = shrinker_info_protected(memcg, nid);
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return sc->target_mem_cgroup;
@@ -412,6 +430,18 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
 {
 }
 
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+  struct mem_cgroup *memcg)
+{
+   return 0;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+ struct mem_cgroup *memcg)
+{
+   return 0;
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
return false;
@@ -423,6 +453,39 @@ static bool writeback_throttling_sane(struct scan_control 
*sc)
 }
 #endif
 
+static long xchg_nr_deferred(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   int nid = sc->nid;
+
+   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+   nid = 0;
+
+   if (sc->memcg &&
+   (shrinker->flags & SHRINKER_MEMCG_AWARE))
+   return xchg_nr_deferred_memcg(nid, shrinker,
+ sc->memcg);
+
return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+
+static long add_nr_deferred(long nr, struct shrinker *shrinker,
+   struct shrink_control *sc)
+{
+   int nid = sc->nid;
+
+   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+   nid = 0;
+
+   if (sc->memcg &&
+   (shrinker->flags & SHRINKER_MEMCG_AWARE))
+   return add_nr_deferred_memcg(nr, nid, shrinker,
+sc->memcg);
+
return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -559,14 +622,10 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
long freeable;
long nr;
long new_nr;
-   int nid = shrinkctl->nid;
long batch_size = shrinker->batch ? shrinker->batch
  : SHRINK_BATCH;
long scanned = 0, next_deferred;
 
-   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-   nid = 0;
-
freeable = shrinker->count_objects(shrinker, shrinkctl);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
@@ -576,7 +635,7 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
 * and zero it so that other concurrent shrinker invocations
 * don't also do this scanning work.
 */
-   nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+   nr = xchg_nr_deferred(shrinker, shrinkctl);
 
total_scan = nr;
if (shrinker->seeks) {
@@ -667,14 +726,9 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
next_deferred = 0;
/*
 * move the unused scan count back into the shrinker in a
-* manner that handles concurrent updates. If we exhausted the
-* scan, there is no need to do an update.
+* manner that handles concurrent updates.
 */
-   if (next_deferred > 0)
-   new_nr = atomic_long_add_return(next_deferred,
-   &shrinker->nr_deferred[nid]);
-   else
-   new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+   new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
 
trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, 
total_scan);
return freed;
-- 
2.26.2



[v9 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

2021-03-10 Thread Yang Shi
Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move the MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so
their shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
from unregistering correctly.

Remove SHRINKER_REGISTERING since we can now check whether a shrinker was
registered successfully via the new flag.

Acked-by: Kirill Tkhai 
Acked-by: Vlastimil Babka 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/shrinker.h |  7 ---
 mm/vmscan.c  | 40 +++-
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE(1 << 0)
-#define SHRINKER_MEMCG_AWARE   (1 << 1)
+#define SHRINKER_REGISTERED(1 << 0)
+#define SHRINKER_NUMA_AWARE(1 << 1)
+#define SHRINKER_MEMCG_AWARE   (1 << 2)
 /*
  * It just makes sense when the shrinker is also MEMCG_AWARE for now,
  * non-MEMCG_AWARE shrinker should not have this flag set.
  */
-#define SHRINKER_NONSLAB   (1 << 2)
+#define SHRINKER_NONSLAB   (1 << 3)
 
 extern int prealloc_shrinker(struct shrinker *shrinker);
 extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0d04f242917..d0876970601e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -314,19 +314,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
int shrinker_id)
}
 }
 
-/*
- * We allow subsystems to populate their shrinker-related
- * LRU lists before register_shrinker_prepared() is called
- * for the shrinker, since we don't want to impose
- * restrictions on their internal registration order.
- * In this case shrink_slab_memcg() may find corresponding
- * bit is set in the shrinkers map.
- *
- * This value is used by the function to detect registering
- * shrinkers and to skip do_shrink_slab() calls for them.
- */
-#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
-
 static DEFINE_IDR(shrinker_idr);
 
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
@@ -335,7 +322,7 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
 
down_write(&shrinker_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
-   id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
+   id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
 
@@ -358,9 +345,9 @@ static void unregister_memcg_shrinker(struct shrinker 
*shrinker)
 
BUG_ON(id < 0);
 
-   down_write(&shrinker_rwsem);
+   lockdep_assert_held(&shrinker_rwsem);
+
idr_remove(&shrinker_idr, id);
-   up_write(&shrinker_rwsem);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
@@ -488,8 +475,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return;
 
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+   down_write(&shrinker_rwsem);
unregister_memcg_shrinker(shrinker);
+   up_write(&shrinker_rwsem);
+   }
 
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
@@ -499,10 +489,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 {
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
-#ifdef CONFIG_MEMCG
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-   idr_replace(&shrinker_idr, shrinker, shrinker->id);
-#endif
+   shrinker->flags |= SHRINKER_REGISTERED;
up_write(&shrinker_rwsem);
 }
 
@@ -522,13 +509,16 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-   if (!shrinker->nr_deferred)
+   if (!(shrinker->flags & SHRINKER_REGISTERED))
return;
-   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-   unregister_memcg_shrinker(shrinker);
+
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
+   shrinker->flags &= ~SHRINKER_REGISTERED;
+   if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+   unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);
+
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
 }
@@ -693,7 +683,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int 
nid,
struct shrinker *shrinker;
 
shrinker = idr_find(&shrinker_idr, i);
-   if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
+   if (unlikely(!shrinke

[v9 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

2021-03-10 Thread Yang Shi
Currently the number of deferred objects is per shrinker, but some slabs, for
example the vfs inode/dentry caches, are per memcg; this results in poor
isolation among memcgs.

The deferred objects are typically generated by __GFP_NOFS allocations.  One
memcg with excessive __GFP_NOFS allocations may blow up the deferred objects, and
then other innocent memcgs may suffer from over-shrinking, excessive reclaim
latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload
in B is vfs heavy.  The workload in A generates excessive deferred objects, so B's
vfs cache might be hit heavily (half of its caches dropped) by B's limit reclaim
or by global reclaim.

We observed this in our production environment, which was running a vfs-heavy
workload, as shown in the tracing log below:

<...>-409454 [016]  28286961.747146: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags 
GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022]  28287105.928018: mm_shrink_slab_end: 
super_cache_scan+0x0/0x1a0 9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 
602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches
were dropped.  This also resulted in a significant amount of page cache being
dropped due to inode eviction.

Making nr_deferred per memcg for memcg-aware shrinkers solves the unfairness and
brings better isolation.

The following patch will add nr_deferred to the parent memcg when the memcg is
offlined.  To preserve nr_deferred when reparenting memcgs to root, the root memcg
needs shrinker_info allocated too.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's own
nr_deferred is used, and non-memcg-aware shrinkers use the shrinker's nr_deferred
all the time.

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  7 +++--
 mm/vmscan.c| 60 ++
 2 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 308a202f1de2..42a4facb5b7c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,12 +97,13 @@ struct batched_lruvec_stat {
 };
 
 /*
- * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
- * which have elements charged to this memcg.
+ * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
+ * shrinkers, which have elements charged to this memcg.
  */
 struct shrinker_info {
struct rcu_head rcu;
-   unsigned long map[];
+   atomic_long_t *nr_deferred;
+   unsigned long *map;
 };
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d0876970601e..ae82afe6cec6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
 
+/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
 static inline int shrinker_map_size(int nr_items)
 {
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
+static inline int shrinker_defer_size(int nr_items)
+{
+   return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
+}
+
 static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 int nid)
 {
@@ -200,10 +206,12 @@ static struct shrinker_info 
*shrinker_info_protected(struct mem_cgroup *memcg,
 }
 
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
-   int size, int old_size)
+   int map_size, int defer_size,
+   int old_map_size, int old_defer_size)
 {
struct shrinker_info *new, *old;
int nid;
+   int size = map_size + defer_size;
 
for_each_node(nid) {
old = shrinker_info_protected(memcg, nid);
@@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup 
*memcg,
if (!new)
return -ENOMEM;
 
-   /* Set all old bits, clear all new bits */
-   memset(new->map, (int)0xff, old_size);
-   memset((void *)new->map + old_size, 0, size - old_size);
+   new->nr_deferred = (atomic_long_t *)(new + 1);
+   new->map = (void *)new->nr_deferred + defer_size;
+
+   /* map: set all old bits, clear all new bits */
+   memset(new->map, (int)0xff, old_map_size);
+   memset((void *)new->map + old_map_size, 0, map_size - 
old_map_size);
+   /* nr_deferred: copy old values, clear all new values */
+   memcpy(new->nr_deferred, old->nr_deferred, old_de

[v9 PATCH 06/13] mm: memcontrol: rename shrinker_map to shrinker_info

2021-03-10 Thread Yang Shi
The following patch is going to add nr_deferred into shrinker_map; the change means
shrinker_map will no longer contain only the map, so rename it to "shrinker_info",
also dropping the "memcg_" prefix.  This should make the patch adding nr_deferred
cleaner and more readable and make review easier.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  8 +++---
 mm/memcontrol.c|  6 ++--
 mm/vmscan.c| 58 +++---
 3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb2b7ef298ec..308a202f1de2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -100,7 +100,7 @@ struct batched_lruvec_stat {
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
  */
-struct memcg_shrinker_map {
+struct shrinker_info {
struct rcu_head rcu;
unsigned long map[];
 };
@@ -128,7 +128,7 @@ struct mem_cgroup_per_node {
 
struct mem_cgroup_reclaim_iter  iter;
 
-   struct memcg_shrinker_map __rcu *shrinker_map;
+   struct shrinker_info __rcu  *shrinker_info;
 
struct rb_node  tree_node;  /* RB tree node */
unsigned long   usage_in_excess;/* Set to the value by which */
@@ -1565,8 +1565,8 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-int alloc_shrinker_maps(struct mem_cgroup *memcg);
-void free_shrinker_maps(struct mem_cgroup *memcg);
+int alloc_shrinker_info(struct mem_cgroup *memcg);
+void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a89c960f768e..edd8a06c751f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5226,11 +5226,11 @@ static int mem_cgroup_css_online(struct 
cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
/*
-* A memcg must be visible for expand_shrinker_maps()
+* A memcg must be visible for expand_shrinker_info()
 * by the time the maps are allocated. So, we allocate maps
 * here, when for_each_mem_cgroup() can't skip it.
 */
-   if (alloc_shrinker_maps(memcg)) {
+   if (alloc_shrinker_info(memcg)) {
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -5294,7 +5294,7 @@ static void mem_cgroup_css_free(struct 
cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
mem_cgroup_remove_from_trees(memcg);
-   free_shrinker_maps(memcg);
+   free_shrinker_info(memcg);
memcg_free_kmem(memcg);
mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c9898e66011e..7f3c00e76fd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,15 +192,15 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
-static int expand_one_shrinker_map(struct mem_cgroup *memcg,
-  int size, int old_size)
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
+   int size, int old_size)
 {
-   struct memcg_shrinker_map *new, *old;
+   struct shrinker_info *new, *old;
int nid;
 
for_each_node(nid) {
old = rcu_dereference_protected(
-   mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+   mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
/* Not yet online memcg */
if (!old)
return 0;
@@ -213,17 +213,17 @@ static int expand_one_shrinker_map(struct mem_cgroup 
*memcg,
memset(new->map, (int)0xff, old_size);
memset((void *)new->map + old_size, 0, size - old_size);
 
-   rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+   rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
kvfree_rcu(old, rcu);
}
 
return 0;
 }
 
-void free_shrinker_maps(struct mem_cgroup *memcg)
+void free_shrinker_info(struct mem_cgroup *memcg)
 {
struct mem_cgroup_per_node *pn;
-   struct memcg_shrinker_map *map;
+   struct shrinker_info *info;
int nid;
 
if (mem_cgroup_is_root(memcg))
@@ -231,15 +231,15 @@ void free_shrinker_maps(struct mem_cgroup *memcg)
 
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
-   map = rcu_dereference_protected(pn->shrinker_map, true);
-   kvfree(map);
-   

[v9 PATCH 07/13] mm: vmscan: add shrinker_info_protected() helper

2021-03-10 Thread Yang Shi
The shrinker_info is dereferenced in a couple of places via rcu_dereference_protected
with different calling conventions, for example, using the mem_cgroup_nodeinfo helper
or dereferencing memcg->nodeinfo[nid]->shrinker_info directly.  And a later patch
will add more dereference sites.

So extract the dereference into a helper to make the code more readable.  No
functional change.

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Acked-by: Vlastimil Babka 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f3c00e76fd1..c0d04f242917 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,6 +192,13 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
+static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
+int nid)
+{
+   return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+lockdep_is_held(_rwsem));
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int size, int old_size)
 {
@@ -199,8 +206,7 @@ static int expand_one_shrinker_info(struct mem_cgroup 
*memcg,
int nid;
 
for_each_node(nid) {
-   old = rcu_dereference_protected(
-   mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
+   old = shrinker_info_protected(memcg, nid);
/* Not yet online memcg */
if (!old)
return 0;
@@ -231,7 +237,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
 
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
-   info = rcu_dereference_protected(pn->shrinker_info, true);
+   info = shrinker_info_protected(memcg, nid);
kvfree(info);
rcu_assign_pointer(pn->shrinker_info, NULL);
}
@@ -674,8 +680,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int 
nid,
if (!down_read_trylock(&shrinker_rwsem))
return 0;
 
-   info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-true);
+   info = shrinker_info_protected(memcg, nid);
if (unlikely(!info))
goto unlock;
 
-- 
2.26.2



[v9 PATCH 04/13] mm: vmscan: remove memcg_shrinker_map_size

2021-03-10 Thread Yang Shi
Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the map size
can actually be calculated from shrinker_nr_max, so it seems unnecessary to keep
both.  Remove memcg_shrinker_map_size, since shrinker_nr_max is also used when
iterating the bitmap.
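
For reference, a userspace transcription of the shrinker_map_size() helper this
patch introduces (illustration only), showing why the separate size variable is
redundant:

#include <stdio.h>

#define BITS_PER_LONG		(8 * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* The map size in bytes is fully determined by the number of shrinker ids. */
static unsigned long shrinker_map_size(int nr_items)
{
	return DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long);
}

int main(void)
{
	printf("map bytes for 40 shrinkers: %lu\n", shrinker_map_size(40));  /* 8  */
	printf("map bytes for 65 shrinkers: %lu\n", shrinker_map_size(65));  /* 16 */
	return 0;
}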

Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Acked-by: Vlastimil Babka 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 75fd8038a6c8..bda67e1ac84b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,8 +185,12 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_MEMCG
+static int shrinker_nr_max;
 
-static int memcg_shrinker_map_size;
+static inline int shrinker_map_size(int nr_items)
+{
+   return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
+}
 
 static void free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -247,7 +251,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
return 0;
 
down_write(&shrinker_rwsem);
-   size = memcg_shrinker_map_size;
+   size = shrinker_map_size(shrinker_nr_max);
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
if (!map) {
@@ -265,12 +269,13 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
 static int expand_shrinker_maps(int new_id)
 {
int size, old_size, ret = 0;
+   int new_nr_max = new_id + 1;
struct mem_cgroup *memcg;
 
-   size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-   old_size = memcg_shrinker_map_size;
+   size = shrinker_map_size(new_nr_max);
+   old_size = shrinker_map_size(shrinker_nr_max);
if (size <= old_size)
-   return 0;
+   goto out;
 
if (!root_mem_cgroup)
goto out;
@@ -289,7 +294,7 @@ static int expand_shrinker_maps(int new_id)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 out:
if (!ret)
-   memcg_shrinker_map_size = size;
+   shrinker_nr_max = new_nr_max;
 
return ret;
 }
@@ -322,7 +327,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
int shrinker_id)
 #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
 
 static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;
 
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
@@ -339,8 +343,6 @@ static int prealloc_memcg_shrinker(struct shrinker 
*shrinker)
idr_remove(&shrinker_idr, id);
goto unlock;
}
-
-   shrinker_nr_max = id + 1;
}
shrinker->id = id;
ret = 0;
-- 
2.26.2



[v9 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

2021-03-10 Thread Yang Shi
Use kvfree_rcu() to free the old shrinker_maps instead of call_rcu(); we no longer
have to define a dedicated callback for call_rcu().

Acked-by: Roman Gushchin 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bda67e1ac84b..c9898e66011e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
 }
 
-static void free_shrinker_map_rcu(struct rcu_head *head)
-{
-   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
 static int expand_one_shrinker_map(struct mem_cgroup *memcg,
   int size, int old_size)
 {
@@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
memset((void *)new->map + old_size, 0, size - old_size);
 
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
-   call_rcu(&old->rcu, free_shrinker_map_rcu);
+   kvfree_rcu(old, rcu);
}
 
return 0;
-- 
2.26.2



[v9 PATCH 03/13] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation

2021-03-10 Thread Yang Shi
Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
exclusively, the read side can be protected by holding the read lock, so a dedicated
mutex seems superfluous.

Kirill Tkhai suggested using the write lock since:

  * We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
  * The rcu_dereference_protected() in shrink_slab_memcg() relies on the lock, but
if we used the READ lock in alloc_shrinker_maps(), the dereferencing
would not actually be protected.
  * The READ lock makes alloc_shrinker_info() racy against memory allocation failure.
alloc_shrinker_info()->free_shrinker_info() may free memory right after
shrink_slab_memcg() dereferenced it. You may say
shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
but this is not the thing we want to have to remember in the future, since this
spreads modularity.

And a test with a heavy paging workload didn't show that the write lock makes things
worse.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ad164f3af9a0..75fd8038a6c8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 
 static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
 
 static void free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -200,8 +199,6 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
struct memcg_shrinker_map *new, *old;
int nid;
 
-   lockdep_assert_held(&memcg_shrinker_map_mutex);
-
for_each_node(nid) {
old = rcu_dereference_protected(
mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
@@ -249,7 +246,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
if (mem_cgroup_is_root(memcg))
return 0;
 
-   mutex_lock(&memcg_shrinker_map_mutex);
+   down_write(&shrinker_rwsem);
size = memcg_shrinker_map_size;
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
@@ -260,7 +257,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
}
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
}
-   mutex_unlock(&memcg_shrinker_map_mutex);
+   up_write(&shrinker_rwsem);
 
return ret;
 }
@@ -275,9 +272,10 @@ static int expand_shrinker_maps(int new_id)
if (size <= old_size)
return 0;
 
-   mutex_lock(&memcg_shrinker_map_mutex);
if (!root_mem_cgroup)
-   goto unlock;
+   goto out;
+
+   lockdep_assert_held(&shrinker_rwsem);
 
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
@@ -286,13 +284,13 @@ static int expand_shrinker_maps(int new_id)
ret = expand_one_shrinker_map(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
-   goto unlock;
+   goto out;
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-unlock:
+out:
if (!ret)
memcg_shrinker_map_size = size;
-   mutex_unlock(&memcg_shrinker_map_mutex);
+
return ret;
 }
 
-- 
2.26.2



[v9 PATCH 02/13] mm: vmscan: consolidate shrinker_maps handling code

2021-03-10 Thread Yang Shi
The shrinker map management is not purely memcg specific; it is at the intersection
between memory cgroups and shrinkers.  It is the allocation and assignment of a
structure, and the only memcg-specific bit is that the map is stored in a memcg
structure.  So move the shrinker_maps handling code into vmscan.c for tighter
integration with the shrinker code, and remove the "memcg_" prefix.  There is no
functional change.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Signed-off-by: Yang Shi 
---
 include/linux/memcontrol.h |  11 ++--
 mm/huge_memory.c   |   4 +-
 mm/list_lru.c  |   6 +-
 mm/memcontrol.c| 129 +---
 mm/vmscan.c| 131 -
 5 files changed, 141 insertions(+), 140 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..fb2b7ef298ec 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1565,10 +1565,9 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-extern int memcg_expand_shrinker_maps(int new_id);
-
-extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
-  int nid, int shrinker_id);
+int alloc_shrinker_maps(struct mem_cgroup *memcg);
+void free_shrinker_maps(struct mem_cgroup *memcg);
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 #else
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
@@ -1578,8 +1577,8 @@ static inline bool 
mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
 }
 
-static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
- int nid, int shrinker_id)
+static inline void set_shrinker_bit(struct mem_cgroup *memcg,
+   int nid, int shrinker_id)
 {
 }
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..e8008d2f8497 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2842,8 +2842,8 @@ void deferred_split_huge_page(struct page *page)
ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
if (memcg)
-   memcg_set_shrinker_bit(memcg, page_to_nid(page),
-  deferred_split_shrinker.id);
+   set_shrinker_bit(memcg, page_to_nid(page),
+deferred_split_shrinker.id);
 #endif
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 6f067b6b935f..cd58790d0fb3 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -125,8 +125,8 @@ bool list_lru_add(struct list_lru *lru, struct list_head 
*item)
list_add_tail(item, &l->list);
/* Set shrinker bit if the first element was added */
if (!l->nr_items++)
-   memcg_set_shrinker_bit(memcg, nid,
-  lru_shrinker_id(lru));
+   set_shrinker_bit(memcg, nid,
+lru_shrinker_id(lru));
nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
@@ -540,7 +540,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, 
int nid,
 
if (src->nr_items) {
dst->nr_items += src->nr_items;
-   memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
+   set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
src->nr_items = 0;
}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..a89c960f768e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -402,129 +402,6 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
-static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
-
-static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
-{
-   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
-static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
-int size, int old_size)
-{
-   struct memcg_shrinker_map *new, *old;
-   int nid;
-
-   lockdep_assert_held(&memcg_shrinker_map_mutex);
-
-   for_each_node(nid) {
-   old = rcu_dereference_protected(
-   mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
-   /* Not yet online memcg */
-   if (!old)
-   return 0;
-
-   new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
-   if (!new)
-   return -ENOMEM;
-
-   /* Set all old bits, clear all new bi

[v9 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-03-10 Thread Yang Shi
Then kswapd will shrink half of the dentry cache in just one loop, as the tracing
result below shows:

kswapd0-475   [028]  305968.252561: mm_shrink_slab_start: 
super_cache_scan+0x0/0x190 24acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 
45746 total_scan 46844936 priority 12
kswapd0-475   [021]  306013.099399: mm_shrink_slab_end: 
super_cache_scan+0x0/0x190 24acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker 
return val 46844928

There was a huge number of deferred objects before the shrinker was called; the
behavior does match the code, but it is probably not desirable from the user's
standpoint.

An excessive amount of nr_deferred may accumulate for various reasons, for example:
* GFP_NOFS allocations
* Many rounds of small scans (< scan_batch, 1024 for vfs metadata)

However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) while the
deferred objects are per shrinker, and this may have some bad effects:
* Poor isolation among memcgs. Some memcgs which happen to do frequent limit
  reclaim may get nr_deferred accumulated to a huge number, and then other innocent
  memcgs take the fall. In our case the main workload was hit.
* Unbounded deferred objects. There is no cap on deferred objects, so they can grow
  ridiculously large, as the tracing result showed.
* Easy to get out of control. Although shrinkers take deferred objects into account,
  the count can still go out of control easily. One misconfigured memcg could incur an
  absurd amount of deferred objects in a short period of time.
* Assorted reclaim problems, i.e. over-reclaim, long reclaim latency, etc. There may be
  hundreds of GB of slab caches for a vfs-metadata-heavy workload; shrinking half of
  them may take minutes. We observed latency spikes due to the prolonged reclaim.

These issues also have been discussed in 
https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828...@gmail.com/.
The patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
* Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
  does. Unlike the map, it is an atomic_long_t array; each element represents one
  shrinker even if the shrinker is not memcg aware, which simplifies the
  implementation. For memcg-aware shrinkers, the deferred objects are just
  accumulated to their own memcg, and those shrinkers only see nr_deferred from
  their own memcg. Non-memcg-aware shrinkers still use the global nr_deferred from
  struct shrinker.
* Once a memcg is offlined, its nr_deferred is reparented to its parent along
  with the LRUs.
* The root memcg has a memcg_shrinker_deferred array too. This simplifies the
  handling of reparenting to the root memcg.
* Cap nr_deferred to 2x the length of the lru. The idea is borrowed from
  Dave Chinner's series
  (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-da...@fromorbit.com/)

The downside is that each memcg has to allocate extra memory to store the nr_deferred
array. In our production environment there are typically around 40 shrinkers, so each
memcg needs ~320 bytes. 10K memcgs would need ~3.2MB of memory. It seems fine.

We have been running the patched kernel on some hosts of our fleet (test and
production) for months, and it works very well. The monitoring data shows the working
set is sustained as expected.

Yang Shi (13):
  mm: vmscan: use nid from shrink_control for tracepoint
  mm: vmscan: consolidate shrinker_maps handling code
  mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  mm: vmscan: remove memcg_shrinker_map_size
  mm: vmscan: use kvfree_rcu instead of call_rcu
  mm: memcontrol: rename shrinker_map to shrinker_info
  mm: vmscan: add shrinker_info_protected() helper
  mm: vmscan: use a new flag to indicate shrinker is registered
  mm: vmscan: add per memcg shrinker nr_deferred
  mm: vmscan: use per memcg nr_deferred of shrinker
  mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware 
shrinkers
  mm: memcontrol: reparent nr_deferred when memcg offline
  mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |  23 +++---
 include/linux/shrinker.h   |   7 +-
 mm/huge_memory.c   |   4 +-
 mm/list_lru.c  |   6 +-
 mm/memcontrol.c| 130 +--
 mm/vmscan.c| 394 

 6 files changed, 319 insertions(+), 245 deletions(-)



[v9 PATCH 01/13] mm: vmscan: use nid from shrink_control for tracepoint

2021-03-10 Thread Yang Shi
The tracepoint's nid should show which node the shrink happens on. The start
tracepoint uses nid from shrinkctl, but nid might be set to 0 before the end
tracepoint if the shrinker is not NUMA aware, so the tracing log may show the
shrink starting on one node but ending up on another, which is confusing. The
following patch will also stop using nid directly in do_shrink_slab(), so this
patch helps clean up the code as well.

Acked-by: Vlastimil Babka 
Acked-by: Kirill Tkhai 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Signed-off-by: Yang Shi 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..31d116ea59a9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,7 +536,7 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
else
new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
 
-   trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+   trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, 
total_scan);
return freed;
 }
 
-- 
2.26.2



Re: [PATCH 00/10] [v6] Migrate Pages in lieu of discard

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:00 PM Dave Hansen  wrote:
>
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20210304
>
> which also inclues some vm.zone_reclaim_mode sysctl ABI fixup
> prerequisites.
>
> The meat of this patch is in:
>
> [PATCH 05/10] mm/migrate: demote pages during reclaim
>
> Which also has the most changes since the last post.  This version is
> mostly to address review comments from Yang Shi and Oscar Salvador.
> Review comments are documented in the individual patch changelogs.
>
> This also contains a few prerequisite patches that fix up an issue
> with the vm.zone_reclaim_mode sysctl ABI.
>
> Changes since (automigrate-20210122):
>  * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE since pages *are*
>movable.
>  * Separate out helpers that check for being able to relaim anonymous
>pages versus being able to meaningfully scan the anon LRU.
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
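
A condensed sketch of that flow as implemented by the series (illustrative
only; see patch 05/10 for the real code): pages that qualify for demotion are
diverted to a local list instead of being freed, and the whole list is
migrated to the next tier in one batch after the main loop:

		/* inside the main loop of shrink_page_list() */
		if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
			list_add(&page->lru, &demote_pages);
			unlock_page(page);
			continue;
		}
	...
	/* after the loop: migrate the batch to the next memory tier */
	nr_reclaimed += demote_page_list(&demote_pages, pgdat);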
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37a4a...@viggo.jf.intel.com
>
> == Open Issues ==
>
>  * For cpusets and memory policies that restrict allocations
>to PMEM, is it OK to demote to PMEM?  Do we need a cgroup-
>level API to opt-in or opt-out of these migrations?

I'm wondering whether such use cases, which don't want memory allocated on
pmem, would allow memory to be swapped out or reclaimed at all. If swap is
allowed, then I fail to see why migrating to pmem should be disallowed. If
swap is not allowed, they should call mlock, and then the memory won't be
migrated to pmem either.
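
For instance, an application that must keep a range out of both swap and pmem
can simply pin it; a minimal userspace illustration (not part of any of these
series):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Pin a buffer so it becomes unevictable: mlocked pages are skipped by
 * reclaim, so they are neither swapped out nor demoted to a slower tier.
 */
static int pin_buffer(void *buf, size_t len)
{
	return mlock(buf, len);	/* 0 on success, -1 with errno set on failure */
}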

>  * Could be more aggressive about where anon LRU scanning occurs
>since it no longer necessarily involves I/O.  get_scan_count()
>for instance says: "If we have no swap space, do not bother
>scanning anon pages"

Yes, I agree. Johannes's patchset
(https://lore.kernel.org/linux-mm/20200520232525.798933-1-han...@cmpxchg.org/#r)
has lifted the maximum swappiness to 200 so the anonymous LRU can be scanned
more aggressively. We could definitely tweak this if needed.

>
> --
>
>  Documentation/admin-guide/sysctl/vm.rst |9
>  include/linux/migrate.h |   20 +
>  include/linux/swap.h|3
>  include/linux/vm_event_item.h   |2
>  include/trace/events/migrate.h  |3
>  include/uapi/linux/mempolicy.h  |1
>  mm/compaction.c |3
>  mm/gup.c|4
>  mm/internal.h   |5
>  mm/memory-failure.c |4
>  mm/memory_hotplug.c |4
>  mm/mempolicy.c  |8
>  mm/migrate.c|  369 
> +---
>  mm/page_alloc.c |   13 -
>  mm/vmscan.c |  173 +--
>  mm/vmstat.c |2
>  16 files changed, 560 insertions(+), 63 deletions(-)
>
> --
>
> Changes since (automigrate-20200818):
>  * Fall back to normal reclaim when demotion fails
>  * Fix some compile issues, when page migration and NUMA are off
>
> Changes since (automigrate-20201007):
>  * separate out checks for "can scan anon LRU" from "can actually
>swap anon pages right now".  Previous series conflated them
>and may have been overly aggressive scanning LRU
>  * add MR_DEMOTION to tr

Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:01 PM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.

I think we'd better put the cpuset violation paragraph in the new
zone_reclaim_mode documentation text so that users are aware of the potential
violation. I don't think the commit log is the go-to place for plain users.
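
For reference, once documented, enabling the new mode from userspace is just a
matter of setting the new bit (8) in the existing sysctl; a minimal
illustration (not from the series):

#include <stdio.h>

/*
 * Turn on reclaim-based demotion by writing the RECLAIM_MIGRATE bit (8) to
 * vm.zone_reclaim_mode; other bits from the documented table can be OR'ed in
 * if node reclaim is also wanted.
 */
static int enable_reclaim_migrate(void)
{
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", 8);
	return fclose(f);
}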

>
> Signed-off-by: Dave Hansen 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |9 +
>  b/include/linux/swap.h|3 ++-
>  b/include/uapi/linux/mempolicy.h  |1 +
>  b/mm/vmscan.c |6 --
>  4 files changed, 16 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 
> Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-04 
> 15:36:26.078806355 -0800
> +++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-04 15:36:26.093806355 
> -0800
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1  Zone reclaim on
>  2  Zone reclaim writes dirty pages out
>  4  Zone reclaim swaps pages
> +8  Zone reclaim migrates pages
>  =  ===
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-04 15:36:26.082806355 
> -0800
> +++ b/include/linux/swap.h  2021-03-04 15:36:26.093806355 -0800
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
> /* Is any node_reclaim_mode bit set? */
> -   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +   RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 
> include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE2021-03-04 
> 15:36:26.084806355 -0800
> +++ b/include/uapi/linux/mempolicy.h2021-03-04 15:36:26.094806355 -0800
> @@ -69,5 +69,6 @@ enum {
>  #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
>  #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
>  #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during rec

Re: [PATCH 09/10] mm/vmscan: never demote for memcg reclaim

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:02 PM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Global reclaim aims to reduce the amount of memory used on
> a given node or set of nodes.  Migrating pages to another
> node serves this purpose.
>
> memcg reclaim is different.  Its goal is to reduce the
> total memory consumption of the entire memcg, across all
> nodes.  Migration does not assist memcg reclaim because
> it just moves page contents between nodes rather than
> actually reducing memory consumption.

Reviewed-by: Yang Shi 

>
> Signed-off-by: Dave Hansen 
> Suggested-by: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> ---
>
>  b/mm/vmscan.c |   18 --
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
> --- a/mm/vmscan.c~never-demote-for-memcg-reclaim2021-03-04 
> 15:36:01.067806417 -0800
> +++ b/mm/vmscan.c   2021-03-04 15:36:01.072806417 -0800
> @@ -288,7 +288,8 @@ static bool writeback_throttling_sane(st
>  #endif
>
>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> - int node_id)
> + int node_id,
> + struct scan_control *sc)
>  {
> if (memcg == NULL) {
> /*
> @@ -326,7 +327,7 @@ unsigned long zone_reclaimable_pages(str
>
> nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
> +   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
> nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -1063,7 +1064,8 @@ static enum page_references page_check_r
> return PAGEREF_RECLAIM;
>  }
>
> -static bool migrate_demote_page_ok(struct page *page)
> +static bool migrate_demote_page_ok(struct page *page,
> +  struct scan_control *sc)
>  {
> int next_nid = next_demotion_node(page_to_nid(page));
>
> @@ -1071,6 +1073,10 @@ static bool migrate_demote_page_ok(struc
> VM_BUG_ON_PAGE(PageHuge(page), page);
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +   /* It is pointless to do demotion in memcg reclaim */
> +   if (cgroup_reclaim(sc))
> +   return false;
> +
> if (next_nid == NUMA_NO_NODE)
> return false;
> if (PageTransHuge(page) && !thp_migration_supported())
> @@ -1326,7 +1332,7 @@ retry:
>  * Before reclaiming the page, try to relocate
>  * its contents to another node.
>  */
> -   if (do_demote_pass && migrate_demote_page_ok(page)) {
> +   if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
> list_add(>lru, _pages);
> unlock_page(page);
> continue;
> @@ -2371,7 +2377,7 @@ static void get_scan_count(struct lruvec
> enum lru_list lru;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> -   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
> +   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, 
> sc)) {
> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2746,7 +2752,7 @@ static inline bool should_continue_recla
>  */
> pages_for_compaction = compact_gap(sc->order);
> inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
> +   if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
> inactive_lru_pages += node_page_state(pgdat, 
> NR_INACTIVE_ANON);
>
> return inactive_lru_pages > pages_for_compaction;
> _
>


Re: [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:01 PM Dave Hansen  wrote:
>
>
> From: Keith Busch 
>
> Reclaim anonymous pages if a migration path is available now that
> demotion provides a non-swap recourse for reclaiming anon pages.
>
> Note that this check is subtly different from the
> anon_should_be_aged() checks.  This mechanism checks whether a
> specific page in a specific context *can* actually be reclaimed, given
> current swap space and cgroup limits
>
> anon_should_be_aged() is a much simpler and more prelimiary check

Just a typo, s/prelimiary/preliminary

> which just says whether there is a possibility of future reclaim.

Reviewed-by: Yang Shi 

>
> #Signed-off-by: Keith Busch 
> Cc: Keith Busch 
> Signed-off-by: Dave Hansen 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> --
>
> Changes from Dave 10/2020:
>  * remove 'total_swap_pages' modification
>
> Changes from Dave 06/2020:
>  * rename reclaim_anon_pages()->can_reclaim_anon_pages()
>
> Note: Keith's Intel SoB is commented out because he is no
> longer at Intel and his @intel.com mail will bounce.
> ---
>
>  b/mm/vmscan.c |   35 ---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 
> mm/vmscan.c
> --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap  
> 2021-03-04 15:35:59.994806420 -0800
> +++ b/mm/vmscan.c   2021-03-04 15:36:00.001806420 -0800
> @@ -287,6 +287,34 @@ static bool writeback_throttling_sane(st
>  }
>  #endif
>
> +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> + int node_id)
> +{
> +   if (memcg == NULL) {
> +   /*
> +* For non-memcg reclaim, is there
> +* space in any swap device?
> +*/
> +   if (get_nr_swap_pages() > 0)
> +   return true;
> +   } else {
> +   /* Is the memcg below its swap limit? */
> +   if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +   return true;
> +   }
> +
> +   /*
> +* The page can not be swapped.
> +*
> +* Can it be reclaimed from this node via demotion?
> +*/
> +   if (next_demotion_node(node_id) >= 0)
> +   return true;
> +
> +   /* No way to reclaim anon pages */
> +   return false;
> +}
> +
>  /*
>   * This misses isolated pages which are not accounted for to save counters.
>   * As the data only determines if reclaim or compaction continues, it is
> @@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
>
> nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -   if (get_nr_swap_pages() > 0)
> +   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
> nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -2332,6 +2360,7 @@ enum scan_balance {
>  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>unsigned long *nr)
>  {
> +   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> int swappiness = mem_cgroup_swappiness(memcg);
> @@ -2342,7 +2371,7 @@ static void get_scan_count(struct lruvec
> enum lru_list lru;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> -   if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
> +   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2717,7 +2746,7 @@ static inline bool should_continue_recla
>  */
> pages_for_compaction = compact_gap(sc->order);
> inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -   if (get_nr_swap_pages() > 0)
> +   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
> inactive_lru_pages += node_page_state(pgdat, 
> NR_INACTIVE_ANON);
>
> return inactive_lru_pages > pages_for_compaction;
> _
>


Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:01 PM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Anonymous pages are kept on their own LRU(s).  These lists could
> theoretically always be scanned and maintained.  But, without swap,
> there is currently nothing the kernel can *do* with the results of a
> scanned, sorted LRU for anonymous pages.
>
> A check for '!total_swap_pages' currently serves as a valid check as
> to whether anonymous LRUs should be maintained.  However, another
> method will be added shortly: page demotion.
>
> Abstract out the 'total_swap_pages' checks into a helper, give it a
> logically significant name, and check for the possibility of page
> demotion.

Reviewed-by: Yang Shi 

>
> Signed-off-by: Dave Hansen 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> ---
>
>  b/mm/vmscan.c |   28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged2021-03-04 15:35:58.935806422 
> -0800
> +++ b/mm/vmscan.c   2021-03-04 15:35:58.942806422 -0800
> @@ -2517,6 +2517,26 @@ out:
> }
>  }
>
> +/*
> + * Anonymous LRU management is a waste if there is
> + * ultimately no way to reclaim the memory.
> + */
> +bool anon_should_be_aged(struct lruvec *lruvec)
> +{
> +   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +   /* Aging the anon LRU is valuable if swap is present: */
> +   if (total_swap_pages > 0)
> +   return true;
> +
> +   /* Also valuable if anon pages can be demoted: */
> +   if (next_demotion_node(pgdat->node_id) >= 0)
> +   return true;
> +
> +   /* No way to reclaim anon pages.  Should not age anon LRUs: */
> +   return false;
> +}
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> unsigned long nr[NR_LRU_LISTS];
> @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec
>  * Even if we did not try to evict anon pages at all, we want to
>  * rebalance the anon lru active/inactive ratio.
>  */
> -   if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> +   if (anon_should_be_aged(lruvec) &&
> +   inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>sc, LRU_ACTIVE_ANON);
>  }
> @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis
> struct mem_cgroup *memcg;
> struct lruvec *lruvec;
>
> -   if (!total_swap_pages)
> +   lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> +   if (!anon_should_be_aged(lruvec))
> return;
>
> -   lruvec = mem_cgroup_lruvec(NULL, pgdat);
> if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> return;
>
> _
>


Re: [PATCH 06/10] mm/vmscan: add page demotion counter

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:01 PM Dave Hansen  wrote:
>
>
> From: Yang Shi 
>
> Account the number of demoted pages into reclaim_state->nr_demoted.
>
> Add pgdemote_kswapd and pgdemote_direct VM counters showed in
> /proc/vmstat.
>
> [ daveh:
>- __count_vm_events() a bit, and made them look at the THP
>  size directly rather than getting data from migrate_pages()
> ]
>
> Signed-off-by: Yang Shi 
> Signed-off-by: Dave Hansen 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> --
>
> Changes since 202010:
>  * remove unused scan-control 'demoted' field

Reviewed-by: Yang Shi 

> ---
>
>  b/include/linux/vm_event_item.h |2 ++
>  b/mm/vmscan.c   |5 +
>  b/mm/vmstat.c   |2 ++
>  3 files changed, 9 insertions(+)
>
> diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 
> include/linux/vm_event_item.h
> --- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter 
> 2021-03-04 15:35:57.698806425 -0800
> +++ b/include/linux/vm_event_item.h 2021-03-04 15:35:57.719806425 -0800
> @@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
> PGREUSE,
> PGSTEAL_KSWAPD,
> PGSTEAL_DIRECT,
> +   PGDEMOTE_KSWAPD,
> +   PGDEMOTE_DIRECT,
> PGSCAN_KSWAPD,
> PGSCAN_DIRECT,
> PGSCAN_DIRECT_THROTTLE,
> diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter   2021-03-04 
> 15:35:57.700806425 -0800
> +++ b/mm/vmscan.c   2021-03-04 15:35:57.724806425 -0800
> @@ -1118,6 +1118,11 @@ static unsigned int demote_page_list(str
> target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> _succeeded);
>
> +   if (current_is_kswapd())
> +   __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
> +   else
> +   __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> +
> return nr_succeeded;
>  }
>
> diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
> --- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter   2021-03-04 
> 15:35:57.708806425 -0800
> +++ b/mm/vmstat.c   2021-03-04 15:35:57.726806425 -0800
> @@ -1244,6 +1244,8 @@ const char * const vmstat_text[] = {
> "pgreuse",
> "pgsteal_kswapd",
> "pgsteal_direct",
> +   "pgdemote_kswapd",
> +   "pgdemote_direct",
> "pgscan_kswapd",
> "pgscan_direct",
> "pgscan_direct_throttle",
> _
>

