Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-04-09 Thread Wei Xu
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.

Nit: now now -> now

> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).

Nit: it it -> it if

> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE   2021-03-31 15:17:40.339000190 -0700
> +++ b/mm/vmscan.c   2021-03-31 15:17:40.357000190 -0700
> @@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
>     VM_BUG_ON_PAGE(PageHuge(page), page);
>     VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +   if (!(node_reclaim_mode & RECLAIM_MIGRATE))
> +       return false;
> +

As I commented on an earlier patch in this series, the RECLAIM_MIGRATE check
needs to be added to other new callers of next_demotion_node() as well to avoid
unnecessarily splitting THP pages when neither swap nor RECLAIM_MIGRATE
is enabled.  It can be too late to check RECLAIM_MIGRATE only in
migrate_demote_page_ok().
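
To illustrate the point above, here is a minimal user-space sketch (not kernel code) of hoisting the RECLAIM_MIGRATE test into a helper that a caller consults before doing expensive work such as splitting a THP. The bit values are taken from the quoted mempolicy.h hunk; the names demotion_enabled() and should_split_thp(), and the swap_available parameter, are hypothetical and only stand in for whatever the real callers of next_demotion_node() would test.

```c
#include <stdbool.h>

/* Bit values as defined in the include/uapi/linux/mempolicy.h hunk. */
#define RECLAIM_ZONE    (1 << 0)
#define RECLAIM_WRITE   (1 << 1)
#define RECLAIM_UNMAP   (1 << 2)
#define RECLAIM_MIGRATE (1 << 3)

/* Stand-in for the kernel's node_reclaim_mode sysctl variable. */
static int node_reclaim_mode;

/* Hypothetical helper: true only when demotion has been opted into. */
static bool demotion_enabled(void)
{
	return (node_reclaim_mode & RECLAIM_MIGRATE) != 0;
}

/*
 * Hypothetical caller-side check: splitting a huge page is only
 * worthwhile if the page can then go somewhere, i.e. to swap or to
 * a demotion target. swap_available is an assumed input.
 */
static bool should_split_thp(bool swap_available)
{
	return swap_available || demotion_enabled();
}
```

With node_reclaim_mode set to RECLAIM_ZONE only and no swap, should_split_thp(false) returns false, so the page is left intact; once RECLAIM_MIGRATE is OR'ed in, the same call returns true.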


Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-04-01 Thread Yang Shi
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.
>
> Signed-off-by: Dave Hansen 
> Cc: Wei Xu 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> Changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
>
> Changes since 20210304:
>  * Add Documentation/ material about relaxing cpuset constraints

Reviewed-by: Yang Shi 

> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |   12 ++++++++++++
>  b/include/linux/swap.h                    |    3 ++-
>  b/include/uapi/linux/mempolicy.h          |    1 +
>  b/mm/vmscan.c                             |    6 ++++--
>  4 files changed, 19 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-31 15:17:40.324000190 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-31 15:17:40.349000190 -0700
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1  Zone reclaim on
>  2  Zone reclaim writes dirty pages out
>  4  Zone reclaim swaps pages
> +8  Zone reclaim migrates pages
>  =  ===
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,14 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.  It may move data to a NUMA node that does not
> +fall into the cpuset of the allocating process which might be construed
> +to violate the guarantees of cpusets.  This should not be enabled on
> +systems which need strict cpuset location guarantees.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.331000190 -0700
> +++ b/include/linux/swap.h  2021-03-31 15:17:40.351000190 -0700
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
>     /* Is any node_reclaim_mode bit set? */
> -   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +                               RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.337000190 -0700
> +++ b/include/uapi/linux/mempolicy.h  2021-03-31 15:17:40.352000190 -0700
> @@ -71,5 +71,6 @@ enum {
>  #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
>  #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
>  #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */
>
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff -puN 

[PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-04-01 Thread Dave Hansen


From: Dave Hansen 

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
now really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it it workload harm is detected (just
like traditional autonuma).

Once this is enabled page demotion may move data to a NUMA node
that does not fall into the cpuset of the allocating process.
This could be construed to violate the guarantees of cpusets.
However, since this is an opt-in mechanism, the assumption is
that anyone enabling it is content to relax the guarantees.

Signed-off-by: Dave Hansen 
Cc: Wei Xu 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

Changes since 20200122:
 * Changelog material about relaxing cpuset constraints

Changes since 20210304:
 * Add Documentation/ material about relaxing cpuset constraints
---

 b/Documentation/admin-guide/sysctl/vm.rst |   12 ++++++++++++
 b/include/linux/swap.h                    |    3 ++-
 b/include/uapi/linux/mempolicy.h          |    1 +
 b/mm/vmscan.c                             |    6 ++++--
 4 files changed, 19 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-31 15:17:40.324000190 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-31 15:17:40.349000190 -0700
@@ -976,6 +976,7 @@ This is value OR'ed together of
 1  Zone reclaim on
 2  Zone reclaim writes dirty pages out
 4  Zone reclaim swaps pages
+8  Zone reclaim migrates pages
 =  ===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -1000,3 +1001,14 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.  It may move data to a NUMA node that does not
+fall into the cpuset of the allocating process which might be construed
+to violate the guarantees of cpusets.  This should not be enabled on
+systems which need strict cpuset location guarantees.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.331000190 -0700
+++ b/include/linux/swap.h  2021-03-31 15:17:40.351000190 -0700
@@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
 static inline bool node_reclaim_enabled(void)
 {
    /* Is any node_reclaim_mode bit set? */
-   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
+                               RECLAIM_UNMAP|RECLAIM_MIGRATE);
 }
 
 extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE  2021-03-31 15:17:40.337000190 -0700
+++ b/include/uapi/linux/mempolicy.h  2021-03-31 15:17:40.352000190 -0700
@@ -71,5 +71,6 @@ enum {
 #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
 #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE   2021-03-31 15:17:40.339000190 -0700
+++ b/mm/vmscan.c   2021-03-31 15:17:40.357000190 -0700
@@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
    VM_BUG_ON_PAGE(PageHuge(page), page);
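
As a reading aid for the vm.rst table in the patch above, the following user-space sketch decodes a zone_reclaim_mode value into the bits it sets. The bit values come from the mempolicy.h hunk, and RECLAIM_MIGRATE (8) exists only with this patch applied. The function names reclaim_mode_enabled() and reclaim_mode_describe() and the short labels are made up for this example; reclaim_mode_enabled() mirrors the patched node_reclaim_enabled() but takes the mode as a parameter.

```c
#include <stddef.h>
#include <stdio.h>

/* Bit values from include/uapi/linux/mempolicy.h as patched. */
#define RECLAIM_ZONE    (1 << 0)
#define RECLAIM_WRITE   (1 << 1)
#define RECLAIM_UNMAP   (1 << 2)
#define RECLAIM_MIGRATE (1 << 3)  /* proposed by this patch */

/* Same test as the patched node_reclaim_enabled(): is any known bit set? */
static int reclaim_mode_enabled(int mode)
{
	return (mode & (RECLAIM_ZONE | RECLAIM_WRITE |
			RECLAIM_UNMAP | RECLAIM_MIGRATE)) != 0;
}

/*
 * Write a comma-separated list of the bits set in mode into buf.
 * The labels are illustrative, not kernel names. Returns buf.
 */
static const char *reclaim_mode_describe(int mode, char *buf, size_t len)
{
	static const struct { int bit; const char *name; } bits[] = {
		{ RECLAIM_ZONE,    "zone"    },
		{ RECLAIM_WRITE,   "write"   },
		{ RECLAIM_UNMAP,   "unmap"   },
		{ RECLAIM_MIGRATE, "migrate" },
	};
	size_t used = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		int n;

		if (!(mode & bits[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "," : "", bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small: stop, output is truncated */
		used += (size_t)n;
	}
	return buf;
}
```

For example, a mode of 9 (1 | 8) describes as "zone,migrate", and a mode of 0 yields an empty string with reclaim_mode_enabled() returning 0.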

Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-09 Thread Dave Hansen
On 3/8/21 4:24 PM, Yang Shi wrote:
>> Once this is enabled page demotion may move data to a NUMA node
>> that does not fall into the cpuset of the allocating process.
>> This could be construed to violate the guarantees of cpusets.
>> However, since this is an opt-in mechanism, the assumption is
>> that anyone enabling it is content to relax the guarantees.
> I think we'd better have the cpuset violation paragraph along with the new
> zone reclaim mode text so that users are aware of the potential
> violation. I don't think the commit log is the go-to place for ordinary
> users.
> 

Agreed.  I'll add it to the Documentation/.


Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-08 Thread Yang Shi
On Thu, Mar 4, 2021 at 4:01 PM Dave Hansen  wrote:
>
>
> From: Dave Hansen 
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.

I think we'd better have the cpuset violation paragraph along with the new
zone reclaim mode text so that users are aware of the potential
violation. I don't think the commit log is the go-to place for ordinary
users.

>
> Signed-off-by: Dave Hansen 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |    9 +++++++++
>  b/include/linux/swap.h                    |    3 ++-
>  b/include/uapi/linux/mempolicy.h          |    1 +
>  b/mm/vmscan.c                             |    6 ++++--
>  4 files changed, 16 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-04 15:36:26.078806355 -0800
> +++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-04 15:36:26.093806355 -0800
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1  Zone reclaim on
>  2  Zone reclaim writes dirty pages out
>  4  Zone reclaim swaps pages
> +8  Zone reclaim migrates pages
>  =  ===
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-04 15:36:26.082806355 -0800
> +++ b/include/linux/swap.h  2021-03-04 15:36:26.093806355 -0800
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
>     /* Is any node_reclaim_mode bit set? */
> -   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +                               RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE  2021-03-04 15:36:26.084806355 -0800
> +++ b/include/uapi/linux/mempolicy.h  2021-03-04 15:36:26.094806355 -0800
> @@ -69,5 +69,6 @@ enum {
>  #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
>  #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
>  #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */
>
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE   2021-03-04 15:36:26.087806355 -0800
> +++ b/mm/vmscan.c   2021-03-04 15:36:26.096806355 -0800
> @@ -1073,6 

[PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

2021-03-04 Thread Dave Hansen


From: Dave Hansen 

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
now really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it it workload harm is detected (just
like traditional autonuma).

Once this is enabled page demotion may move data to a NUMA node
that does not fall into the cpuset of the allocating process.
This could be construed to violate the guarantees of cpusets.
However, since this is an opt-in mechanism, the assumption is
that anyone enabling it is content to relax the guarantees.

Signed-off-by: Dave Hansen 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

changes since 20200122:
 * Changelog material about relaxing cpuset constraints
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 +++++++++
 b/include/linux/swap.h                    |    3 ++-
 b/include/uapi/linux/mempolicy.h          |    1 +
 b/mm/vmscan.c                             |    6 ++++--
 4 files changed, 16 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-04 15:36:26.078806355 -0800
+++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-04 15:36:26.093806355 -0800
@@ -976,6 +976,7 @@ This is value OR'ed together of
 1  Zone reclaim on
 2  Zone reclaim writes dirty pages out
 4  Zone reclaim swaps pages
+8  Zone reclaim migrates pages
 =  ===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -1000,3 +1001,11 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE  2021-03-04 15:36:26.082806355 -0800
+++ b/include/linux/swap.h  2021-03-04 15:36:26.093806355 -0800
@@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
 static inline bool node_reclaim_enabled(void)
 {
    /* Is any node_reclaim_mode bit set? */
-   return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+   return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
+                               RECLAIM_UNMAP|RECLAIM_MIGRATE);
 }
 
 extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE  2021-03-04 15:36:26.084806355 -0800
+++ b/include/uapi/linux/mempolicy.h  2021-03-04 15:36:26.094806355 -0800
@@ -69,5 +69,6 @@ enum {
 #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
 #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE   2021-03-04 15:36:26.087806355 -0800
+++ b/mm/vmscan.c   2021-03-04 15:36:26.096806355 -0800
@@ -1073,6 +1073,9 @@ static bool migrate_demote_page_ok(struc
    VM_BUG_ON_PAGE(PageHuge(page), page);
    VM_BUG_ON_PAGE(PageLRU(page), page);
 
+   if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+       return false;
+
    /* It is pointless to do demotion in memcg reclaim */
    if (cgroup_reclaim(sc))
        return false;
@@ -1082,8 +1085,7 @@ static bool migrate_demote_page_ok(struc
    if (PageTransHuge(page) &&