On Tue, Jan 26, 2016 at 04:36:14PM +0100, Vlastimil Babka wrote:
> Memory compaction can be currently performed in several contexts:
> 
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP page
>   fault attemps
> - khugepaged trying to collapse a hugepage
> - manually from /proc
> 
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overal memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate though.
> 
> The current situation wrt the purposes has a few drawbacks:
> 
> - compaction is invoked only when a high-order page or hugepage is not
>   available (or manually). This might be too late for the purposes of keeping
>   memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
>   better if compaction was performed asynchronously to keep fragmentation low,
>   before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
>   faults can easily offset the benefits of THP.
> - kswapd compaction appears to be complex, fragile and not working in some
>   scenarios
> 

An addendum to that is that kswapd can be compacting for a high-order
allocation request when it should be reclaiming memory for an order-0
request.

My recollection is that kswapd compacting was meant to help atomic
high-order allocations but I wonder if the same problem even exists with
the revised watermark handling.

> To improve the situation, we should be able to benefit from an equivalent of
> kswapd, but for compaction - i.e. a background thread which responds to
> fragmentation and the need for high-order allocations (including hugepages)
> somewhat proactively.
> 
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
> 
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
> 

That also would not handle the atomic high-order allocation case.

> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations, without introducing any new tunables.
> The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.
> 
> Waking up of the kcompactd threads is also tied to kswapd activity and follows
> these rules:
> - we don't want to affect any fastpaths, so wake up kcompactd only from the
>   slowpath, as it's done for kswapd

Ok

> - if kswapd is doing reclaim, it's more important than compaction, so don't
>   invoke kcompactd until kswapd goes to sleep

This makes sense given that kswapd can be reclaiming order-0 pages so
compaction can even start with a reasonable chance of success.

> - the target order used for kswapd is passed to kcompactd
> 
> The kswapd compact/reclaim loop for high-order pages will be removed in the
> next patch with the description of what's wrong with it.
> 
> In this patch, kcompactd uses the standard compaction_suitable() and
> compact_finished() criteria, which means it will most likely have nothing left
> to do after kswapd finishes, until the next patch. Kcompactd also mimics
> direct compaction somewhat by trying async compaction first and sync 
> compaction
> afterwards, and uses the deferred compaction functionality.
> 

Why should it try async compaction first? The deferred compaction makes
sense as kcompact will need some sort of limitation on the amount of
CPU it can use.

> Future possible future uses for kcompactd include the ability to wake up
> kcompactd on demand in special situations, such as when hugepages are not
> available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation
> event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform
> periodic compaction with kcompactd.
> 
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
>  include/linux/compaction.h        |  16 +++
>  include/linux/mmzone.h            |   6 +
>  include/linux/vm_event_item.h     |   1 +
>  include/trace/events/compaction.h |  55 +++++++++
>  mm/compaction.c                   | 227 
> ++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c               |  15 ++-
>  mm/page_alloc.c                   |   3 +
>  mm/vmscan.c                       |   6 +
>  mm/vmstat.c                       |   1 +
>  9 files changed, 325 insertions(+), 5 deletions(-)
> 
> <SNIP>
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 585de54dbe8c..7452975fa481 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,9 @@
>  #include <linux/balloon_compaction.h>
>  #include <linux/page-isolation.h>
>  #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/module.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -29,6 +32,7 @@ static inline void count_compact_events(enum vm_event_item 
> item, long delta)
>  {
>       count_vm_events(item, delta);
>  }
> +
>  #else
>  #define count_compact_event(item) do { } while (0)
>  #define count_compact_events(item, delta) do { } while (0)

Spurious whitespace change.

> @@ -1759,4 +1763,227 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +static bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> +     return pgdat->kcompactd_max_order > 0;
> +}
> +

inline

> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
> +{
> +     int zoneid;
> +     struct zone *zone;
> +
> +     for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +             zone = &pgdat->node_zones[zoneid];
> +
> +             if (!populated_zone(zone))
> +                     continue;
> +
> +             if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
> +                                     pgdat->kcompactd_classzone_idx)
> +                                                     == COMPACT_CONTINUE)
> +                     return true;
> +     }
> +
> +     return false;
> +}
> +

Why does this traverse all zones and not just the ones within the
classzone_idx?

> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> +     /*
> +      * With no special task, compact all zones so that a page of requested
> +      * order is allocatable.
> +      */
> +     int zoneid;
> +     struct zone *zone;
> +     struct compact_control cc = {
> +             .order = pgdat->kcompactd_max_order,
> +             .classzone_idx = pgdat->kcompactd_classzone_idx,
> +             .mode = MIGRATE_ASYNC,
> +             .ignore_skip_hint = true,
> +
> +     };
> +     bool success = false;
> +
> +     trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> +                                                     cc.classzone_idx);
> +     count_vm_event(KCOMPACTD_WAKE);
> +
> +retry:
> +     for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {

Again, why is classzone_idx not taken into account?

> +             int status;
> +
> +             zone = &pgdat->node_zones[zoneid];
> +             if (!populated_zone(zone))
> +                     continue;
> +
> +             if (compaction_deferred(zone, cc.order))
> +                     continue;
> +
> +             if (compaction_suitable(zone, cc.order, 0, zoneid) !=
> +                                                     COMPACT_CONTINUE)
> +                     continue;
> +
> +             cc.nr_freepages = 0;
> +             cc.nr_migratepages = 0;
> +             cc.zone = zone;
> +             INIT_LIST_HEAD(&cc.freepages);
> +             INIT_LIST_HEAD(&cc.migratepages);
> +
> +             status = compact_zone(zone, &cc);
> +
> +             if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
> +                                             cc.classzone_idx, 0)) {
> +                     success = true;
> +                     compaction_defer_reset(zone, cc.order, false);
> +             } else if (cc.mode != MIGRATE_ASYNC &&
> +                                             status == COMPACT_COMPLETE) {
> +                     defer_compaction(zone, cc.order);
> +             }
> +
> +             VM_BUG_ON(!list_empty(&cc.freepages));
> +             VM_BUG_ON(!list_empty(&cc.migratepages));
> +     }
> +
> +     if (!success && cc.mode == MIGRATE_ASYNC) {
> +             cc.mode = MIGRATE_SYNC_LIGHT;
> +             goto retry;
> +     }
> +

Still not getting why kcompactd should concern itself with async
compaction. It's really direct compaction that cared and was trying to
avoid stalls.

> +      * Regardless of success, we are done until woken up next. But remember
> +      * the requested order/classzone_idx in case it was higher/tighter than
> +      * our current ones
> +      */
> +     if (pgdat->kcompactd_max_order <= cc.order)
> +             pgdat->kcompactd_max_order = 0;
> +     if (pgdat->classzone_idx >= cc.classzone_idx)
> +             pgdat->classzone_idx = pgdat->nr_zones - 1;
> +}
> +
>
> <SNIP>
>
> @@ -1042,7 +1043,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>       arg.nr_pages = nr_pages;
>       node_states_check_changes_online(nr_pages, zone, &arg);
>  
> -     nid = pfn_to_nid(pfn);
> +     nid = zone_to_nid(zone);
>  
>       ret = memory_notify(MEM_GOING_ONLINE, &arg);
>       ret = notifier_to_errno(ret);
> @@ -1082,7 +1083,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>       pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
>       if (onlined_pages) {
> -             node_states_set_node(zone_to_nid(zone), &arg);
> +             node_states_set_node(nid, &arg);
>               if (need_zonelists_rebuild)
>                       build_all_zonelists(NULL, NULL);
>               else

Why are these two hunks necessary?

> @@ -1093,8 +1094,10 @@ int __ref online_pages(unsigned long pfn, unsigned 
> long nr_pages, int online_typ
>  
>       init_per_zone_wmark_min();
>  
> -     if (onlined_pages)
> -             kswapd_run(zone_to_nid(zone));
> +     if (onlined_pages) {
> +             kswapd_run(nid);
> +             kcompactd_run(nid);
> +     }
>  
>       vm_total_pages = nr_free_pagecache_pages();
>  
> @@ -1858,8 +1861,10 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>               zone_pcp_update(zone);
>  
>       node_states_clear_node(node, &arg);
> -     if (arg.status_change_nid >= 0)
> +     if (arg.status_change_nid >= 0) {
>               kswapd_stop(node);
> +             kcompactd_stop(node);
> +     }
>  
>       vm_total_pages = nr_free_pagecache_pages();
>       writeback_set_ratelimit();
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 63358d9f9aa9..7747eb36e789 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5212,6 +5212,9 @@ static void __paginginit free_area_init_core(struct 
> pglist_data *pgdat)
>  #endif
>       init_waitqueue_head(&pgdat->kswapd_wait);
>       init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +#ifdef CONFIG_COMPACTION
> +     init_waitqueue_head(&pgdat->kcompactd_wait);
> +#endif
>       pgdat_page_ext_init(pgdat);
>  
>       for (j = 0; j < MAX_NR_ZONES; j++) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 72d52d3aef74..1449e21c55cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3408,6 +3408,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
> order, int classzone_idx)
>                */
>               reset_isolation_suitable(pgdat);
>  
> +             /*
> +              * We have freed the memory, now we should compact it to make
> +              * allocation of the requested order possible.
> +              */
> +             wakeup_kcompactd(pgdat, order, classzone_idx);
> +
>               if (!kthread_should_stop())
>                       schedule();
>  

This initially confused me but it's due to patch ordering. It's silly
but when this patch is applied then both kswapd and kcompactd are
compacting memory. I would prefer if the patches were in reverse order
but that is purely taste.

While this was not a comprehensive review, I think the patch is ok in
principal. While deferred compaction will keep the CPU usage under control,
the main concern is that kcompactd consumes too much CPU but I do not
see a case where that would trigger that kswapd would not have
encountered already.

-- 
Mel Gorman
SUSE Labs

Reply via email to