On Wed, Jun 15, 2016 at 05:18:00PM +0200, Vlastimil Babka wrote:
> >@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
> >start)
> >
> >             arch_refresh_nodedata(nid, pgdat);
> >     } else {
> >-            /* Reset the nr_zones and classzone_idx to 0 before reuse */
> >+            /* Reset the nr_zones, order and classzone_idx before reuse */
> >             pgdat->nr_zones = 0;
> >-            pgdat->classzone_idx = 0;
> >+            pgdat->kswapd_order = 0;
> >+            pgdat->kswapd_classzone_idx = -1;
> >     }
> >
> >     /* we can use NODE_DATA(nid) from here */
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 4ce578b969da..d8cb483d5cad 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -6036,7 +6036,7 @@ void __paginginit free_area_init_node(int nid, 
> >unsigned long *zones_size,
> >     unsigned long end_pfn = 0;
> >
> >     /* pg_data_t should be reset to zero when it's allocated */
> >-    WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
> >+    WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
> 
> Above you changed the reset value of kswapd_classzone_idx from 0 to -1, so
> won't this trigger? Also should we check kswapd_order that's newly reset
> too?
> 

Good spot. The memory initialisation paths are ok but node memory hotplug
was broken.

> >
> >     reset_deferred_meminit(pgdat);
> >     pgdat->node_id = nid;
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 96bf841f9352..14b34eebedff 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -2727,7 +2727,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >
> >     /* kswapd must be awake if processes are being throttled */
> >     if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> >-            pgdat->classzone_idx = min(pgdat->classzone_idx,
> >+            pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
> >                                             (enum zone_type)ZONE_NORMAL);
> >             wake_up_interruptible(&pgdat->kswapd_wait);
> >     }
> >@@ -3211,6 +3211,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
> >order,
> >
> >     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >
> >+    /* If kswapd has not been woken recently, then full sleep */
> >+    if (classzone_idx == -1) {
> >+            classzone_idx = balanced_classzone_idx = MAX_NR_ZONES - 1;
> >+            goto full_sleep;
> 
> This will skip the wakeup_kcompactd() part.
> 

I wrestled with this one. I decided to leave it alone on the grounds
that if kswapd has not been woken recently then compaction efforts also
have not failed and kcompactd is not required.

> >@@ -3311,38 +3316,25 @@ static int kswapd(void *p)
> >     tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >     set_freezable();
> >
> >-    order = new_order = 0;
> >-    classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >-    balanced_classzone_idx = classzone_idx;
> >+    pgdat->kswapd_order = order = 0;
> >+    pgdat->kswapd_classzone_idx = classzone_idx = -1;
> >     for ( ; ; ) {
> >             bool ret;
> >
> >+kswapd_try_sleep:
> >+            kswapd_try_to_sleep(pgdat, order, classzone_idx, classzone_idx);
> 
> The last two parameters are now the same, remove one?
> 

Yes. A few more basic simplifications are then possible.

> >@@ -3352,12 +3344,19 @@ static int kswapd(void *p)
> >              * We can speed up thawing tasks if we don't call balance_pgdat
> >              * after returning from the refrigerator
> >              */
> >-            if (!ret) {
> >-                    trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >+            if (ret)
> >+                    continue;
> >
> >-                    /* return value ignored until next patch */
> >-                    balance_pgdat(pgdat, order, classzone_idx);
> >-            }
> >+            /*
> >+             * Try reclaim the requested order but if that fails
> >+             * then try sleeping on the basis of the order reclaimed.
> 
> Is the last word really meant to be "reclaimed", or "requested"?
> 

No, I really meant reclaimed.

> >+             */
> >+            trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >+            if (balance_pgdat(pgdat, order, classzone_idx) < order)
> >+                    goto kswapd_try_sleep;
> 
> AFAICS now kswapd_try_to_sleep() will use the "requested" order. That's
> needed for proper wakeup_kcompactd(), but won't it prevent kswapd from
> actually going to sleep, because zone_balanced() in prepare-sleep will be
> false? So I think you need to give it both orders to do the right thing?
> 

You're right. There is a risk that kswapd stays awake longer in high
fragmentation scenarios.

Should be fixed now by passing in both orders.

-- 
Mel Gorman
SUSE Labs

Reply via email to