On 16/11/16 20:01, Vladimir Davydov wrote:
> Hello,
> 
> On Wed, Nov 16, 2016 at 10:44:59AM +1100, Balbir Singh wrote:
>> The lack of hotplug support makes us allocate all memory
>> upfront for per-node data structures. With a large number
>> of cgroups this can be an overhead. PPC64 actually limits
>> n_possible nodes to n_online to avoid some of this overhead.
>>
>> This patch adds the basic notifiers to listen to hotplug
>> events and does the allocation and freeing of those structures
>> per cgroup. We walk every cgroup per event; it's a trade-off
>> of allocating upfront vs allocating on demand and freeing
>> on offline.
>>
>> Cc: Tejun Heo <t...@kernel.org>
>> Cc: Andrew Morton <a...@linux-foundation.org>
>> Cc: Johannes Weiner <han...@cmpxchg.org>
>> Cc: Michal Hocko <mho...@kernel.org> 
>> Cc: Vladimir Davydov <vdavydov....@gmail.com>
>>
>> Signed-off-by: Balbir Singh <bsinghar...@gmail.com>
>> ---
>>  mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 60 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 91dfc7c..5585fce 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -63,6 +63,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/file.h>
>>  #include <linux/tracehook.h>
>> +#include <linux/memory.h>
>>  #include "internal.h"
>>  #include <net/sock.h>
>>  #include <net/ip.h>
>> @@ -1342,6 +1343,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>>  {
>>      return 0;
>>  }
>> +
>> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
>> +{
>> +}
>>  #endif
>>  
>>  static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>> @@ -4115,14 +4120,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>  {
>>      struct mem_cgroup_per_node *pn;
>>      int tmp = node;
>> -    /*
>> -     * This routine is called against possible nodes.
>> -     * But it's BUG to call kmalloc() against offline node.
>> -     *
>> -     * TODO: this routine can waste much memory for nodes which will
>> -     *       never be onlined. It's better to use memory hotplug callback
>> -     *       function.
>> -     */
>> +
>>      if (!node_state(node, N_NORMAL_MEMORY))
>>              tmp = -1;
>>      pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
>> @@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
>>  }
>>  __setup("cgroup.memory=", cgroup_memory);
>>  
>> +static void memcg_node_offline(int node)
>> +{
>> +    struct mem_cgroup *memcg;
>> +
>> +    if (node < 0)
>> +            return;
> 
> Is this possible?

Yes, please see node_states_check_changes_online/offline
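For reference (my paraphrase, not part of the patch), the notifier
payload defined in include/linux/memory.h looks roughly like this;
status_change_nid is left at -1 by
node_states_check_changes_online/offline when the event does not bring
a whole node online or take it offline, which is why the callbacks
have to tolerate a negative node id:

struct memory_notify {
	unsigned long start_pfn;
	unsigned long nr_pages;
	int status_change_nid_normal;
	int status_change_nid_high;
	int status_change_nid;	/* -1 unless a node changes state */
};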

> 
>> +
>> +    for_each_mem_cgroup(memcg) {
>> +            free_mem_cgroup_per_node_info(memcg, node);
>> +            mem_cgroup_may_update_nodemask(memcg);
> 
> If memcg->numainfo_events is 0, mem_cgroup_may_update_nodemask() won't
> update memcg->scan_nodes. Is it OK?
> 
>> +    }
> 
> What if a memory cgroup is created or destroyed while you're walking the
> tree? Should we probably use get_online_mems() in mem_cgroup_alloc() to
> avoid that?
> 

The iterator internally takes rcu_read_lock(), which avoids any
side effects from cgroups being added or removed during the walk. I
suspect you are also suggesting using get_online_mems() around each
call to for_each_online_node.

My understanding so far is:

1. invalidate_reclaim_iterators - should be safe (no bad side effects)
2. mem_cgroup_free - should be safe as well
3. mem_cgroup_alloc - needs protection (rough sketch below)
4. mem_cgroup_init - needs protection
5. mem_cgroup_remove_from_trees - should be safe
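For (3), something along these lines is what I have in mind - a rough
sketch only, with the id and per-cpu stat setup elided, assuming we
switch the walk to online nodes and pin them with get_online_mems():

static struct mem_cgroup *mem_cgroup_alloc(void)
{
	struct mem_cgroup *memcg;
	int node;

	memcg = kzalloc(sizeof(*memcg), GFP_KERNEL);
	if (!memcg)
		return NULL;

	/* ... id and per-cpu stat allocation elided ... */

	/*
	 * Pin memory hotplug so a node cannot come or go while we set
	 * up the per-node info; nodes onlined later are handled by the
	 * hotplug notifier.
	 */
	get_online_mems();
	for_each_online_node(node)
		if (alloc_mem_cgroup_per_node_info(memcg, node))
			goto fail;
	put_online_mems();

	/* ... remaining initialisation elided ... */
	return memcg;
fail:
	put_online_mems();
	mem_cgroup_free(memcg);
	return NULL;
}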

>> +}
>> +
>> +static void memcg_node_online(int node)
>> +{
>> +    struct mem_cgroup *memcg;
>> +
>> +    if (node < 0)
>> +            return;
>> +
>> +    for_each_mem_cgroup(memcg) {
>> +            alloc_mem_cgroup_per_node_info(memcg, node);
>> +            mem_cgroup_may_update_nodemask(memcg);
>> +    }
>> +}
>> +
>> +static int memcg_memory_hotplug_callback(struct notifier_block *self,
>> +                                    unsigned long action, void *arg)
>> +{
>> +    struct memory_notify *marg = arg;
>> +    int node = marg->status_change_nid;
>> +
>> +    switch (action) {
>> +    case MEM_GOING_OFFLINE:
>> +    case MEM_CANCEL_ONLINE:
>> +            memcg_node_offline(node);
> 
> Judging by __offline_pages(), the MEM_GOING_OFFLINE event is emitted
> before migrating pages off the node. So, I guess freeing per-node info
> here isn't quite correct, as pages still need it to be moved from the
> node's LRU lists. Better move it to MEM_OFFLINE?
> 

Good point, will redo
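Something like the below is what I'll try - a rough sketch, assuming
the free is deferred until MEM_OFFLINE (when the node's pages have
already been migrated away) and MEM_CANCEL_ONLINE just backs out the
allocation done at MEM_GOING_ONLINE:

	switch (action) {
	case MEM_GOING_ONLINE:
		memcg_node_online(node);
		break;
	case MEM_CANCEL_ONLINE:	/* undo MEM_GOING_ONLINE */
	case MEM_OFFLINE:	/* node is empty now, safe to free */
		memcg_node_offline(node);
		break;
	default:
		/* MEM_GOING_OFFLINE, MEM_CANCEL_OFFLINE: nothing to do */
		break;
	}
	return NOTIFY_OK;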

>> +            break;
>> +    case MEM_GOING_ONLINE:
>> +    case MEM_CANCEL_OFFLINE:
>> +            memcg_node_online(node);
>> +            break;
>> +    case MEM_ONLINE:
>> +    case MEM_OFFLINE:
>> +            break;
>> +    }
>> +    return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
>> +    .notifier_call = memcg_memory_hotplug_callback,
>> +    .priority = IPC_CALLBACK_PRI,
> 
> I wonder why you chose this priority?
> 

I just chose the lowest priority

>> +};
>> +
>>  /*
>>   * subsys_initcall() for memory controller.
>>   *
>> @@ -5797,6 +5848,7 @@ static int __init mem_cgroup_init(void)
>>  #endif
>>  
>>      hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>> +    register_hotmemory_notifier(&memcg_memory_hotplug_nb);
>>  
>>      for_each_possible_cpu(cpu)
>>              INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> 
> I guess, we should modify mem_cgroup_alloc/free() in the scope of this
> patch, otherwise it doesn't make much sense IMHO. May be, it's even
> worth merging patches 1 and 2 altogether.
> 


Thanks for the review, I'll revisit the organization of the patches.


Balbir Singh
