Hi,

> On 30.03.21 at 9:24, Wang Yugui wrote:
> > Hi, Nikolay Borisov
> > 
> > With a lot of dump_stack()/printk calls inserted around the ENOMEM
> > paths in the btrfs code, we found the call stack that produces the
> > ENOMEM; see the file 0000-btrfs-dump_stack-when-ENOMEM.patch.
> > 
> > 
> > #cat /usr/hpc-bio/xfstests/results//generic/476.dmesg
> > ...
> > [ 5759.102929] ENOMEM btrfs_drew_lock_init
> > [ 5759.102943] ENOMEM btrfs_init_fs_root
> > [ 5759.102947] ------------[ cut here ]------------
> > [ 5759.102950] BTRFS: Transaction aborted (error -12)
> > [ 5759.103052] WARNING: CPU: 14 PID: 2741468 at /ssd/hpc-bio/linux-5.10.27/fs/btrfs/transaction.c:1705 create_pending_snapshot+0xb8c/0xd50 [btrfs]
> > ...
> > 
> > 
> > btrfs_drew_lock_init() returns -ENOMEM; this is the call site:
> > 
> >     /*
> >      * We might be called under a transaction (e.g. indirect backref
> >      * resolution) which could deadlock if it triggers memory reclaim
> >      */
> >     nofs_flag = memalloc_nofs_save();
> >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> >     memalloc_nofs_restore(nofs_flag);
> >     if (ret == -ENOMEM)
> >         printk("ENOMEM btrfs_drew_lock_init\n");
> >     if (ret)
> >         goto fail;
> > 
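For reference, the only thing in btrfs_drew_lock_init() that can
actually fail is the percpu counter allocation, so the -ENOMEM must be
coming out of percpu_counter_init(). An abridged sketch of the function
as it looks in 5.10 (fs/btrfs/locking.c):

    /*
     * Abridged from fs/btrfs/locking.c (5.10). The percpu counter
     * allocation is the only failure path; everything after it just
     * initializes fields and cannot fail.
     */
    int btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
    {
        int ret;

        ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
        if (ret)
            return ret;

        atomic_set(&lock->readers, 0);
        init_waitqueue_head(&lock->pending_readers);
        init_waitqueue_head(&lock->pending_writers);

        return 0;
    }

Note that the GFP_KERNEL here is effectively demoted to GFP_NOFS by the
memalloc_nofs_save() scope at the call site above.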
> > And this code comes from:
> > 
> > commit dcc3eb9638c3c927f1597075e851d0a16300a876
> > Author: Nikolay Borisov <nbori...@suse.com>
> > Date:   Thu Jan 30 14:59:45 2020 +0200
> > 
> >     btrfs: convert snapshot/nocow exlcusion to drew lock
> > 
> > 
> > Any advice to fix this ENOMEM problem?
> 
> This is likely coming from changed behavior in MM and doesn't seem
> related to btrfs. We have multiple places where memalloc_nofs_save()
> is called, and by the same token the failure might have occurred in
> any other piece of code that uses memalloc_nofs_save(). There is no
> indication that this is directly related to btrfs.
> 
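For reference, the NOFS scope itself only sets a per-task flag; the
behavior change happens inside the page allocator, which masks
__GFP_FS off every allocation made while the flag is set, so reclaim
cannot recurse into filesystems. An abridged sketch of the scope
helpers from include/linux/sched/mm.h:

    /*
     * Abridged from include/linux/sched/mm.h. While PF_MEMALLOC_NOFS
     * is set, the allocator strips __GFP_FS from every allocation in
     * the scope, which narrows reclaim's options and makes a transient
     * -ENOMEM more likely under memory pressure.
     */
    static inline unsigned int memalloc_nofs_save(void)
    {
        unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

        current->flags |= PF_MEMALLOC_NOFS;
        return flags;
    }

    static inline void memalloc_nofs_restore(unsigned int flags)
    {
        current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
    }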
> > 
> > The top command shows that this server has enough memory.
> > 
> > The hardware of this server:
> > CPU:     Xeon(R) E5-2660 v2 (10 cores) * 2
> > memory:  192G, no swap
> 
> You are showing that the server has 192G of installed memory, but you
> have not shown any stats that prove what state the MM subsystem was in
> at the time of failure. At the very least, at the time of failure
> inspect the output of:
> 
> cat /proc/meminfo
> 
> and of the "free -m" command.
> 
> <snip>

Only one xfstests job is running on this server.
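To capture the MM state at the exact moment of failure, I can run a
small sampler next to the test and check its last lines after the
abort. A minimal user-space sketch (the sampled fields and the 1s
interval are arbitrary choices for illustration, not anything the test
requires):

    /*
     * Minimal sketch: periodically print a few /proc/meminfo fields so
     * the memory state around the failure can be reconstructed later.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");

            if (!f) {
                perror("/proc/meminfo");
                return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "MemFree:", 8) ||
                    !strncmp(line, "MemAvailable:", 13) ||
                    !strncmp(line, "Slab:", 5))
                    fputs(line, stdout);
            }
            fclose(f);
            fflush(stdout);
            sleep(1);
        }
    }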


Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/03/30

