Hi,

This patch works.

 From:    Dennis Zhou <den...@kernel.org>
 To:      Wang Yugui <wangyu...@e16-tech.com>
 Cc:      Vlastimil Babka <vba...@suse.cz>,
          linux...@kvack.org,
          linux-btrfs@vger.kernel.org
 Date:    Thu, 8 Apr 2021 13:48:33 +0000
 Subject: Re: unexpected -ENOMEM from percpu_counter_init()
----

Ah. Can you try the following patch?
https://lore.kernel.org/lkml/20210408035736.883861-4-g...@fb.com/


Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/04/13

> On 30.03.2021 09:16 Wang Yugui wrote:
> > H,
> >
> >> On 30.03.21 г. 9:24, Wang Yugui wrote:
> >>> Hi, Nikolay Borisov
> >>>
> >>> With a lot of dump_stack()/printk inserted around ENOMEM in btrfs code,
> >>> we find out the call stack for ENOMEM.
> >>> see the file 0000-btrfs-dump_stack-when-ENOMEM.patch
> >>>
> >>>
> >>> #cat /usr/hpc-bio/xfstests/results//generic/476.dmesg
> >>> ...
> >>> [ 5759.102929] ENOMEM btrfs_drew_lock_init
> >>> [ 5759.102943] ENOMEM btrfs_init_fs_root
> >>> [ 5759.102947] ------------[ cut here ]------------
> >>> [ 5759.102950] BTRFS: Transaction aborted (error -12)
> >>> [ 5759.103052] WARNING: CPU: 14 PID: 2741468 at 
> >>> /ssd/hpc-bio/linux-5.10.27/fs/btrfs/transaction.c:1705 
> >>> create_pending_snapshot+0xb8c/0xd50 [btrfs]
> >>> ...
> >>>
> >>>
> >>> btrfs_drew_lock_init() return -ENOMEM,
> >>> this is the source:
> >>>
> >>>      /*
> >>>       * We might be called under a transaction (e.g. indirect backref
> >>>       * resolution) which could deadlock if it triggers memory reclaim
> >>>       */
> >>>      nofs_flag = memalloc_nofs_save();
> >>>      ret = btrfs_drew_lock_init(&root->snapshot_lock);
> >>>      memalloc_nofs_restore(nofs_flag);
> >>>      if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
> >>>      if (ret)
> >>>          goto fail;
> >>>
> >>> And the souce come from:
> >>>
> >>> commit dcc3eb9638c3c927f1597075e851d0a16300a876
> >>> Author: Nikolay Borisov <nbori...@suse.com>
> >>> Date:   Thu Jan 30 14:59:45 2020 +0200
> >>>
> >>>      btrfs: convert snapshot/nocow exlcusion to drew lock
> >>>
> >>>
> >>> Any advice to fix this ENOMEM problem?
> >> This is likely coming from changed behavior in MM, doesn't seem related
> >> to btrfs. We have multiple places where nofs_save() is called. By the
> >> same token the failure might have occurred in any other place, in any
> >> other piece of code which uses memalloc_nofs_save, there is no
> >> indication that this is directly related to btrfs.
> >>
> >>> top command show that this server have engough memory.
> >>>
> >>> The hardware of this server:
> >>> CPU:  Xeon(R) CPU E5-2660 v2(10 core)  *2
> >>> memory:  192G, no swap
> >> You are showing that the server has 192G of installed memory, you have
> >> not shown any stats which prove at the time of failure what is the state
> >> of the MM subsystem. At the very least at the time of failure inspect
> >> the output of :
> >>
> >> cat /proc/meminfo
> >>
> >> and "free -m" commands.
> >>
> >> <snip>
> > Only one xfstest job is running in this server.
> 
> Had what looks like the same issue happinging on a server:
> 
> [19146.391015] ------------[ cut here ]------------
> [19146.391017] BTRFS: Transaction aborted (error -12)
> [19146.391035] WARNING: CPU: 13 PID: 1825871 at fs/btrfs/transaction.c:1684 
> create_pending_snapshot+0x912/0xd10
> [19146.391036] Modules linked in: bcache crc64 loop dm_crypt bfq xfs dm_mod 
> st sr_mod cdrom intel_powerclamp coretemp dcdbas kvm_intel snd_pcm snd_timer 
> kvm snd irqbypass soundcore mgag200 serio_raw pcspkr drm_kms_helper evdev 
> joydev iTCO_wdt iTCO_vendor_support i2c_algo_bit i7core_edac sg ipmi_si 
> ipmi_devintf ipmi_msghandler wmi acpi_power_meter button ib_iser rdma_cm 
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm 
> configfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov 
> async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod 
> sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel 
> ghash_clmulni_intel aesni_intel crypto_simd ahci cryptd glue_helper mpt3sas 
> libahci uhci_hcd ehci_pci psmouse ehci_hcd lpc_ich raid_class libata nvme 
> scsi_transport_sas mfd_core usbcore nvme_core scsi_mod t10_pi bnx2
> [19146.391092] CPU: 13 PID: 1825871 Comm: btrfs Tainted: G W I?????? 5.10.26 
> #1
> [19146.391093] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.14.0 
> 05/30/2018
> [19146.391095] RIP: 0010:create_pending_snapshot+0x912/0xd10
> [19146.391097] Code: 48 0f ba aa 40 0a 00 00 02 72 28 83 f8 fb 74 48 83 f8 e2 
> 74 43 89 c6 48 c7 c7 70 2d 10 82 48 89 85 78 ff ff ff e8 d5 65 55 00 <0f> 0b 
> 48 8b 85 78 ff ff ff 89 c1 ba 94 06 00 00 48 c7 c6 70 46 e4
> [19146.391098] RSP: 0018:ffffc900201c3b00 EFLAGS: 00010286
> [19146.391099] RAX: 0000000000000000 RBX: ffff8881ba393200 RCX: 
> ffff88880fb98b88
> [19146.391100] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: 
> ffff88880fb98b80
> [19146.391101] RBP: ffffc900201c3bd0 R08: ffffffff825e2148 R09: 
> 0000000000027ffb
> [19146.391101] R10: 00000000ffff8000 R11: 3fffffffffffffff R12: 
> ffff888119dd39c0
> [19146.391102] R13: ffff888248c36800 R14: ffff888a1bf69800 R15: 
> 00000000fffffff4
> [19146.391103] FS:? 00007f1d7c9488c0(0000) GS:ffff88880fb80000(0000) 
> knlGS:0000000000000000
> [19146.391104] CS:? 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [19146.391105] CR2: 00007fffef58d000 CR3: 000000028c988004 CR4: 
> 00000000000206e0
> [19146.391106] Call Trace:
> [19146.391111]? ? create_pending_snapshots+0xa2/0xc0
> [19146.391112]? create_pending_snapshots+0xa2/0xc0
> [19146.391114]? btrfs_commit_transaction+0x4b9/0xb40
> [19146.391116]? ? start_transaction+0xd2/0x580
> [19146.391119]? btrfs_mksubvol+0x29e/0x450
> [19146.391122]? btrfs_mksnapshot+0x7b/0xb0
> [19146.391124]? __btrfs_ioctl_snap_create+0x16f/0x180
> [19146.391126]? btrfs_ioctl_snap_create_v2+0xb3/0x130
> [19146.391128]? btrfs_ioctl+0x15f/0x3040
> [19146.391131]? ? __x64_sys_ioctl+0x83/0xb0
> [19146.391132]? __x64_sys_ioctl+0x83/0xb0
> [19146.391136]? do_syscall_64+0x33/0x80
> [19146.391140]? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [19146.391142] RIP: 0033:0x7f1d7ca3fcc7
> [19146.391144] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 
> c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 
> 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
> [19146.391145] RSP: 002b:00007fffef5919b8 EFLAGS: 00000246 ORIG_RAX: 
> 0000000000000010
> [19146.391146] RAX: ffffffffffffffda RBX: 00007fffef5a4c78 RCX: 
> 00007f1d7ca3fcc7
> [19146.391147] RDX: 00007fffef5919d0 RSI: 0000000050009417 RDI: 
> 0000000000000004
> [19146.391148] RBP: 0000565259b4d910 R08: ffffffffffffffff R09: 
> 0000565259b4d9e0
> [19146.391148] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000565259b4b8d0
> [19146.391149] R13: 000000000000000e R14: 0000565259b4d9e0 R15: 
> 00007fffef5919d0
> [19146.391151] ---[ end trace 3d3ae6fb9d3c0b49 ]---
> [19146.391153] BTRFS: error (device sdo2) in create_pending_snapshot:1684: 
> errno=-12 Out of memory
> [19146.391187] BTRFS info (device sdo2): forced readonly
> [19146.391190] BTRFS warning (device sdo2): Skipping commit of aborted 
> transaction.
> [19146.391191] BTRFS: error (device sdo2) in cleanup_transaction:1942: 
> errno=-12 Out of memory
> [44395.445834] BTRFS error (device sdo2): parent transid verify failed on 
> 280438898688 wanted 1423523 found 1423519
> [44395.448248] BTRFS error (device sdo2): parent transid verify failed on 
> 280438898688 wanted 1423523 found 1423519
> [44395.448512] BTRFS error (device sdo2): parent transid verify failed on 
> 280438898688 wanted 1423523 found 1423519
> [44395.455324] BTRFS error (device sdo2): parent transid verify failed on 
> 280438898688 wanted 1423523 found 1423519
> 


Reply via email to