Hello Mike,

As described, I applied all of the settings.

Unfortunately, it crashed even with these settings :-(
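
For completeness, these are the commands now in effect (the three sysctls
from your last mail). To keep them across reboots, something like the
following should work; the file name under /etc/sysctl.d is just my choice:

sysctl vm.dirty_background_ratio=0
sysctl vm.dirty_ratio=0
sysctl vm.vfs_cache_pressure=0

# persist the settings so they survive the next reboot
cat <<'EOF' > /etc/sysctl.d/99-rbd-nbd-workaround.conf
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
vm.vfs_cache_pressure = 0
EOF
sysctl --system   # reload all sysctl configuration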

Regards
Marc

[Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
[Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
[Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to <int-nfs-001>.
[Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice system-serial\x2dgetty.slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File System...
[Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport module.
[Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush runtime journal from PID 1
[Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 o...@monad.swb.de).
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: File /var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or uncleanly shut down, renaming and replacing.
[Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.659:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.675:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.687:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(1568111166.723:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] new mount options do not match the existing superblock, will be ignored
[Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, no debug enabled
[Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
[Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[Tue Sep 10 12:26:12 2019] NFSD: starting 90-second grace period (net f00000f8)
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] block nbd0: shutting down sockets
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 89228968 flags 3
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 121872688 flags 3
[Tue Sep 10 14:45:04 2019] print_req_error: I/O error, dev nbd0, sector 123189160 flags 3
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967168 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 4294967280 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 0 flags 0
[Tue Sep 10 14:45:52 2019] print_req_error: I/O error, dev nbd0, sector 8 flags 0
[Tue Sep 10 14:46:51 2019] INFO: task xfsaild/nbd0:1919 blocked for more than 120 seconds.
[Tue Sep 10 14:46:51 2019]       Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
[Tue Sep 10 14:46:51 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 10 14:46:51 2019] xfsaild/nbd0    D    0  1919      2 0x80000000
[Tue Sep 10 14:46:51 2019] Call Trace:
[Tue Sep 10 14:46:51 2019]  __schedule+0x2bd/0x850
[Tue Sep 10 14:46:51 2019]  ? try_to_del_timer_sync+0x53/0x80
[Tue Sep 10 14:46:51 2019]  schedule+0x2c/0x70
[Tue Sep 10 14:46:51 2019]  xfs_log_force+0x15f/0x2e0 [xfs]
[Tue Sep 10 14:46:51 2019]  ? wake_up_q+0x80/0x80
[Tue Sep 10 14:46:51 2019]  xfsaild+0x17b/0x800 [xfs]
[Tue Sep 10 14:46:51 2019]  ? __schedule+0x2c5/0x850
[Tue Sep 10 14:46:51 2019]  kthread+0x121/0x140
[Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Tue Sep 10 14:46:51 2019]  ? kthread+0x121/0x140
[Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[Tue Sep 10 14:46:51 2019]  ? kthread_park+0x90/0x90
[Tue Sep 10 14:46:51 2019]  ret_from_fork+0x35/0x40
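
If it hangs again, I will also try to capture the state of all blocked
tasks; a minimal sketch, assuming the magic sysrq interface is enabled
on this kernel:

# check whether sysrq functions are available (1 = all enabled)
cat /proc/sys/kernel/sysrq
# dump all uninterruptible (D-state) tasks to the kernel log,
# analogous to the xfsaild/nbd0 trace above
echo w > /proc/sysrq-trigger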

On 10.09.19 12:51, Marc Schöchlin wrote:
> Hello Mike,
>
> Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0")
> and got another crash last night :-(
>
> I have now restarted the system and invoked all of the following commands
> mentioned in your last mail:
>
> sysctl vm.dirty_background_ratio=0
> sysctl vm.dirty_ratio=0
> sysctl vm.vfs_cache_pressure=0
>
> Let's see if that helps....
>
> Regards
>
> Marc
>
>
> On 03.09.19 04:41, Mike Christie wrote:
>> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> I am having a quick look at this while on vacation because my coworker
>>> reports daily, continuous crashes ;-)
>>> Any updates here (I am aware that this is not very easy to fix)?
>> I am still working on it. It basically requires rbd-nbd to be rewritten
>> so that it preallocates the memory it uses for IO, and where it can't,
>> like when doing network IO, it requires adding an interface to tell the
>> kernel not to use allocation flags that can cause disk IO back onto the
>> device.
>>
>> There are some workarounds, like adding more memory and setting the vm
>> values. For the latter, it seems that if you set:
>>
>> vm.dirty_background_ratio = 0, then it looks like the problem is avoided
>> because the kernel will immediately start to write dirty pages from the
>> background worker threads, so we do not end up later needing to write
>> out pages from the rbd-nbd thread to free up memory.
>>
>> or
>>
>> vm.dirty_ratio = 0, then it looks like the problem is avoided because
>> the kernel will just write out the data right away, similar to the above,
>> but it is normally going to be written out from the thread that you are
>> running your test from.
>>
>> and this seems optional and can result in other problems:
>>
>> vm.vfs_cache_pressure = 0, then at least for XFS it looks like we avoid
>> one of the immediate problems where allocations would always cause the
>> inode caches to be reclaimed and that memory to be written out to the
>> device. For EXT4, I did not see a similar issue.
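
As a quick cross-check on my side, the values actually in effect can be
listed in a single sysctl call:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.vfs_cache_pressure
# expected output after applying the workaround:
# vm.dirty_background_ratio = 0
# vm.dirty_ratio = 0
# vm.vfs_cache_pressure = 0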
>>
>>> I think the severity of this problem
>>> <https://tracker.ceph.com/issues/40822> (currently "minor") does not
>>> match the consequences of this problem.
>>>
>>> This reproducible problem can cause:
>>>
>>>   * random service outage
>>>   * data corruption
>>>   * long recovery procedures on huge filesystems
>>>
>>> Is it adequate to increase the severity to major or critical?
>>>
>>> What might be the reason that rbd-nbd runs very reliably on my Xen
>>> servers as a storage repository?
>>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 -
>>> hundreds of devices, high workload)
>>>
>>> Regards
>>> Marc
>>>
>>> On 15.08.19 20:07, Marc Schöchlin wrote:
>>>> Hello Mike,
>>>>
>>>> On 15.08.19 19:57, Mike Christie wrote:
>>>>>> Don't waste your time. I found a way to replicate it now.
>>>>>>
>>>>> Just a quick update.
>>>>>
>>>>> Looks like we are trying to allocate memory in the IO path in a way that
>>>>> can swing back on us, so we can end up locking up. You are probably not
>>>>> hitting this with krbd in your setup because normally it's preallocating
>>>>> structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot
>>>>> preallocate some structs and cannot control the allocation flags for
>>>>> some operations initiated from userspace, so it's possible to hit this
>>>>> on every IO. I can replicate this now in a second just by doing a cp -r.
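
My attempt to reconstruct that reproducer, for my own notes (the image
name and mount point are made up, not from Mike's mail):

rbd-nbd map rbd/testimage      # prints the nbd device, e.g. /dev/nbd0
mkfs.xfs /dev/nbd0
mount /dev/nbd0 /mnt/test
cp -r /usr /mnt/test/          # large copy to build up dirty page cache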
>>>>>
>>>>> It's not going to be a simple fix. We have had a similar issue for
>>>>> storage daemons like iscsid and multipathd since they were created. It's
>>>>> less likely to hit with them because you only hit the paths where they
>>>>> cannot control memory allocation behavior during recovery.
>>>>>
>>>>> I am looking into some things now.
>>>> Great to hear that the problem is now identified.
>>>>
>>>> As described, I'm on vacation; if you need anything after September 8th,
>>>> we can probably invest some time to test upcoming fixes.
>>>>
>>>> Regards
>>>> Marc
>>>>
>>>>
>>> -- 
>>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net
>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
