Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-20 Thread Marc Schöchlin
Hello Mike and Jason,

As described in my last mail, I converted the filesystem to ext4, set "sysctl 
vm.dirty_background_ratio=0", and put the regular workload on the filesystem 
(used as an NFS mount).
That seems to have prevented crashes for an entire week now (before this, the nbd 
device crashed after hours to roughly one day).

XFS on top of nbd devices really seems to add additional instability.

The current workaround causes very high CPU load (40-50 on a 4-CPU virtual 
system) and up to ~95% iowait when a single client puts a 20 GB file on that 
volume.
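
For reference, this is roughly the workaround as it is applied here (a minimal
sketch; the pool/image name, client id and mount point are just the ones from my
setup and will differ elsewhere):

# map the image and mount the converted ext4 filesystem
rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph --id nfs
mount /dev/nbd0 /srv_ec

# keep the kernel from building up dirty pages for the nbd device
sysctl vm.dirty_background_ratio=0

# re-export the mount via the NFS kernel server (assumes an /etc/exports entry)
exportfs -a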

What is the current state of fixing this problem?
Can we support you by running tests with custom kernel or rbd-nbd builds?

Regards
Marc

On 13.09.19 at 14:15, Marc Schöchlin wrote:
>>> Nevertheless I will try EXT4 on another system.
> I converted the filesystem to ext4.
>
> I completely deleted the entire rbd ec image and its snapshots (3) and 
> recreated it.
> After mapping and mounting, I executed the following command:
>
> sysctl vm.dirty_background_ratio=0
>
> Let's see what we get now
>
> Regards
> Marc
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-13 Thread Marc Schöchlin
Hello Jason,


On 12.09.19 at 16:56, Jason Dillaman wrote:
> On Thu, Sep 12, 2019 at 3:31 AM Marc Schöchlin  wrote:
>
> What's that - have we seen that before? ("Numerical argument out of domain")
> It's the error that rbd-nbd prints when the kernel prematurely closes
> the socket ... and as we have already discussed, it's closing the
> socket due to the IO timeout being hit ... and it's hitting the IO
> timeout due to a deadlock caused by memory pressure from rbd-nbd forcing
> IO to be pushed from the XFS cache back down into rbd-nbd.
Okay.
>
>> I can try that, but I am skeptical; I am not sure that we are searching in 
>> the right place...
>>
>> Why?
>> - we have been running hundreds of heavily used rbd-nbd instances in our xen 
>> dom-0 systems for 1.5 years now
>> - we never experienced problems like that in xen dom0 systems
>> - as described these instances run 12.2.5 ceph components with kernel 
>> 4.4.0+10
>> - the domUs (virtual machines) that interact heavily with that dom0 are 
>> using various filesystems
>>-> probably the architecture of the blktap components leads to a different 
>> io scenario : https://wiki.xenproject.org/wiki/Blktap
> Are you running an XFS (or any) file system on top of the NBD block
> device in dom0? I suspect you are just passing raw block devices to
> the VMs and therefore they cannot see the same IO back pressure
> feedback loop.

No, we do not directly use a filesystem on the nbd device in dom0.

Our scenario is:
The xen dom0 maps the NBD devices and connects them via tapdisk to the 
blktap/blkback infrastructure 
(https://wiki.xenproject.org/wiki/File:Blktap$blktap_diagram_differentSymbols.png,
 you can ignore the upper right quadrant of the diagram - tapdisk just maps the 
nbd device).
The blktap/blkback infrastructure in the xen dom0 uses the device channel 
(a shared memory ring) to communicate with the VM (domU) via the blkfront 
infrastructure, and vice versa.
The device is exposed as a /dev/xvd device. These devices are used by our 
virtualized systems as raw disk devices (using partitions) or for LVM.

I do not know the xen internals, but I suppose that this usage scenario leads 
to homogeneous IO request sizes, because it seems to be difficult to implement 
a ring list using shared memory - probably a situation which reduces the 
probability of rbd-nbd crashes dramatically.
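
A simple way to check this assumption would be to compare the average IO request
sizes on both sides with iostat from the sysstat package (just a sketch; the
device name is an example and the request size column name depends on the
sysstat version):

# run this on a xen dom0 and on the crashing NFS host and compare the
# average request sizes reported for the nbd device
iostat -x nbd0 5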

>> Nevertheless I will try EXT4 on another system.

I converted the filesystem to ext4.

I completely deleted the entire rbd ec image and its snapshots (3) and 
recreated it.
After mapping and mounting, I executed the following command:

sysctl vm.dirty_background_ratio=0

Let's see what we get now

Regards
Marc


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-12 Thread Jason Dillaman
On Thu, Sep 12, 2019 at 3:31 AM Marc Schöchlin  wrote:
>
> Hello Jason,
>
> Yesterday I started rbd-nbd in foreground mode to see if there is any 
> additional information.
>
> root@int-nfs-001:/etc/ceph# rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph -d --id 
> nfs
> 2019-09-11 13:07:41.444534 77fe1040  0 ceph version 12.2.12 
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process 
> rbd-nbd, pid 14735
> 2019-09-11 13:07:41.444555 77fe1040  0 pidfile_write: ignore empty 
> --pid-file
> /dev/nbd0
> -
>
>
> 2019-09-11 21:31:03.126223 7fffc3fff700 -1 rbd-nbd: failed to read nbd 
> request header: (33) Numerical argument out of domain
>
> What's that - have we seen that before? ("Numerical argument out of domain")

It's the error that rbd-nbd prints when the kernel prematurely closes
the socket ... and as we have already discussed, it's closing the
socket due to the IO timeout being hit ... and it's hitting the IO
timeout due to a deadlock caused by memory pressure from rbd-nbd forcing
IO to be pushed from the XFS cache back down into rbd-nbd.
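
If you want to see whether that memory pressure is actually building up while
your test runs, watching the dirty/writeback counters should show it (a minimal
sketch using standard procfs counters):

# refresh the dirty page / writeback / free memory counters every second
watch -n 1 'grep -E "^(Dirty|Writeback|MemFree|MemAvailable):" /proc/meminfo'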

> On 10.09.19 at 16:10, Jason Dillaman wrote:
> > [Tue Sep 10 14:46:51 2019]  ? __schedule+0x2c5/0x850
> > [Tue Sep 10 14:46:51 2019]  kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019]  ? kthread+0x121/0x140
> > [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [Tue Sep 10 14:46:51 2019]  ? kthread_park+0x90/0x90
> > [Tue Sep 10 14:46:51 2019]  ret_from_fork+0x35/0x40
> > Perhaps try it w/ ext4 instead of XFS?
>
> I can try that, but I am skeptical; I am not sure that we are searching in 
> the right place...
>
> Why?
> - we have been running hundreds of heavily used rbd-nbd instances in our xen 
> dom-0 systems for 1.5 years now
> - we never experienced problems like that in xen dom0 systems
> - as described these instances run 12.2.5 ceph components with kernel 4.4.0+10
> - the domUs (virtual machines) that interact heavily with that dom0 are 
> using various filesystems
>-> probably the architecture of the blktap components leads to a different 
> io scenario : https://wiki.xenproject.org/wiki/Blktap

Are you running an XFS (or any) file system on top of the NBD block
device in dom0? I suspect you are just passing raw block devices to
the VMs and therefore they cannot see the same IO back pressure
feedback loop.

> Nevertheless I will try EXT4 on another system.
>
> Regards
> Marc
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-12 Thread Marc Schöchlin
Hello Jason,

Yesterday I started rbd-nbd in foreground mode to see if there is any 
additional information.

root@int-nfs-001:/etc/ceph# rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph -d --id nfs
2019-09-11 13:07:41.444534 77fe1040  0 ceph version 12.2.12 
(1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process rbd-nbd, 
pid 14735
2019-09-11 13:07:41.444555 77fe1040  0 pidfile_write: ignore empty 
--pid-file
/dev/nbd0
-


2019-09-11 21:31:03.126223 7fffc3fff700 -1 rbd-nbd: failed to read nbd request 
header: (33) Numerical argument out of domain

What's that - have we seen that before? ("Numerical argument out of domain")

On 10.09.19 at 16:10, Jason Dillaman wrote:
> [Tue Sep 10 14:46:51 2019]  ? __schedule+0x2c5/0x850
> [Tue Sep 10 14:46:51 2019]  kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019]  ? kthread+0x121/0x140
> [Tue Sep 10 14:46:51 2019]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> [Tue Sep 10 14:46:51 2019]  ? kthread_park+0x90/0x90
> [Tue Sep 10 14:46:51 2019]  ret_from_fork+0x35/0x40
> Perhaps try it w/ ext4 instead of XFS?

I can try that, but I am skeptical; I am not sure that we are searching in the 
right place...

Why?
- we have been running hundreds of heavily used rbd-nbd instances in our xen 
dom-0 systems for 1.5 years now
- we never experienced problems like that in xen dom0 systems
- as described these instances run 12.2.5 ceph components with kernel 4.4.0+10
- the domUs (virtual machines) that interact heavily with that dom0 are using 
various filesystems
   -> probably the architecture of the blktap components leads to a different io 
scenario : https://wiki.xenproject.org/wiki/Blktap

Nevertheless I will try EXT4 on another system.

Regards
Marc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-10 Thread Jason Dillaman
On Tue, Sep 10, 2019 at 9:46 AM Marc Schöchlin  wrote:
>
> Hello Mike,
>
> as described, I set all the settings.
>
> Unfortunately it also crashed with these settings :-(
>
> Regards
> Marc
>
> [Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
> [Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered 
> data mode. Opts: (null)
> [Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. 
> (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP 
> +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN 
> -PCRE2 default-hierarchy=hybrid)
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
> [Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to .
> [Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Created slice 
> system-serial\x2dgetty.slice.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
> [Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File 
> System...
> [Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
> [Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport 
> module.
> [Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
> [Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush 
> runtime journal from PID 1
> [Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 
> o...@monad.swb.de).
> [Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
> [Tue Sep 10 12:26:01 2019] systemd-journald[497]: File 
> /var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or 
> uncleanly shut down, renaming and replacing.
> [Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered 
> data mode. Opts: (null)
> [Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered 
> data mode. Opts: (null)
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.659:2): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:3): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="/usr/bin/man" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:4): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="man_filter" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:5): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="man_groff" pid=904 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:6): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="lxc-container-default" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:7): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:8): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:9): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.723:10): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.723:11): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 
> comm="apparmor_parser"
> [Tue Sep 10 12:26:06 2019] new mount options do not match the existing 
> superblock, will be ignored
> [Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, 
> no debug enabled
> [Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
> [Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
> [Tue Sep 10 12:26:12 2019] NFSD: Using 

Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-10 Thread Marc Schöchlin
Hello Mike,

As described, I set all the settings.

Unfortunately it also crashed with these settings :-(

Regards
Marc

[Tue Sep 10 12:25:56 2019] Btrfs loaded, crc32c=crc32c-intel
[Tue Sep 10 12:25:57 2019] EXT4-fs (dm-0): mounted filesystem with ordered data 
mode. Opts: (null)
[Tue Sep 10 12:25:59 2019] systemd[1]: systemd 237 running in system mode. 
(+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP 
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 
default-hierarchy=hybrid)
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected virtualization xen.
[Tue Sep 10 12:25:59 2019] systemd[1]: Detected architecture x86-64.
[Tue Sep 10 12:25:59 2019] systemd[1]: Set hostname to .
[Tue Sep 10 12:26:01 2019] systemd[1]: Started ntp-systemd-netif.path.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice System Slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on udev Kernel Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Created slice 
system-serial\x2dgetty.slice.
[Tue Sep 10 12:26:01 2019] systemd[1]: Listening on Journal Socket.
[Tue Sep 10 12:26:01 2019] systemd[1]: Mounting POSIX Message Queue File 
System...
[Tue Sep 10 12:26:01 2019] RPC: Registered named UNIX socket transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered udp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp transport module.
[Tue Sep 10 12:26:01 2019] RPC: Registered tcp NFSv4.1 backchannel transport 
module.
[Tue Sep 10 12:26:01 2019] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[Tue Sep 10 12:26:01 2019] Loading iSCSI transport class v2.0-870.
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (tcp)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: Received request to flush 
runtime journal from PID 1
[Tue Sep 10 12:26:01 2019] Installing knfsd (copyright (C) 1996 
o...@monad.swb.de).
[Tue Sep 10 12:26:01 2019] iscsi: registered transport (iser)
[Tue Sep 10 12:26:01 2019] systemd-journald[497]: File 
/var/log/journal/cef15a6d1b80c9fbcb31a3a65aec21ad/system.journal corrupted or 
uncleanly shut down, renaming and replacing.
[Tue Sep 10 12:26:04 2019] EXT4-fs (dm-1): mounted filesystem with ordered data 
mode. Opts: (null)
[Tue Sep 10 12:26:05 2019] EXT4-fs (xvda1): mounted filesystem with ordered 
data mode. Opts: (null)
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.659:2): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="/usr/bin/lxc-start" pid=902 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:3): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="/usr/bin/man" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:4): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="man_filter" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.675:5): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="man_groff" pid=904 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:6): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="lxc-container-default" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:7): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="lxc-container-default-cgns" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:8): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="lxc-container-default-with-mounting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.687:9): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="lxc-container-default-with-nesting" pid=900 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.723:10): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="/usr/lib/snapd/snap-confine" pid=905 comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] audit: type=1400 audit(156866.723:11): 
apparmor="STATUS" operation="profile_load" profile="unconfined" 
name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=905 
comm="apparmor_parser"
[Tue Sep 10 12:26:06 2019] new mount options do not match the existing 
superblock, will be ignored
[Tue Sep 10 12:26:09 2019] SGI XFS with ACLs, security attributes, realtime, no 
debug enabled
[Tue Sep 10 12:26:09 2019] XFS (nbd0): Mounting V5 Filesystem
[Tue Sep 10 12:26:11 2019] XFS (nbd0): Starting recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] XFS (nbd0): Ending recovery (logdev: internal)
[Tue Sep 10 12:26:12 2019] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 
state recovery directory
[Tue Sep 10 12:26:12 2019] NFSD: starting 90-second grace period (net f0f8)
[Tue Sep 10 14:45:04 2019] block nbd0: Connection timed out
[Tue Sep 10 14:45:04 2019] block nbd0: 

Re: [ceph-users] reproducible rbd-nbd crashes

2019-09-10 Thread Marc Schöchlin
Hello Mike,

On 03.09.19 at 04:41, Mike Christie wrote:
> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>> Hello Mike,
>>
>> I am having a quick look at this on vacation because my coworker
>> reports daily and continuous crashes ;-)
>> Any updates here (I am aware that this is not very easy to fix)?
> I am still working on it. It basically requires rbd-nbd to be rewritten so
> that it preallocates its memory used for IO, and when it can't - like when
> doing network IO - it requires adding an interface to tell the kernel to
> not use allocation flags that can cause disk IO back onto the device.
>
> There are some workarounds like adding more memory and setting the vm
> values. For the latter, if you set:
>
> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
> because the kernel will immediately start to write dirty pages from the
> background worker threads, so we do not end up later needing to write
> out pages from the rbd-nbd thread to free up memory.

Sigh, I set this yesterday on my system ("sysctl vm.dirty_background_ratio=0") 
and got an additional crash this night :-(

I now restarted the system and invoked all of the following commands mentioned 
in your last mail:

sysctl vm.dirty_background_ratio=0
sysctl vm.dirty_ratio=0
sysctl vm.vfs_cache_pressure=0

Let's see if that helps
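
To make sure the settings also survive the next reboot, I additionally put them
into a sysctl drop-in file (a minimal sketch; the file name is just an example):

cat > /etc/sysctl.d/90-rbd-nbd-workaround.conf <<'EOF'
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
vm.vfs_cache_pressure = 0
EOF

# reload all sysctl configuration files
sysctl --system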

Regards

Marc


On 03.09.19 at 04:41, Mike Christie wrote:
> On 09/02/2019 06:20 AM, Marc Schöchlin wrote:
>> Hello Mike,
>>
>> I am having a quick look at this on vacation because my coworker
>> reports daily and continuous crashes ;-)
>> Any updates here (I am aware that this is not very easy to fix)?
> I am still working on it. It basically requires rbd-nbd to be rewritten so
> that it preallocates its memory used for IO, and when it can't - like when
> doing network IO - it requires adding an interface to tell the kernel to
> not use allocation flags that can cause disk IO back onto the device.
>
> There are some workarounds like adding more memory and setting the vm
> values. For the latter, if you set:
>
> vm.dirty_background_ratio = 0, then it looks like it avoids the problem
> because the kernel will immediately start to write dirty pages from the
> background worker threads, so we do not end up later needing to write
> out pages from the rbd-nbd thread to free up memory.
>
> or
>
> vm.dirty_ratio = 0, then it looks like it avoids the problem because the
> kernel will just write out the data right away, similar to above, but
> it is normally going to be written out from the thread that you are
> running your test from.
>
> and this seems optional and can result in other problems:
>
> vm.vfs_cache_pressure = 0, then at least for XFS it looks like we avoid
> one of the immediate problems where allocations would always cause the
> inode caches to be reclaimed and that memory to be written out to the
> device. For EXT4, I did not see a similar issue.
>
>> I think the severity of this problem (currently "minor") does not match
>> the consequences of this problem.
>>
>> This reproducible problem can cause:
>>
>>   * random service outage
>>   * data corruption
>>   * long recovery procedures on huge filesystems
>>
>> Is it adequate to increase the severity to major or critical?
>>
>> What might be the reason that rbd-nbd runs very reliably on my xen
>> servers as a storage repository?
>> (see https://github.com/vico-research-and-consulting/RBDSR/tree/v2.0 -
>> hundreds of devices, high workload)
>>
>> Regards
>> Marc
>>
>> On 15.08.19 at 20:07, Marc Schöchlin wrote:
>>> Hello Mike,
>>>
>>> On 15.08.19 at 19:57, Mike Christie wrote:
> Don't waste your time. I found a way to replicate it now.
>
 Just a quick update.

 Looks like we are trying to allocate memory in the IO path in a way that
 can swing back on us, so we can end up locking up. You are probably not
 hitting this with krbd in your setup because normally it's preallocating
 structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot
 preallocate some structs and cannot control the allocation flags for
 some operations initiated from userspace, so it's possible to hit this on
 every IO. I can replicate this now in a second just by doing a cp -r.

 It's not going to be a simple fix. We have had a similar issue for
 storage daemons like iscsid and multipathd since they were created. It's
 less likely to hit with them because you only hit the paths where they cannot
 control memory allocation behavior during recovery.

 I am looking into some things now.
>>> Great to hear that the problem is now identified.
>>>
>>> As described, I'm on vacation - if you need anything after September 8th, we can 
>>> probably invest some time to test upcoming fixes.
>>>
>>> Regards
>>> Marc
>>>
>>>
>> -- 
>> GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net
>>

___
ceph-users mailing list

Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-15 Thread Marc Schöchlin
Hello Mike,

On 15.08.19 at 19:57, Mike Christie wrote:
>
>> Don't waste your time. I found a way to replicate it now.
>>
>
> Just a quick update.
>
> Looks like we are trying to allocate memory in the IO path in a way that
> can swing back on us, so we can end up locking up. You are probably not
> hitting this with krbd in your setup because normally it's preallocating
> structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot
> preallocate some structs and cannot control the allocation flags for
> some operations initiated from userspace, so it's possible to hit this on
> every IO. I can replicate this now in a second just by doing a cp -r.
>
> It's not going to be a simple fix. We have had a similar issue for
> storage daemons like iscsid and multipathd since they were created. It's
> less likely to hit with them because you only hit the paths where they cannot
> control memory allocation behavior during recovery.
>
> I am looking into some things now.

Great to hear that the problem is now identified.

As described, I'm on vacation - if you need anything after September 8th, we can 
probably invest some time to test upcoming fixes.

Regards
Marc


-- 
GPG encryption available: 0x670DCBEC/pool.sks-keyservers.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-15 Thread Mike Christie
On 08/14/2019 06:55 PM, Mike Christie wrote:
> On 08/14/2019 02:09 PM, Mike Christie wrote:
>> On 08/14/2019 07:35 AM, Marc Schöchlin wrote:
> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
> He removed that code from the krbd. I will ping him on that.
>>>
>>> Interesting. I activated Coredumps for that processes - probably we can
>>> find something interesting here...
>>>
>>
>> Can you replicate the problem with timeout=0 on a 4.4 kernel (ceph
>> version does not matter as long as its known to hit the problem). When
>> you start to see IO hang and it gets jammed up can you do:
>>
>> dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
>>
>> and give me the waiting-tasks.txt so I can check if we are stuck in the
>> kernel waiting for memory.
> 
> Don't waste your time. I found a way to replicate it now.
> 

Just a quick update.

Looks like we are trying to allocate memory in the IO path in a way that
can swing back on us, so we can end up locking up. You are probably not
hitting this with krbd in your setup because normally it's preallocating
structs, using flags like GFP_NOIO, etc. For rbd-nbd, we cannot
preallocate some structs and cannot control the allocation flags for
some operations initiated from userspace, so it's possible to hit this on
every IO. I can replicate this now in a second just by doing a cp -r.

It's not going to be a simple fix. We have had a similar issue for
storage daemons like iscsid and multipathd since they were created. It's
less likely to hit with them because you only hit the paths where they cannot
control memory allocation behavior during recovery.

I am looking into some things now.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Mike Christie
On 08/14/2019 02:09 PM, Mike Christie wrote:
> On 08/14/2019 07:35 AM, Marc Schöchlin wrote:
 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
 He removed that code from the krbd. I will ping him on that.
>>
>> Interesting. I activated Coredumps for that processes - probably we can
>> find something interesting here...
>>
> 
> Can you replicate the problem with timeout=0 on a 4.4 kernel (ceph
> version does not matter as long as its known to hit the problem). When
> you start to see IO hang and it gets jammed up can you do:
> 
> dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt
> 
> and give me the waiting-tasks.txt so I can check if we are stuck in the
> kernel waiting for memory.

Don't waste your time. I found a way to replicate it now.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Mike Christie
On 08/14/2019 07:35 AM, Marc Schöchlin wrote:
>>> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
>>> He removed that code from the krbd. I will ping him on that.
> 
> Interesting. I activated coredumps for those processes - probably we can
> find something interesting here...
> 

Can you replicate the problem with timeout=0 on a 4.4 kernel (the ceph
version does not matter as long as it's known to hit the problem)? When
you start to see IO hang and it gets jammed up, can you do:

dmesg -c; echo w >/proc/sysrq-trigger; dmesg -c >waiting-tasks.txt

and give me the waiting-tasks.txt so I can check if we are stuck in the
kernel waiting for memory.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] reproducible rbd-nbd crashes

2019-08-14 Thread Marc Schöchlin
Hello Mike,

see my inline comments.

On 14.08.19 at 02:09, Mike Christie wrote:
>>> -
>>> Previous tests crashed in a reproducible manner with "-P 1" (single io 
>>> gzip/gunzip) after a few minutes up to 45 minutes.
>>>
>>> Overview of my tests:
>>>
>>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 
>>> 120s device timeout
>>>   -> 18 hour testrun was successful, no dmesg output
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>>> device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created without reboot
>>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>>> while running the test
>>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>>> device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created
>>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>>> while running the test
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no 
>>> timeout
>>>   -> failed after < 10 minutes
>>>   -> system runs in a high system load, system is almost unusable, unable 
>>> to shutdown the system, hard reset of vm necessary, manual exclusive lock 
>>> removal is necessary before remapping the device

There is something new compared to yesterday: three days ago I downgraded a 
production system to client version 12.2.5.
This night this machine also crashed. So it seems that rbd-nbd is broken in 
general, also with release 12.2.5 and potentially before.

The new (updated) list:

- FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> crashed in production while snapshot trimming is running on that pool
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to 
shutdown the system, hard reset of vm necessary, manual exclusive lock removal 
is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created

>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
>>> 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>>> errors, map/mount can be re-created
>> How many CPUs and how much memory does the VM have?

Characteristics of the crashed VM:

  * Ubuntu 18.04, with kernel 4.15, Ceph client 12.2.5
  * Services: NFS kernel server, nothing else
  * Crash behavior:
  o daily task for snapshot creation/deletion started at 19:00
  o a daily database backup started at 19:00, which created
  + 120 IOPS write and 1 IOPS read
  + 22K sectors per second write, 0 sectors per second read
  + 97 MBit inbound and 97 MBit outbound network usage (NFS server)
  o we had slow requests at the time of the crash
  o the rbd-nbd process terminated 25 min later without a segfault
  o the NFS usage created a 5 min load of 10 from the start, 5K context 
switches/sec
  o memory usage (kernel+userspace) was 10% of the system
  o no swap usage
  * ceph.conf
[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 33554432
rbd cache target dirty = 25165824
rbd cache max dirty age = 3
rbd readahead max bytes = 4194304
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  * 4 CPUs
  * 6 GB RAM
  * Non default Sysctl Settings
vm.swappiness = 1
fs.aio-max-nr = 262144
fs.file-max = 100
kernel.pid_max = 4194303
vm.zone_reclaim_mode = 0
kernel.randomize_va_space = 0
kernel.panic = 0
kernel.panic_on_oops = 0


>> I'm not sure which test it covers above, but for
>> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
>> the command that probably triggered the timeout got stuck in safe_write
>> 

Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Mike Christie
On 08/13/2019 07:04 PM, Mike Christie wrote:
> On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
>> Hello Jason,
>>
>> it seems that there is something wrong in the rbd-nbd implementation.
>> (added this information also at  https://tracker.ceph.com/issues/40822)
>>
>> The problem not seems to be related to kernel releases, filesystem types or 
>> the ceph and network setup.
>> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 
>> seems to have the described problem.
>>
>> This night a 18 hour testrun with the following procedure was successful:
>> -
>> #!/bin/bash
>> set -x
>> while true; do
>>date
>>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 
>> 2 gzip -v
>>date
>>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 
>> -n 2 gunzip -v
>> done
>> -
>> Previous tests crashed in a reproducible manner with "-P 1" (single io 
>> gzip/gunzip) after a few minutes up to 45 minutes.
>>
>> Overview of my tests:
>>
>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 
>> 120s device timeout
>>   -> 18 hour testrun was successful, no dmesg output
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>> device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created without reboot
>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>> while running the test
>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
>> device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created
>>   -> parallel krbd device usage with 99% io usage worked without a problem 
>> while running the test
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no 
>> timeout
>>   -> failed after < 10 minutes
>>   -> system runs in a high system load, system is almost unusable, unable to 
>> shutdown the system, hard reset of vm necessary, manual exclusive lock 
>> removal is necessary before remapping the device
> 
> Did you see Mykola's question on the tracker about this test? Did the
> system become unusable at 13:00?
> 
> Above you said it took less than 10 minutes, so we want to clarify if
> the test started at 12:39 and failed at 12:49 or if it started at 12:49
> and failed by 13:00.
> 
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
>> 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
>> errors, map/mount can be re-created
> 
> How many CPUs and how much memory does the VM have?
> 
> I'm not sure which test it covers above, but for
> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
> the command that probably triggered the timeout got stuck in safe_write
> or write_fd, because we see:
> 
> // Command completed and right after this log message we try to write
> the reply and data to the nbd.ko module.
> 
> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
> [4500 READ 24043755000~2 0]
> 
> // We got stuck and 2 minutes go by and so the timeout fires. That kills
> the socket, so we get an error here and after that rbd-nbd is going to exit.
> 
> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500
> READ 24043755000~2 0]: failed to write replay data: (32) Broken pipe
> 
> We could hit this in a couple ways:
> 
> 1. The block layer sends a command that is larger than the socket's send
> buffer limits. These are those values you sometimes set in sysctl.conf like:
> 
> net.core.rmem_max
> net.core.wmem_max
> net.core.rmem_default
> net.core.wmem_default
> net.core.optmem_max
> 
> There does not seem to be any checks/code to make sure there is some
> alignment with limits. I will send a patch but that will not help you
> right now. The max io size for nbd is 128k so make sure your net values
> are large enough. Increase the values in sysctl.conf and retry if they
> were too small.

Not sure what I was thinking. Just checked the logs and we have done IO
of the same size that got stuck and it was fine, so the socket sizes
should be ok.

We still need to add code to make sure IO sizes and the af_unix socket
size limits match up.


> 
> 2. If memory is low on the system, we could be stuck trying to allocate
> memory in the kernel in that code path too.
> 
> rbd-nbd just uses more memory per device, so it could be why we do not
> see a problem with krbd.
> 
> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
> He removed that code from the krbd. I will ping him on that.
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Mike Christie
On 07/31/2019 05:20 AM, Marc Schöchlin wrote:
> Hello Jason,
> 
> it seems that there is something wrong in the rbd-nbd implementation.
> (added this information also at  https://tracker.ceph.com/issues/40822)
> 
> The problem does not seem to be related to kernel releases, filesystem types, 
> or the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.
> 
> This night a 18 hour testrun with the following procedure was successful:
> -
> #!/bin/bash
> set -x
> while true; do
>date
>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 
> gzip -v
>date
>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 
> -n 2 gunzip -v
> done
> -
> Previous tests crashed in a reproducible manner with "-P 1" (single io 
> gzip/gunzip) after a few minutes up to 45 minutes.
> 
> Overview of my tests:
> 
> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
> device timeout
>   -> 18 hour testrun was successful, no dmesg output
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created without reboot
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>   -> failed after < 10 minutes
>   -> system runs in a high system load, system is almost unusable, unable to 
> shutdown the system, hard reset of vm necessary, manual exclusive lock 
> removal is necessary before remapping the device

Did you see Mykola's question on the tracker about this test? Did the
system become unusable at 13:00?

Above you said it took less than 10 minutes, so we want to clarify if
the test started at 12:39 and failed at 12:49 or if it started at 12:49
and failed by 13:00.

> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
> 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created

How many CPUs and how much memory does the VM have?

I'm not sure which test it covers above, but for
test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
the command that probably triggered the timeout got stuck in safe_write
or write_fd, because we see:

// Command completed and right after this log message we try to write
the reply and data to the nbd.ko module.

2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
[4500 READ 24043755000~2 0]

// We got stuck and 2 minutes go by and so the timeout fires. That kills
the socket, so we get an error here and after that rbd-nbd is going to exit.

2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500
READ 24043755000~2 0]: failed to write replay data: (32) Broken pipe

We could hit this in a couple ways:

1. The block layer sends a command that is larger than the socket's send
buffer limits. These are those values you sometimes set in sysctl.conf like:

net.core.rmem_max
net.core.wmem_max
net.core.rmem_default
net.core.wmem_default
net.core.optmem_max

There does not seem to be any checks/code to make sure there is some
alignment with limits. I will send a patch but that will not help you
right now. The max IO size for nbd is 128k, so make sure your net values
are large enough. Increase the values in sysctl.conf and retry if they
were too small (see the sketch below these three points).

2. If memory is low on the system, we could be stuck trying to allocate
memory in the kernel in that code path too.

rbd-nbd just uses more memory per device, so it could be why we do not
see a problem with krbd.

3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
He removed that code from the krbd. I will ping him on that.
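
Regarding case 1, a sketch of how those limits could be raised (the values are
only examples, not a recommendation; they just need to comfortably cover the
128k max nbd IO size plus protocol overhead):

sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576
sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576
sysctl -w net.core.optmem_max=131072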


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-08-13 Thread Marc Schöchlin
Hello Jason,

thanks for your response.
See my inline comments.

On 31.07.19 at 14:43, Jason Dillaman wrote:
> On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin  wrote:
>
>
> The problem does not seem to be related to kernel releases, filesystem types, 
> or the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.
>  ...
>
> It's basically just a log message tweak and some changes to how the
> process is daemonized. If you could re-test w/ each release after
> 12.2.5 and pin-point where the issue starts occurring, we would have
> something more to investigate.

Are there changes related to https://tracker.ceph.com/issues/23891?


You showed me the very small amount of changes in rbd-nbd.
What about librbd, librados, ...?

What else can we do to find the detailed reason for the crash?
Do you think it is useful to activate coredump creation for that process?
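
In case that is useful, this is roughly how I would enable them for the rbd-nbd
process (a sketch; the core pattern and target directory are just examples):

# write core files to a known location instead of the distribution default
mkdir -p /var/crash
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

# remove the core size limit in the shell that starts rbd-nbd
ulimit -c unlimited
rbd-nbd map rbd_hdd/int-nfs-001_srv-ceph --id nfs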

>> What's next? Is it a good idea to do a binary search between 12.2.12 and 
>> 12.2.5?
>>
Due to the absence of a coworker I had almost no capacity to execute deeper 
tests on this problem.
But I can say that I reproduced the problem also with release 12.2.12.

The new (updated) list:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
device timeout
  -> 18 hour testrun was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to 
shutdown the system, hard reset of vm necessary, manual exclusive lock removal 
is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created

Regards
Marc

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-07-31 Thread Jason Dillaman
On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin  wrote:
>
> Hello Jason,
>
> it seems that there is something wrong in the rbd-nbd implementation.
> (added this information also at  https://tracker.ceph.com/issues/40822)
>
> The problem does not seem to be related to kernel releases, filesystem types, 
> or the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.

Here is the complete delta between the two releases in rbd-nbd:

$ git diff v12.2.5..v12.2.12 -- .
diff --git a/src/tools/rbd_nbd/rbd-nbd.cc b/src/tools/rbd_nbd/rbd-nbd.cc
index 098d9925ca..aefdbd36e0 100644
--- a/src/tools/rbd_nbd/rbd-nbd.cc
+++ b/src/tools/rbd_nbd/rbd-nbd.cc
@@ -595,14 +595,13 @@ static int do_map(int argc, const char *argv[],
Config *cfg)
   cerr << err << std::endl;
   return r;
 }
-
 if (forker.is_parent()) {
-  global_init_postfork_start(g_ceph_context);
   if (forker.parent_wait(err) != 0) {
 return -ENXIO;
   }
   return 0;
 }
+global_init_postfork_start(g_ceph_context);
   }

   common_init_finish(g_ceph_context);
@@ -724,8 +723,8 @@ static int do_map(int argc, const char *argv[], Config *cfg)

   if (info.size > ULONG_MAX) {
 r = -EFBIG;
-cerr << "rbd-nbd: image is too large (" << prettybyte_t(info.size)
- << ", max is " << prettybyte_t(ULONG_MAX) << ")" << std::endl;
+cerr << "rbd-nbd: image is too large (" << byte_u_t(info.size)
+ << ", max is " << byte_u_t(ULONG_MAX) << ")" << std::endl;
 goto close_nbd;
   }

@@ -761,9 +760,8 @@ static int do_map(int argc, const char *argv[], Config *cfg)
 cout << cfg->devpath << std::endl;

 if (g_conf->daemonize) {
-  forker.daemonize();
-  global_init_postfork_start(g_ceph_context);
   global_init_postfork_finish(g_ceph_context);
+  forker.daemonize();
 }

 {

It's basically just a log message tweak and some changes to how the
process is daemonized. If you could re-test w/ each release after
12.2.5 and pin-point where the issue starts occurring, we would have
something more to investigate.
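
In case it helps with narrowing that down, switching just the client packages
between releases could look roughly like this (a sketch; the exact version
strings depend on the configured ceph repository):

# show which rbd-nbd versions the configured repositories offer
apt-cache madison rbd-nbd

# install one specific intermediate release (the version string is an example)
apt-get install --allow-downgrades \
    rbd-nbd=12.2.8-1xenial librbd1=12.2.8-1xenial librados2=12.2.8-1xenial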

> This night a 18 hour testrun with the following procedure was successful:
> -
> #!/bin/bash
> set -x
> while true; do
>date
>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 
> gzip -v
>date
>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 
> -n 2 gunzip -v
> done
> -
> Previous tests crashed in a reproducible manner with "-P 1" (single io 
> gzip/gunzip) after a few minutes up to 45 minutes.
>
> Overview of my tests:
>
> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
> device timeout
>   -> 18 hour testrun was successful, no dmesg output
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created without reboot
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>   -> failed after < 10 minutes
>   -> system runs in a high system load, system is almost unusable, unable to 
> shutdown the system, hard reset of vm necessary, manual exclusive lock 
> removal is necessary before remapping the device
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
> 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>
> All device timeouts were set separately by the nbd_set_ioctl tool because 
> the luminous rbd-nbd does not provide the possibility to define timeouts.
>
> What's next? Is it a good idea to do a binary search between 12.2.12 and 12.2.5?
>
> From my point of view (without in-depth knowledge of rbd-nbd/librbd) my 
> assumption is that this problem might be caused by rbd-nbd code and not by 
> librbd.
> The probability that a bug like this survives undiscovered in librbd for such a 
> long time seems low to me :-)
>
> Regards
> Marc
>
> On 29.07.19 at 22:25, Marc Schöchlin wrote:
> > Hello Jason,
> >
> > I updated the ticket https://tracker.ceph.com/issues/40822
> >
> > On 24.07.19 at 19:20, Jason Dillaman wrote:
> >> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
> >>> Testing with a 10.2.5 librbd/rbd-nbd is currently not that easy for me, 
> >>> because the ceph apt source does not contain that version.
> >>> Do you know a package source?
> >> All the upstream packages should be available here [1], 

Re: [ceph-users] reproducible rbd-nbd crashes

2019-07-31 Thread Marc Schöchlin
Hello Jason,

it seems that there is something wrong in the rbd-nbd implementation.
(added this information also at  https://tracker.ceph.com/issues/40822)

The problem does not seem to be related to kernel releases, filesystem types, or the 
ceph and network setup.
Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
to have the described problem.

This night an 18 hour test run with the following procedure was successful:
-
#!/bin/bash
set -x
while true; do
   date
   find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 
gzip -v
   date
   find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 -n 
2 gunzip -v
done
-
Previous tests crashed in a reproducible manner with "-P 1" (single io 
gzip/gunzip) after a few minutes up to 45 minutes.

Overview of my tests:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
device timeout
  -> 18 hour testrun was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to 
shutdown the system, hard reset of vm necessary, manual exclusive lock removal 
is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created

All device timeouts were set separately by the nbd_set_ioctl tool because 
the luminous rbd-nbd does not provide the possibility to define timeouts.

What's next? Is it a good idea to do a binary search between 12.2.12 and 12.2.5?

From my point of view (without in-depth knowledge of rbd-nbd/librbd) my 
assumption is that this problem might be caused by rbd-nbd code and not by 
librbd.
The probability that a bug like this survives undiscovered in librbd for such a 
long time seems low to me :-)

Regards
Marc

On 29.07.19 at 22:25, Marc Schöchlin wrote:
> Hello Jason,
>
> I updated the ticket https://tracker.ceph.com/issues/40822
>
> On 24.07.19 at 19:20, Jason Dillaman wrote:
>> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
>>> Testing with a 10.2.5 librbd/rbd-nbd is currently not that easy for me, 
>>> because the ceph apt source does not contain that version.
>>> Do you know a package source?
>> All the upstream packages should be available here [1], including 12.2.5.
> Ah okay, i will test this tommorow.
>> Did you pull the OSD blocked ops stats to figure out what is going on
>> with the OSDs?
> Yes, see referenced data in the ticket 
> https://tracker.ceph.com/issues/40822#note-15
>
> Regards
> Marc
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com