Re: [ceph-users] fyi: Luminous 12.2.7 pulled wrong osd disk, resulted in node down
On Wed, Aug 1, 2018 at 10:38 PM, Marc Roos wrote:
>
> Today we pulled the wrong disk from a ceph node. And that made the
> whole node go down/be unresponsive. Even to a simple ping. I cannot
> find too much about this in the log files. But I expect that the
> /usr/bin/ceph-osd process caused a kernel panic.

That would most likely be a kernel bug. Someone would probably need to
look at a vmcore to work out what happened.
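Capturing that vmcore requires kdump to be set up before the next panic. A minimal sketch for CentOS 7, assuming memory can be reserved for the crash kernel and the default /var/crash dump target is acceptable:

  # install the kdump tooling (CentOS 7)
  yum install -y kexec-tools

  # the crash kernel needs reserved memory, e.g. crashkernel=auto on the
  # kernel command line (/etc/default/grub, then regenerate grub.cfg and reboot)
  grep -o 'crashkernel=[^ ]*' /proc/cmdline

  # enable and start the kdump service
  systemctl enable kdump
  systemctl start kdump

  # after a panic the dump lands under /var/crash/<timestamp>/vmcore and can be
  # inspected with the crash utility against the matching kernel-debuginfo:
  #   crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<dir>/vmcore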
[ceph-users] fyi: Luminous 12.2.7 pulled wrong osd disk, resulted in node down
Today we pulled the wrong disk from a ceph node. And that made the whole
node go down/be unresponsive. Even to a simple ping. I cannot find too
much about this in the log files. But I expect that the /usr/bin/ceph-osd
process caused a kernel panic.

Linux c01 3.10.0-693.11.1.el7.x86_64
CentOS Linux release 7.4.1708 (Core)

libcephfs2-12.2.7-0.el7.x86_64
ceph-mon-12.2.7-0.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
ceph-selinux-12.2.7-0.el7.x86_64
ceph-osd-12.2.7-0.el7.x86_64
ceph-mgr-12.2.7-0.el7.x86_64
ceph-12.2.7-0.el7.x86_64
python-cephfs-12.2.7-0.el7.x86_64
ceph-common-12.2.7-0.el7.x86_64
ceph-mds-12.2.7-0.el7.x86_64
ceph-radosgw-12.2.7-0.el7.x86_64
ceph-base-12.2.7-0.el7.x86_64

Aug 1 11:01:01 c02 systemd: Started Session 8331 of user root.
Aug 1 11:01:01 c02 systemd: Starting Session 8331 of user root.
Aug 1 11:01:01 c02 systemd: Starting Session 8331 of user root.
Aug 1 11:03:08 c03 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xc0232e60
Aug 1 11:03:08 c03 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xc0232e60
Aug 1 11:03:33 c03 kernel: XFS (sdf1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xc0232e60
Aug 1 11:03:33 c03 kernel: XFS (sdf1): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c. Return address = 0xc0232e60
Aug 1 11:03:34 c02 kernel: libceph: osd5 down
Aug 1 11:03:34 c02 kernel: libceph: osd5 down
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656719 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6816 osd.12 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656719 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6816 osd.12 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656746 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6812 osd.14 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656761 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6804 osd.15 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656773 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6814 osd.16 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656746 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6812 osd.14 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656761 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6804 osd.15 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:04 c02 ceph-osd: 2018-08-01 11:05:04.656773 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6814 osd.16 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:44.656717)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657034 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6816 osd.12 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:45.657031)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657034 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6816 osd.12 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:45.657031)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657067 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6812 osd.14 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:45.657031)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657079 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6804 osd.15 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:45.657031)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657089 7f1f1e764700 -1 osd.9 22452 heartbeat_check: no reply from 192.168.10.113:6814 osd.16 since back 2018-08-01 11:04:44.365869 front 2018-08-01 11:04:44.365869 (cutoff 2018-08-01 11:04:45.657031)
Aug 1 11:05:05 c02 ceph-osd: 2018-08-01 11:05:05.657067 7f1f1e764700 -1 osd.9 22452 heartbeat_check:
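For what it's worth, the xfs_do_force_shutdown(0x2) entries appear to be the expected log-I/O-error shutdown once the pulled disk disappeared. The heartbeat_check lines are osd.9 on c02 reporting that the OSDs on 192.168.10.113 have not answered a heartbeat since 11:04:44; the cutoff is simply the current time minus the heartbeat grace period (20 s by default, which matches the timestamps above), after which the peer gets reported down to the monitors. A quick way to confirm the grace value and the cluster's view of which OSDs went down (a sketch, assuming it is run on the node hosting osd.9 with its admin socket available):

  # osd_heartbeat_grace defaults to 20 seconds; confirm what osd.9 is using
  ceph daemon osd.9 config get osd_heartbeat_grace

  # cluster-wide view of how many / which OSDs are currently marked down
  ceph osd stat
  ceph osd tree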