There have been several similar reports about this on the mailing list [1][2][3][4], and they are invariably the result of skipping step 6 of the Luminous upgrade guide [5]. The new (as of Luminous) 'profile rbd'-style caps are designed to simplify cap management going forward [6].

TL;DR: your OpenStack CephX users need permission to blacklist dead clients that failed to properly release the exclusive lock.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022278.html
[2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022694.html
[3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026496.html
[4] https://www.spinics.net/lists/ceph-users/msg45665.html
[5] http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
[6] http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
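For example (a sketch only: "client.cinder", the "volumes" pool, and the volume name below are the stock names from the guide in [6], not taken from this thread), you can first check whether a dead client still holds the exclusive lock on a stuck volume:

  # rbd status volumes/volume-<uuid>
  # rbd lock list volumes/volume-<uuid>

Then inspect the user's caps (a mon cap of only 'allow r' has no permission to blacklist) and apply the step 6 update from [5], which adds the blacklist permission while leaving the user's existing OSD caps unchanged:

  # ceph auth get client.cinder
  # ceph auth caps client.cinder \
        mon 'allow r, allow command "osd blacklist"' \
        osd '<existing OSD caps for the user>'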
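Alternatively, you can move the users to the new profile-based caps described in [6], which grant the blacklist permission implicitly. Assuming the stock glance and cinder user and pool names from that guide (substitute your own):

  # ceph auth caps client.glance \
        mon 'profile rbd' \
        osd 'profile rbd pool=images'
  # ceph auth caps client.cinder \
        mon 'profile rbd' \
        osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'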
On Fri, Jul 6, 2018 at 7:55 AM Gary Molenkamp <[email protected]> wrote:
> Good morning all,
>
> After losing all power to our DC last night due to a storm, nearly all
> of the volumes in our Pike cluster are unmountable. Of the 30 VMs in
> use at the time, only one has been able to successfully mount and boot
> from its rootfs. We are using Ceph as the backend storage for cinder
> and glance. Any help or pointers to bring this back online would be
> appreciated.
>
> What most of the volumes are seeing is:
>
> [ 2.622252] SGI XFS with ACLs, security attributes, no debug enabled
> [ 2.629285] XFS (sda1): Mounting V5 Filesystem
> [ 2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
> [ 2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
> [ 2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
> [ 2.850146] blk_update_request: I/O error, dev sda, sector 8399897
>
> or:
>
> [ 2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
> [ 2.594319] EXT4-fs (vda1): write access will be enabled during recovery
> [ 2.957742] print_req_error: I/O error, dev vda, sector 227328
> [ 2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
> [ 2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
> [ 2.973076] print_req_error: I/O error, dev vda, sector 229384
>
> As a test for one of the less critical VMs, I deleted the VM and
> mounted its volume on the one VM I managed to start. The results were
> not promising:
>
> # dmesg | tail
> [ 5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
> [ 7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
> [29374.967315] scsi 2:0:0:1: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
> [29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
> [29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
> [29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
> [29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
> [29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [29375.005968] sdb: sdb1
> [29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk
>
> # parted /dev/sdb
> GNU Parted 3.1
> Using /dev/sdb
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: QEMU QEMU HARDDISK (scsi)
> Disk /dev/sdb: 42.9GB
> Sector size (logical/physical): 512B/512B
> Partition Table: msdos
> Disk Flags:
>
> Number  Start   End     Size    Type     File system  Flags
>  1      1049kB  42.9GB  42.9GB  primary  xfs          boot
>
> # mount -t xfs /dev/sdb temp
> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>        missing codepage or helper program, or other error
>
>        In some cases useful info is found in syslog - try
>        dmesg | tail or so.
>
> # xfs_repair /dev/sdb
> Phase 1 - find and verify superblock...
> bad primary superblock - bad magic number !!!
>
> attempting to find secondary superblock...
>
> This eventually fails. The Ceph cluster looks healthy, and I can
> export the volumes from rbd. I can find no other errors in Ceph or
> OpenStack indicating a fault in either system.
>
> - Is this recoverable?
>
> - What happened to all of these volumes, and can this be prevented
>   from occurring again? Note that any VM that was shut down at the
>   time of the outage appears to be fine.
>
> Relevant versions:
>
> Base OS: all CentOS 7.5
> Ceph: Luminous 12.2.5-0
> OpenStack: latest Pike releases in centos-release-openstack-pike-1-1
>   nova 16.1.4-1
>   cinder 11.1.1-1
>
> --
> Gary Molenkamp              Computer Science/Science Technology Services
> Systems Administrator       University of Western Ontario
> [email protected]             http://www.csd.uwo.ca
> (519) 661-2111 x86882       (519) 661-3566

--
Jason
