On Wed, Oct 25, 2017 at 10:15 AM Karun Josy <[email protected]> wrote:
> Hello everyone! :)
>
> I have an interesting problem. For a few weeks, we've been testing
> Luminous in a cluster made up of 8 servers and with about 20 SSD disks
> almost evenly distributed. It is running erasure coding.
>
> Yesterday, we decided to bring the cluster down to a minimum of 8 servers
> and 1 disk per server.
>
> So, we went ahead and removed the additional disks from the Ceph cluster
> by executing commands like this from the admin server:
>
> -------------------
> $ ceph osd out osd.20
> osd.20 is already out.
> $ ceph osd down osd.20
> marked down osd.20.
> $ ceph osd purge osd.20 --yes-i-really-mean-it
> Error EBUSY: osd.20 is not `down`.

These commands remove the record of the OSD from the cluster's point of
view. But I don't see you executing *any* commands that would remove the
OSD's local disk state? Whatever orchestration you used to install Ceph
(ceph-deploy, or chef/puppet/ansible scripts) should have options for
that. So presumably the OSDs were turning on and trying to get themselves
back into the system, and you got yourself into trouble.
-Greg

> So I logged in to the host it resides on and killed it:
> systemctl stop ceph-osd@26
> $ ceph osd purge osd.20 --yes-i-really-mean-it
> purged osd.20
> --------
>
> We waited for the cluster to be healthy once again and I physically
> removed the disks (hot swap, connected to an LSI 3008 controller). A few
> minutes after that, I needed to turn off one of the OSD servers to swap
> out a piece of hardware inside. So, I issued:
>
> ceph osd set noout
>
> And proceeded to turn off that 1 OSD server.
>
> But then the interesting thing happened. Once that 1 server came back up,
> the cluster all of a sudden showed that out of the 8 nodes, only 2 were up!
>
> 8 (2 up, 5 in)
>
> Even more interesting is that it seems Ceph, in each OSD server, still
> thinks the missing disks are there!
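To make Greg's point concrete: removing an OSD from the cluster maps is only half the job; the daemon must be stopped and the disk's local state wiped, or it can re-register itself. A hedged sketch of a fuller removal sequence on Luminous, assuming ceph-disk/filestore OSDs and default paths (the OSD id 20 and device /dev/sdX are placeholders):

```shell
# Stop new data landing on the OSD, then stop the daemon so it is truly down
ceph osd out osd.20
systemctl stop ceph-osd@20

# Remove it from CRUSH, auth keys, and the osdmap in one step
ceph osd purge osd.20 --yes-i-really-mean-it

# Drop the data-partition mount and wipe local disk state so the OSD
# cannot be re-activated and try to rejoin the cluster
umount /var/lib/ceph/osd/ceph-20
ceph-disk zap /dev/sdX
```

On ceph-volume-managed OSDs the last step would instead be `ceph-volume lvm zap`, but either way the on-disk state has to be destroyed before pulling the drive.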
> When I start ceph on each OSD server with "systemctl start ceph-osd.target",
> /var/log/ceph gets filled with logs for disks that are not supposed to
> exist anymore.
>
> The contents of the logs show something like:
>
> # cat /var/log/ceph/ceph-osd.7.log
> 2017-10-20 08:45:16.389432 7f8ee6e36d00  0 set uid:gid to 167:167 (ceph:ceph)
> 2017-10-20 08:45:16.389449 7f8ee6e36d00  0 ceph version 12.2.1
> (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process
> (unknown), pid 2591
> 2017-10-20 08:45:16.389639 7f8ee6e36d00 -1 ** ERROR: unable to open OSD
> superblock on /var/lib/ceph/osd/ceph-7: (2) No such file or directory
> 2017-10-20 08:45:36.639439 7fb389277d00  0 set uid:gid to 167:167 (ceph:ceph)
>
> The actual Ceph cluster sees only 8 disks, as you can see here:
>
> $ ceph osd tree
> ID  CLASS WEIGHT  TYPE NAME                 STATUS REWEIGHT PRI-AFF
>  -1       7.97388 root default
>  -3       1.86469     host ceph-las1-a1-osd
>   1   ssd 1.86469         osd.1               down        0 1.00000
>  -5       0.87320     host ceph-las1-a2-osd
>   2   ssd 0.87320         osd.2               down        0 1.00000
>  -7       0.87320     host ceph-las1-a3-osd
>   4   ssd 0.87320         osd.4               down  1.00000 1.00000
>  -9       0.87320     host ceph-las1-a4-osd
>   8   ssd 0.87320         osd.8                 up  1.00000 1.00000
> -11       0.87320     host ceph-las1-a5-osd
>  12   ssd 0.87320         osd.12              down  1.00000 1.00000
> -13       0.87320     host ceph-las1-a6-osd
>  17   ssd 0.87320         osd.17                up  1.00000 1.00000
> -15       0.87320     host ceph-las1-a7-osd
>  21   ssd 0.87320         osd.21              down  1.00000 1.00000
> -17       0.87000     host ceph-las1-a8-osd
>  28   ssd 0.87000         osd.28              down        0 1.00000
>
> Linux, in the OSD servers, seems to also think the disks are in:
>
> # df -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sde2       976M  183M  727M  21% /boot
> /dev/sdd1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-7
> /dev/sdc1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-6
> /dev/sda1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-4
> /dev/sdb1        97M  5.4M   92M   6% /var/lib/ceph/osd/ceph-5
> tmpfs           6.3G     0  6.3G   0% /run/user/0
>
> It should show only one disk, not 4.
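The stale `df` entries above are mounts left behind for purged OSDs. A hedged sketch of how one might find and clear them, assuming the default `/var/lib/ceph/osd/ceph-<id>` mount layout and a working `ceph osd ls` (run it on each OSD host; verify each id before unmounting):

```shell
# For every mounted OSD data directory, check whether its id still
# exists in the cluster; if not, unmount it and remove the empty dir.
for m in /var/lib/ceph/osd/ceph-*; do
    id=${m##*-}                          # e.g. /var/lib/ceph/osd/ceph-7 -> 7
    if ! ceph osd ls | grep -qx "$id"; then
        umount "$m" && rmdir "$m"        # stale mount for a purged OSD
    fi
done
```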
> I tried to issue the removal commands again, this time on the
> OSD server itself:
>
> $ ceph osd out osd.X
> osd.X does not exist.
>
> $ ceph osd purge osd.X --yes-i-really-mean-it
> osd.X does not exist
>
> Yet, if I again issue "systemctl start ceph-osd.target", /var/log/ceph
> again shows logs for a disk that does not exist (to make sure, I deleted
> all logs prior).
>
> So, it seems, somewhere, Ceph in the OSD still thinks there should be
> more disks?
>
> The Ceph cluster is unusable though. We've tried everything to bring it
> back again. But as Dr. Bones would say, it's dead, Jim.
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
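The "osd.X does not exist" errors are expected: the cluster maps no longer know those OSDs. What keeps respawning on each host is the leftover per-OSD systemd instance under `ceph-osd.target`. A hedged sketch of silencing them, with the instance numbers 5, 6, and 7 taken from the stale log files above as examples:

```shell
# Disable the per-OSD template units so ceph-osd.target no longer
# tries to start daemons for OSDs that were purged from the cluster
systemctl disable ceph-osd@5 ceph-osd@6 ceph-osd@7

# Clear any failed-unit state left by the crash-looping daemons
systemctl reset-failed 'ceph-osd@*'

# Optional: remove the stale log files for the purged OSDs
rm -f /var/log/ceph/ceph-osd.{5,6,7}.log
```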
