This is going to sound odd, and if I hadn't been issuing all commands on the
monitor I would swear I had run 'rm -rf' from the shell of the OSD in the
/var/lib/ceph/osd/ceph-2/ directory. After creating the pool/rbd and getting an
error from 'rbd info', I saw an OSD down/out, so I went to its shell, and the
ceph-osd daemon's data is gone. I'll assume I erased it, but how do I recover
this cluster without doing a purge/purgedata reinstall? (A sketch of what I
plan to try is at the end of this message.)
I brought up a new cluster. All PGs are 'active+clean' and all 3 OSDs are
UP/IN.
[root@essperf3 Ceph]# ceph -s
cluster 32c48975-bb57-47f6-8138-e152452e3bbe
health HEALTH_OK
monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
osdmap e8: 3 osds: 3 up, 3 in
pgmap v13: 192 pgs, 3 pools, 0 bytes data, 0 objects
10106 MB used, 1148 GB / 1158 GB avail
192 active+clean
[root@essperf3 Ceph]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.13    root default
-2      0.45            host ess51
0       0.45                    osd.0   up      1
-3      0.23            host ess52
1       0.23                    osd.1   up      1
-4      0.45            host ess59
2       0.45                    osd.2   up      1
[root@essperf3 Ceph]#
Next I created a test pool and a 1 GB RBD image and listed it:
[root@essperf3 Ceph]# ceph osd pool create testpool 75 75
pool 'testpool' created
[root@essperf3 Ceph]# ceph osd lspools
0 data,1 metadata,2 rbd,3 testpool,
[root@essperf3 Ceph]# rbd create testimage --size 1024 --pool testpool
[root@essperf3 Ceph]# rbd ls testpool
testimage
[root@essperf3 Ceph]#
When I look at the 'rbd info' output, I start seeing problems:
[root@essperf3 Ceph]# rbd --image testimage info
rbd: error opening image testimage: (2) No such file or directory
2014-08-04 18:39:33.602263 7fc4b9e80760 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
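(Side note: 'rbd info' defaults to the 'rbd' pool when --pool isn't given, so
at least part of this error may just be me querying the wrong pool. What I
probably should have run, with the pool named explicitly and the same
pool/image names as above:
[root@essperf3 Ceph]# rbd --pool testpool --image testimage info
That said, the OSD problem below seems real either way.)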
[root@essperf3 Ceph]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
693G 683G 10073M 1.42
POOLS:
NAME ID USED %USED OBJECTS
data 0 0 0 0
metadata 1 0 0 0
rbd 2 0 0 0
testpool 3 137 0 2
[root@essperf3 Ceph]# ceph -s
cluster 32c48975-bb57-47f6-8138-e152452e3bbe
health HEALTH_WARN 267 pgs degraded; 100 pgs stuck unclean; recovery 2/6 objects degraded (33.333%)
monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
osdmap e21: 3 osds: 2 up, 2 in
pgmap v48: 267 pgs, 4 pools, 137 bytes data, 2 objects
10073 MB used, 683 GB / 693 GB avail
2/6 objects degraded (33.333%)
267 active+degraded
client io 17 B/s rd, 0 op/s
[root@essperf3 Ceph]#
Check to see which OSD is down:
[root@essperf3 Ceph]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.13    root default
-2      0.45            host ess51
0       0.45                    osd.0   up      1
-3      0.23            host ess52
1       0.23                    osd.1   up      1
-4      0.45            host ess59
2       0.45                    osd.2   down    0
[root@essperf3 Ceph]#
Then I go to the shell on ess59 and restart the OSD. (This is where it gets
rather odd.) My ceph.conf has
debug osd = 20
debug ms = 1
so I expect to see output from '/etc/init.d/ceph restart osd', but I see
nothing. With a little digging I find that the /var/lib/ceph/osd/ceph-2/
directory is EMPTY. There is no ceph-osd daemon. It's almost like I did an
'rm -rf' on that directory from the shell of ess59/osd.2, yet all commands
have been executed on the monitor. (See the checks I plan to run after the
session below.)
[root@ess59 ceph]# ip addr | grep .59
inet 10.10.40.59/24 brd 10.10.40.255 scope global em1
inet6 fe80::92b1:1cff:fe18:659f/64 scope link
inet 209.243.160.59/24 brd 209.243.160.255 scope global em2
inet 10.10.50.59/24 brd 10.10.50.255 scope global p6p2
[root@ess59 ceph]# ll /var/lib/ceph/osd/
total 4
drwxr-xr-x 2 root root 4096 Aug 4 14:46 ceph-2
[root@ess59 ceph]# ll /var/lib/ceph/
total 24
drwxr-xr-x 2 root root 4096 Jul 29 18:36 bootstrap-mds
drwxr-xr-x 2 root root 4096 Aug 4 14:23 bootstrap-osd
drwxr-xr-x 2 root root 4096 Jul 29 18:36 mds
drwxr-xr-x 2 root root 4096 Jul 29 18:36 mon
drwxr-xr-x 3 root root 4096 Aug 4 14:46 osd
drwxr-xr-x 2 root root 4096 Aug 4 18:14 tmp
[root@ess59 ceph]# ll /var/lib/ceph/osd/ceph-2/
total 0
[root@ess59 ceph]#
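My first guess is that the data partition behind ceph-2 simply lost its mount
(an empty mountpoint would look exactly like this), so before assuming the
data is gone I plan to verify. A sketch of the non-destructive checks:
[root@ess59 ceph]# mount | grep ceph-2
[root@ess59 ceph]# df -h /var/lib/ceph/osd/ceph-2
[root@ess59 ceph]# lsblk
If lsblk shows the OSD's data partition with no mountpoint, the data may
still be intact on disk.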
Looking at the monitor logs, I see osd.2 boot and even see where osd.2 leaves
the cluster, but how did I lose the daemon?
How do I recover/repair the OSD without having to reinstall the cluster ....
again?
2014-08-04 14:47:10.008426 mon.0 [INF] pgmap v13: 192 pgs: 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 14:49:08.854988 mon.0 [INF] pgmap v14: 192 pgs: 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:38:55.529118 mon.0 [INF] osdmap e9: 3 osds: 3 up, 3 in
2014-08-04 16:38:55.588920 mon.0 [INF] pgmap v15: 267 pgs: 75 creating, 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:38:56.674507 mon.0 [INF] osdmap e10: 3 osds: 3 up, 3 in
2014-08-04 16:38:56.707256 mon.0 [INF] pgmap v16: 267 pgs: 75 creating, 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:01.182508 mon.0 [INF] pgmap v17: 267 pgs: 56 creating, 2 peering, 209 active+clean; 0 bytes data, 10107 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:02.265569 mon.0 [INF] pgmap v18: 267 pgs: 2 inactive, 20 active, 6 peering, 239 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:06.371070 mon.0 [INF] pgmap v19: 267 pgs: 2 inactive, 20 active, 4 peering, 241 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:07.484259 mon.0 [INF] pgmap v20: 267 pgs: 267 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:41:06.227435 mon.0 [INF] pgmap v21: 267 pgs: 267 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:01.178851 mon.0 [INF] osd.2 209.243.160.59:6800/21186 failed (3 reports from 2 peers after 24.931114 >= grace 20.000000)
2014-08-04 16:48:01.320953 mon.0 [INF] osdmap e11: 3 osds: 2 up, 3 in
2014-08-04 16:48:01.355520 mon.0 [INF] pgmap v22: 267 pgs: 100 stale+active+clean, 167 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:02.465783 mon.0 [INF] osdmap e12: 3 osds: 2 up, 3 in
2014-08-04 16:48:02.498833 mon.0 [INF] pgmap v23: 267 pgs: 100 stale+active+clean, 167 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:07.279702 mon.0 [INF] pgmap v24: 267 pgs: 71 stale+active+clean, 90 active+degraded, 106 active+clean; 0 bytes data, 10109 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:08.352741 mon.0 [INF] pgmap v25: 267 pgs: 267 active+degraded; 0 bytes data, 10110 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:22.268630 mon.0 [INF] pgmap v26: 267 pgs: 267 active+degraded; 112 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 68 B/s wr, 0 op/s; 2/3 objects degraded (66.667%)
2014-08-04 16:48:23.389449 mon.0 [INF] pgmap v27: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 0 B/s rd, 135 B/s wr, 0 op/s; 4/6 objects degraded (66.667%)
2014-08-04 16:50:22.290200 mon.0 [INF] pgmap v28: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 4/6 objects degraded (66.667%)
2014-08-04 16:50:23.352788 mon.0 [INF] pgmap v29: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 4/6 objects degraded (66.667%)
2014-08-04 16:53:02.014805 mon.0 [INF] osd.2 out (down for 300.695457)
2014-08-04 16:53:02.119534 mon.0 [INF] osdmap e13: 3 osds: 2 up, 2 in
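In case it helps frame an answer, here is what I intend to try, in order.
This is only a sketch: the device name /dev/sdb1 is a guess for osd.2's data
partition, and step 2 assumes the data really is gone.
Step 1, remount and restart (on ess59, then mark the OSD in from the monitor):
[root@ess59 ceph]# mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
[root@ess59 ceph]# /etc/init.d/ceph start osd.2
[root@essperf3 Ceph]# ceph osd in osd.2
Step 2, if the data is truly lost, remove and re-create just this one OSD
rather than reinstalling the whole cluster:
[root@essperf3 Ceph]# ceph osd crush remove osd.2
[root@essperf3 Ceph]# ceph auth del osd.2
[root@essperf3 Ceph]# ceph osd rm osd.2
[root@essperf3 Ceph]# ceph-deploy osd prepare ess59:/dev/sdb1
[root@essperf3 Ceph]# ceph-deploy osd activate ess59:/dev/sdb1
Does that sound right, or is there a better way?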