On 16-12-22 13:26, Stéphane Klein wrote:
Hi,
I have:
* 3 mon
* 3 osd
When I shut down one OSD, it works great:
    cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
     health HEALTH_WARN
            43 pgs degraded
            43 pgs stuck unclean
            43 pgs undersized
            recovery 24/70 objects degraded (34.286%)
            too few PGs per OSD (28 < min 30)
            1/3 in osds are down
     monmap e1: 3 mons at {ceph-mon-1=172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0}
            election epoch 10, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
     osdmap e22: 3 osds: 2 up, 3 in; 43 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v169: 64 pgs, 1 pools, 77443 kB data, 35 objects
            252 MB used, 1484 GB / 1484 GB avail
            24/70 objects degraded (34.286%)
                  43 active+undersized+degraded
                  21 active+clean
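(For reference, if I read the numbers right: the 70 in "24/70 objects degraded" is 35 objects x pool size 2 = 70 replica copies, 24 of which were on the stopped OSD.)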
But when I shut down 2 OSDs, the Ceph cluster doesn't see that the
second OSD node is down :(
root@ceph-mon-1:/home/vagrant# ceph status
    cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
     health HEALTH_WARN
            clock skew detected on mon.ceph-mon-2
            pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {ceph-mon-1=172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0}
            election epoch 10, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
     osdmap e26: 3 osds: 2 up, 2 in
            flags pauserd,pausewr,sortbitwise,require_jewel_osds
      pgmap v203: 64 pgs, 1 pools, 77443 kB data, 35 objects
            219 MB used, 989 GB / 989 GB avail
                  64 active+clean
2 OSDs up! Why?
root@ceph-mon-1:/home/vagrant# ping ceph-osd-1 -c1
--- ceph-osd-1 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
root@ceph-mon-1:/home/vagrant# ping ceph-osd-2 -c1
--- ceph-osd-2 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
root@ceph-mon-1:/home/vagrant# ping ceph-osd-3 -c1
--- ceph-osd-3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.278/0.278/0.278/0.000 ms
My configuration:
ceph_conf_overrides:
  global:
    osd_pool_default_size: 2
    osd_pool_default_min_size: 1
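With these settings each PG keeps two replicas and stays writable with
one. A quick way to verify what the pool actually got (assuming the
stock Jewel default pool named "rbd"):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size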
Full Ansible configuration is here:
https://github.com/harobed/poc-ceph-ansible/blob/master/vagrant-3mons-3osd/hosts/group_vars/all.yml#L11
What is my mistake? Or is it a Ceph bug?
Try waiting a little longer. The mon needs multiple down reports before
it marks an OSD down, and since your cluster is very small there is only
a small number of OSDs (1 in this case) left to report that the others
are down.
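If you want the mon to react faster in a cluster this small, the knob to
look at is mon_osd_min_down_reporters (default 2 in Jewel). A sketch,
not something I'd keep in production (admin socket path per a stock
install):

    # show the current value on one mon
    ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon-1.asok config get mon_osd_min_down_reporters
    # lower it at runtime on all mons (lost on restart unless also set in ceph.conf)
    ceph tell mon.* injectargs '--mon_osd_min_down_reporters 1'

Even with no peer OSDs left to report, the mon will mark a silent OSD
down on its own once mon_osd_report_timeout expires (900s by default),
so the second OSD should drop out of the up count eventually.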
Best regards,
Stéphane
--
Stéphane Klein <[email protected]>
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com