Hi,
I have a sad Ceph cluster.
All my OSDs complain about failed heartbeat replies, like so:
osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
ever on either front or back, first ping sent 2019-01-16
22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
.. I've sanity-checked the network as best I can: all Ceph ports are
open between nodes on both the public network and the cluster network,
and I have no problems sending traffic back and forth between nodes.
I've also tried tcpdump'ing, and traffic passes in both directions
between the nodes, but since I don't natively speak the Ceph wire
protocol I can't figure out what's going wrong in the heartbeat
conversation.
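For reference, my checks were along these lines — pulling the peer
address out of the heartbeat_check message and probing/capturing on it
(the log line is from my own logs; the nc/tcpdump invocations are just
what I ran, adjust to taste):

```shell
# One of the heartbeat_check complaints from my OSD logs:
line='osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42 ever on either front or back'

# Extract the unreachable peer's address and heartbeat port from the message.
addr=$(echo "$line" | sed -n 's/.*no reply from \([0-9.]*\):[0-9]*.*/\1/p')
port=$(echo "$line" | sed -n 's/.*no reply from [0-9.]*:\([0-9]*\).*/\1/p')

# The probes I then ran against that peer:
echo "probe:   nc -vz $addr $port"
echo "capture: tcpdump -ni any host $addr and port $port"
```

Both the nc probe and the capture look healthy, which is why I'm stuck.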
Still:
# ceph health detail
HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
pgs inactive, 1072 pgs peering
OSDMAP_FLAGS nodown,noout flag(s) set
PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
pg 7.3cd is stuck inactive for 245901.560813, current state
creating+peering, last acting [13,41,1]
pg 7.3ce is stuck peering for 245901.560813, current state
creating+peering, last acting [1,40,7]
pg 7.3cf is stuck peering for 245901.560813, current state
creating+peering, last acting [0,42,9]
pg 7.3d0 is stuck peering for 245901.560813, current state
creating+peering, last acting [20,8,38]
pg 7.3d1 is stuck peering for 245901.560813, current state
creating+peering, last acting [10,20,42]
(....)
I've set "noout" and "nodown" to prevent all the OSDs from being marked
down and removed from the cluster. They are all running and marked "up".
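(For completeness, the flags were set with the standard commands, and
can be cleared the same way once this is sorted:)

```shell
# Set the flags so OSDs aren't marked down/out while debugging:
ceph osd set noout
ceph osd set nodown

# To clear them again later:
#   ceph osd unset noout
#   ceph osd unset nodown
```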
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 249.73434 root default
-25 166.48956 datacenter m1
-24 83.24478 pod kube1
-35 41.62239 rack 10
-34 41.62239 host ceph-sto-p102
40 hdd 7.27689 osd.40 up 1.00000 1.00000
41 hdd 7.27689 osd.41 up 1.00000 1.00000
42 hdd 7.27689 osd.42 up 1.00000 1.00000
(....)
I'm at a point where I no longer know which options or logs to check.
Any debugging hint would be very much appreciated.
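If it helps, I can crank up debug logging on an affected OSD and post
the output — something along these lines (debug_ms for the messenger,
debug_osd for the heartbeat logic; I'd revert it afterwards since the
logs grow quickly):

```shell
# Raise messenger and OSD debug levels on one daemon, watch the log,
# then put the levels back down:
ceph tell osd.10 injectargs '--debug-ms 1 --debug-osd 20'
tail -f /var/log/ceph/ceph-osd.10.log

# Revert:
#   ceph tell osd.10 injectargs '--debug-ms 0 --debug-osd 1'
```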
Btw, I have no important data in the cluster (yet), so if the solution
is to drop all the OSDs and recreate them, that's OK for now. But I'd
really like to understand how the cluster ended up in this state.
/Johan
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com