Hi all,
We are running the Nautilus cluster. Today due to UPS work, we shut
down the whole cluster.
After we started the cluster, many OSDs went down, and they appear to be
doing the heartbeat_check over the public network. For example, we
see the following logs:
---
2023-05-16 19:35:29.254 7efcd4ce7700 -1 osd.101 42916 heartbeat_check:
no reply from 131.174.45.223:6825 osd.185 ever on either front or back,
first ping sent 2023-05-16 19:34:48.593701 (oldest deadline 2023-05-16
19:35:08.593701)
---
I expected the heartbeat to go through the cluster network, i.e.
instead of 131.174.45.223, it should use 172.20.128.223.
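If it helps, I would expect the relevant ceph.conf settings to look
something like the following (the subnet values here are my guesses
based on the two addresses above, not copied from our actual config):
---
[global]
  public_network  = 131.174.45.0/24
  cluster_network = 172.20.128.0/24
---
My understanding is that with cluster_network set, the OSD back-side
heartbeats should use the 172.20.128.x addresses.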
In fact, when we started up the cluster, we had no DNS available to
resolve the IP addresses, and for a short while all OSDs were located
under a new host called "localhost.localdomain". At that point, I fixed
it by setting the static hostname using `hostnamectl set-hostname
xxx`.
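To avoid depending on DNS during startup next time, I am considering
pinning the OSD hosts in /etc/hosts on every node, along these lines
(the hostname and addresses below are illustrative, not our real ones):
---
# /etc/hosts on each Ceph node -- static entries so name resolution
# works before DNS is up (example entries)
172.20.128.223  ceph-osd23.cluster  ceph-osd23
131.174.45.223  ceph-osd23.public
---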
Now we cannot bring the cluster back to a healthy state. We are stuck
at:
---
  cluster:
    id:     86c9bc85-b7f3-49a1-9e1f-8c9f2b31fca8
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            insufficient standby MDS daemons available
            pauserd,pausewr,noout,nobackfill,norebalance,norecover flag(s) set
            88 osds down
            Reduced data availability: 2544 pgs inactive, 2369 pgs down, 159 pgs peering, 294 pgs stale
            Degraded data redundancy: 870424/2714593746 objects degraded (0.032%), 30 pgs degraded, 9 pgs undersized
            8631 slow ops, oldest one blocked for 803 sec, mon.ceph-mon01 has slow ops

  services:
    mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03 (age 2h)
    mgr: ceph-mon02(active, since 2h), standbys: ceph-mon01, ceph-mon03
    mds: cephfs:0/1, 1 failed
    osd: 191 osds: 103 up (since 1.46827s), 191 in (since 4w)
         flags pauserd,pausewr,noout,nobackfill,norebalance,norecover

  data:
    pools:   2 pools, 2560 pgs
    objects: 456.13M objects, 657 TiB
    usage:   1.1 PiB used, 638 TiB / 1.7 PiB avail
    pgs:     100.000% pgs not active
             870424/2714593746 objects degraded (0.032%)
             2087 down
             282  stale+down
             129  peering
             32   stale+peering
             30   undersized+degraded+peered
---
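Once the OSDs are back up and the PGs have peered, I assume we would
clear the recovery flags with the usual commands, e.g.:
---
ceph osd unset pause        # clears both pauserd and pausewr
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
---
But of course we first need the OSDs to heartbeat correctly before any
of that makes sense.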
Any idea how we could fix this and get the OSDs to use the cluster
network for their heartbeat checks? Any help would be highly
appreciated. Thank you very much.
Cheers, Hong
--
Hurng-Chun (Hong) Lee, PhD
ICT manager
Donders Institute for Brain, Cognition and Behaviour,
Centre for Cognitive Neuroimaging
Radboud University Nijmegen
e-mail: [email protected]
tel: +31(0)631132518
web: http://www.ru.nl/donders/
pgp: 3AC505B2B787A8ABE2C551B1362976D838ABF09E
* Mon, Tue and Thu at Trigon; Wed and Fri working from home
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]