To answer some of my own questions:

1) Setting

ceph osd set noout
ceph osd set nodown
ceph osd set norebalance

before restart/re-deployment did no harm. I don't know if it helped, because I 
didn't retry the procedure that led to OSDs going down. See also point 3 below.
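
For completeness: once the down/out OSDs are back up and the cluster has settled, 
the flags need to be cleared again, otherwise down/out marking and rebalancing 
remain suspended:

ceph osd unset noout
ceph osd unset nodown
ceph osd unset norebalance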

2) A peculiarity of this specific deployment of 2 OSDs was that it was a mix of 
new OSD deployment and a restart of existing OSDs after a reboot. I'm working on 
getting this sorted out, but that is a different story. For anyone who finds 
themselves in a situation where some OSDs are temporarily down/out with PGs 
remapped and objects degraded for whatever reason while new OSDs come up, the way 
to have ceph rescan the down/out OSDs after they come up is the following (a 
command sketch follows the list):

- "ceph osd crush move" the new OSDs temporarily to a location outside the 
crush sub tree covering any pools (I have such a parking space in the crush 
hierarchy for easy draining and parking disks)
- bring up the down/out OSDs
- at this point, the cluster will fall back to the original crush map that was 
in place when the OSDs went down/out
- the cluster will now find all shards that went orphan and health will be 
restored very quickly
- once the cluster is healthy, "ceph osd crush move" the new OSDs back to their 
desired location
- now you will see remapped PGs/misplaced objects, but no degraded objects
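
As a rough sketch of the crush moves involved (the "parking" root and the 
OSD/host names below are just placeholders for my setup, adjust to your own 
hierarchy):

# park the new OSDs outside the crush sub-tree used by any pool
ceph osd crush move osd.120 root=parking
ceph osd crush move osd.121 root=parking

# after the down/out OSDs are back up and the cluster is healthy again,
# move the new OSDs back to their intended location
ceph osd crush move osd.120 root=default host=ceph-05
ceph osd crush move osd.121 root=default host=ceph-05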

3) I still don't have an answer as to why the long heartbeat ping times were 
observed. There seems to be a more serious issue behind this, which will be 
followed up in its own thread, "Cluster outage due to client IO", to be opened 
soon.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <[email protected]>
Sent: 25 April 2020 15:34:25
To: ceph-users
Subject: [ceph-users] Data loss by adding 2OSD causing Long heartbeat ping times

Dear all,

Two days ago I added a few disks to a ceph cluster and ran into a problem I had 
never seen before when doing that. The entire cluster was deployed with 
mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I added 
OSDs under 13.2.8.

I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one 
that needed 1. The procedure was as usual:

ceph osd set norebalance
deploy additional OSD
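
The deployment step itself is not the point here; just as an illustration, with 
ceph-volume it would be something like the following (the device path is only an 
example):

ceph-volume lvm create --bluestore --data /dev/sdX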

The OSD came up and PGs started peering, so far so good. To my surprise, 
however, I started seeing health-warnings about slow ping times:

Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec

After peering it looked like things were getting better, and I waited until the 
messages were gone. This took a really long time, at least 5-10 minutes.

I went on to the next host and deployed 2 new OSDs this time. Same as above, 
but with much worse consequences. Apparently, the ping times exceeded a timeout 
for a very short moment and an OSD was marked out for ca. 2 seconds. Now all 
hell broke loose. I got health errors with the dreaded "backfill_toofull", 
undersized PGs and a large number of degraded objects. I don't know what was 
causing what, but I ended up with data loss just by adding 2 disks.

We have dedicated network hardware and each of the OSD hosts has 20GBit front 
and 40GBit back network capacity (LACP trunking).  There are currently no more 
than 16 disks per server. The disks were added to an SSD pool. There was no 
traffic nor any other exceptional load on the system. I have ganglia resource 
monitoring on all nodes and cannot see a single curve going up. Network, CPU 
utilisation, load, everything below measurement accuracy. The hosts and network 
are quite overpowered and dimensioned to host many more OSDs (in future 
expansions).

I have three questions, ordered by how urgently I need an answer:

1) I need to add more disks next week and need a workaround. Will something 
like this help avoid the heartbeat time-out:

ceph osd set noout
ceph osd set nodown
ceph osd set norebalance

2) The "lost" shards of the degraded objects were obviously still on the 
cluster somewhere. Is there any way to force the cluster to rescan OSDs for the 
shards that went orphan during the incident?

3) This smells a bit like a bug that requires attention. I was probably just 
lucky that I only lost 1 shard per PG. Has something similar been reported 
before? Is this fixed in 13.2.10? Is it something new? Are there any settings 
that need to be looked at? If logs need to be collected, I can do so during my 
next attempt. However, I cannot risk the data integrity of a production cluster 
and will, therefore, probably not run the original procedure again.

Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]