Hi List!

We are running Squid 19.2.2 in a cephadm/Docker based deployment (recently upgraded from Pacific 16.2.15). Every OSD host uses 8 SAS drives for Block and 2 SSD drives for DB.


Problem:

One of the SAS Block drives failed on an OSD host and we need to replace it.
When our Ceph cluster was running Pacific, we usually performed drive replacement with these steps (rough CLI equivalents are sketched after the list):

1) Edit -> MARK osd.xx as OUT [re-balancing starts, wait until it is completed]
2) Edit -> MARK osd.xx as DOWN
3) Edit -> DELETE osd.xx [put check on "Preserve OSD ID" and "Yes, I am sure"]
4) Edit -> DESTROY osd.xx
5) Edit -> PURGE osd.xx [re-balancing starts, wait until it is completed]
6) Set "noout" and "norebalance" flags. Put OSD Host in maintenance mode. Shutdown OSD Host. Replace failed drive. Start OSD Host. 7) Wipe old DB Logical Volume (LV) [dd if=/dev/zero of=/dev/ceph-xxx/osd-db-xxx bs=1M count=10 conv=fsync]. 8) Wipe new Block disk. Destroy old DB LV. Wait for automatic discovery and creation of new osd.xx instance.

In Pacific 16.2.15, after executing the PURGE command, Ceph simply removed the old osd.xx instance from the cluster without deleting or zapping the DB and Block LVs.

Now, in Squid 19.2.2, we see that Ceph behaves differently.
Executing step 3 (Edit -> DELETE osd.xx) automatically runs DESTROY and PURGE, and after that Ceph automatically zaps and deletes the DB LV and the Block LV! Its automatic discovery then finds a "clean" SAS disk plus free space on the SSD drive and happily creates a new osd.xx instance on the very failed drive we need to replace :)
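
As far as we understand from the cephadm documentation, the CLI replacement path is "ceph orch osd rm" with --replace, and devices are only zapped when --zap is passed explicitly, so is the dashboard now always adding --zap? In other words, would something like this avoid the zapping (untested, just our reading of the docs, with osd.33 as the example):

ceph orch osd rm 33 --replace      # keep the OSD ID, mark the OSD "destroyed"
ceph orch osd rm status            # watch the removal progress
# no --zap here, so the DB and Block LVs should be left untouched (our assumption)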


Questions:

1) What is the correct procedure to replace a failed Block drive in Ceph Squid 19.2.2?
2) Is it possible to disable zapping?
3) Is it possible to temporarily disable automatic discovery of new drives for the OSD service?
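
For question 3, would temporarily setting the OSD service to unmanaged be the right approach? Something along these lines (untested on our side; the spec is the one exported below):

ceph orch ls osd osd.dashboard-admin-1633624229976 --export > osd-spec.yaml
# add "unmanaged: true" at the top level of osd-spec.yaml, then re-apply it:
ceph orch apply -i osd-spec.yaml
# ... replace the drive ...
# afterwards set "unmanaged: false" (or drop the line) and re-apply the spec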




P.S.

Here is our Placement Specification for the OSD service:

[ceph: root@ceph-mon12 /]# ceph orch ls osd osd.dashboard-admin-1633624229976 --export
service_type: osd
service_id: dashboard-admin-1633624229976
service_name: osd.dashboard-admin-1633624229976
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: true
  db_devices:
    rotational: false
  db_slots: 4
  filter_logic: AND
  objectstore: bluestore




Logs from Ceph:

12/8/25 08:54 AM [INF] Cluster is now healthy
12/8/25 08:54 AM [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 11/121226847 objects degraded (0.000%), 1 pg degraded)
12/8/25 08:54 AM [WRN] Health check update: Degraded data redundancy: 11/121226847 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
12/8/25 08:54 AM [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
12/8/25 08:54 AM [WRN] Health check failed: Degraded data redundancy: 12/121226847 objects degraded (0.000%), 2 pgs degraded (PG_DEGRADED)
12/8/25 08:54 AM [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
12/8/25 08:54 AM [INF] osd.33 [v2:10.10.10.105:6824/297218036,v1:10.10.10.105:6825/297218036] boot
12/8/25 08:54 AM [WRN] OSD bench result of 1909.305644 IOPS is not within the threshold limit range of 50.000000 IOPS and 500.000000 IOPS for osd.33. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
12/8/25 08:53 AM [INF] Deploying daemon osd.33 on ceph-osd15
12/8/25 08:53 AM [INF] Found osd claims for drivegroup dashboard-admin-1633624229976 -> {'ceph-osd15': ['33']}
12/8/25 08:53 AM [INF] Found osd claims -> {'ceph-osd15': ['33']}
12/8/25 08:53 AM [INF] Detected new or changed devices on ceph-osd15
12/8/25 08:52 AM [INF] Successfully zapped devices for osd.33 on ceph-osd15
12/8/25 08:52 AM [INF] Zapping devices for osd.33 on ceph-osd15
12/8/25 08:52 AM [INF] Successfully destroyed old osd.33 on ceph-osd15; ready for replacement
12/8/25 08:52 AM [INF] Successfully removed osd.33 on ceph-osd15
12/8/25 08:52 AM [INF] Removing key for osd.33
12/8/25 08:52 AM [INF] Removing daemon osd.33 from ceph-osd15 -- ports []





