Hi,

One way to prevent the OSD from being recreated on the faulty drive as soon
as it is zapped (without disabling the OSD service entirely) is to set the
_no_schedule label on the host with 'ceph orch host label add <hostname>
_no_schedule' and remove the label after the drive has been replaced.
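
For example, with ceph-osd15 as a placeholder hostname, the sequence would
look something like this:

  ceph orch host label add ceph-osd15 _no_schedule
  # ... remove the OSD and replace the faulty drive ...
  ceph orch host label rm ceph-osd15 _no_schedule

If I remember the documentation correctly, _no_schedule only stops cephadm
from scheduling new daemons on that host; existing OSDs are left in place,
though other daemon types may be moved to other hosts while the label is set.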

Best regards,
Frédéric.


Frédéric Nass

Senior Ceph Engineer

Ceph Ambassador, France

  +49 89 215252-751

  frederic.n...@clyso.com

  www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306




On Wed, Aug 20, 2025 at 10:42, Eugen Block <ebl...@nde.ag> wrote:

> Hi,
>
> I think I found the right place [0]:
>
> ---snip---
>              if any_replace_params:
>                  # mark destroyed in osdmap
>                  if not osd.destroy():
>                      raise orchestrator.OrchestratorError(
>                          f"Could not destroy {osd}")
>                  logger.info(
>                      f"Successfully destroyed old {osd} on {osd.hostname}; ready for replacement")
>                  if any_replace_params:
>                      osd.zap = True
> ...
>
>              if osd.zap:
>                  # throws an exception if the zap fails
>                  logger.info(f"Zapping devices for {osd} on {osd.hostname}")
>                  osd.do_zap()
> ---snip---
>
> So if the replace flag is set, Ceph will zap the device(s). I compared
> the versions; the change was introduced between 19.2.0 and 19.2.1.
>
> On the one hand, I agree with the OP: if Ceph immediately zaps the
> drive(s), it will redeploy the destroyed OSD onto the faulty disk.
> On the other hand, if you don't let cephadm zap the drives, you'll
> need manual intervention during the actual disk replacement: the OSD
> will be purged, leaving the DB/WAL LVs behind on the disk.
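>
> In that case the cleanup would be something along these lines once the
> new disk is in place (hostname and device path below are just examples):
>
> ---snip---
> # have the orchestrator wipe the replacement data disk
> ceph orch device zap ceph-osd15 /dev/sdX --force
> # or, from a cephadm shell on the host, clean the leftover DB/WAL LV
> ceph-volume lvm zap --destroy /dev/ceph-xxx/osd-db-xxx
> ---snip---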
>
> It would be interesting to learn what led to the decision to implement
> it like this, but I also don't see an "optimal" way of doing this. I
> wonder if it could make sense to zap only the DB/WAL devices, not the
> data device, in case of a replacement. Then, when the faulty disk gets
> replaced, the orchestrator could redeploy an OSD since the data drive
> should be clean and there should be space for DB/WAL.
>
> Regards,
> Eugen
>
>
> [0]
>
> https://github.com/ceph/ceph/blob/v19.2.3/src/pybind/mgr/cephadm/services/osd.py#L909
>
> Zitat von Dmitrijs Demidovs <dmitrijs.demid...@carminered.eu>:
>
> > Hi List!
> >
> > We have Squid 19.2.2. It is a cephadm/Docker-based deployment
> > (recently upgraded from Pacific 16.2.15).
> > We are using 8 SAS drives for Block and 2 SSD drives for DB on every
> > OSD Host.
> >
> >
> > Problem:
> >
> > One of the SAS Block drives failed on an OSD Host and we need to replace it.
> > When our Ceph cluster was running on Pacific, we usually performed
> > drive replacement using these steps (rough CLI equivalents are
> > sketched right after the list):
> >
> > 1) Edit -> MARK osd.xx as OUT [re-balancing starts, wait until it is
> > completed]
> > 2) Edit -> MARK osd.xx as DOWN
> > 3) Edit -> DELETE osd.xx [put check on "Preserve OSD ID" and "Yes, I
> > am sure"]
> > 4) Edit -> DESTROY osd.xx
> > 5) Edit -> PURGE osd.xx [re-balancing starts, wait until it is completed]
> > 6) Set "noout" and "norebalance" flags. Put OSD Host in maintenance
> > mode. Shutdown OSD Host. Replace failed drive. Start OSD Host.
> > 7) Wipe old DB Logical Volume (LV) [dd if=/dev/zero
> > of=/dev/ceph-xxx/osd-db-xxx bs=1M count=10 conv=fsync].
> > 8) Wipe new Block disk. Destroy old DB LV. Wait for automatic
> > discovery and creation of new osd.xx instance.
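> >
> > For reference, rough CLI equivalents for most of these dashboard steps
> > (osd.33 and ceph-osd15 below are just examples taken from the logs
> > further down):
> >
> > ceph osd out 33                                  # step 1
> > ceph osd down 33                                 # step 2
> > ceph osd destroy 33 --yes-i-really-mean-it       # step 4
> > ceph osd purge 33 --yes-i-really-mean-it         # step 5
> > ceph osd set noout; ceph osd set norebalance     # step 6
> > ceph orch host maintenance enter ceph-osd15      # step 6
> > # ... replace the failed drive, start the host ...
> > ceph orch host maintenance exit ceph-osd15
> > ceph osd unset noout; ceph osd unset norebalance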
> >
> > In Pacific 16.2.15, after execution of the PURGE command, Ceph just
> > removed the old osd.xx instance from the cluster without deleting or
> > zapping the DB and Block LVs.
> >
> > Now in Squid 19.2.2 we see that Ceph behaves differently.
> > Execution of step 3 (Edit -> DELETE osd.xx) automatically executes
> > DESTROY and PURGE, and after that Ceph automatically zaps and
> > deletes the DB LV and the Block LV!
> > After that, its automatic discovery finds a "clean" SAS disk plus
> > free space on the SSD drive and happily forms a new osd.xx instance
> > from the failed drive that we need to replace :)
> >
> >
> > Questions:
> >
> > 1) What is the correct procedure to replace a failed Block drive in
> > Ceph Squid 19.2.2?
> > 2) Is it possible to disable zapping?
> > 3) Is it possible to temporarily disable automatic discovery of new
> > drives for the OSD service?
> >
> >
> >
> >
> > P.S.
> >
> > Here is our placement specification for the OSD service:
> >
> > [ceph: root@ceph-mon12 /]# ceph orch ls osd
> > osd.dashboard-admin-1633624229976 --export
> > service_type: osd
> > service_id: dashboard-admin-1633624229976
> > service_name: osd.dashboard-admin-1633624229976
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     rotational: true
> >   db_devices:
> >     rotational: false
> >   db_slots: 4
> >   filter_logic: AND
> >   objectstore: bluestore
> >
> >
> >
> >
> > Logs from Ceph:
> >
> > 12/8/25 08:54 AM [INF] Cluster is now healthy
> > 12/8/25 08:54 AM [INF] Health check cleared: PG_DEGRADED (was:
> > Degraded data redundancy: 11/121226847 objects degraded (0.000%), 1
> > pg degraded)
> > 12/8/25 08:54 AM [WRN] Health check update: Degraded data
> > redundancy: 11/121226847 objects degraded (0.000%), 1 pg degraded
> > (PG_DEGRADED)
> > 12/8/25 08:54 AM [INF] Health check cleared: PG_AVAILABILITY (was:
> > Reduced data availability: 1 pg peering)
> > 12/8/25 08:54 AM [WRN] Health check failed: Degraded data
> > redundancy: 12/121226847 objects degraded (0.000%), 2 pgs degraded
> > (PG_DEGRADED)
> > 12/8/25 08:54 AM [WRN] Health check failed: Reduced data
> > availability: 1 pg peering (PG_AVAILABILITY)
> > 12/8/25 08:54 AM [INF] osd.33
> > [v2:10.10.10.105:6824/297218036,v1:10.10.10.105:6825/297218036] boot
> > 12/8/25 08:54 AM [WRN] OSD bench result of 1909.305644 IOPS is not
> > within the threshold limit range of 50.000000 IOPS and 500.000000
> > IOPS for osd.33. IOPS capacity is unchanged at 315.000000 IOPS. The
> > recommendation is to establish the osd's IOPS capacity using other
> > benchmark tools (e.g. Fio) and then override
> > osd_mclock_max_capacity_iops_[hdd|ssd].
> > 12/8/25 08:53 AM [INF] Deploying daemon osd.33 on ceph-osd15
> > 12/8/25 08:53 AM [INF] Found osd claims for drivegroup
> > dashboard-admin-1633624229976 -> {'ceph-osd15': ['33']}
> > 12/8/25 08:53 AM [INF] Found osd claims -> {'ceph-osd15': ['33']}
> > 12/8/25 08:53 AM [INF] Detected new or changed devices on ceph-osd15
> > 12/8/25 08:52 AM [INF] Successfully zapped devices for osd.33 on
> ceph-osd15
> > 12/8/25 08:52 AM [INF] Zapping devices for osd.33 on ceph-osd15
> > 12/8/25 08:52 AM [INF] Successfully destroyed old osd.33 on
> > ceph-osd15; ready for replacement
> > 12/8/25 08:52 AM [INF] Successfully removed osd.33 on ceph-osd15
> > 12/8/25 08:52 AM [INF] Removing key for osd.33
> > 12/8/25 08:52 AM [INF] Removing daemon osd.33 from ceph-osd15 -- ports []
> >
> >
> >
> >
> >
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
