Hi,

One way to prevent the OSD from being recreated on the faulty drive as soon
as it is zapped (without disabling the OSD service entirely) is to set the
_no_schedule label on the host with 'ceph orch host label add <hostname>
_no_schedule' and remove the label after the drive has been replaced.
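
For example, with ceph-osd15 as a placeholder hostname, the sequence would
look something like this:

  ceph orch host label add ceph-osd15 _no_schedule
  # ... remove the OSD and replace the faulty drive ...
  ceph orch host label rm ceph-osd15 _no_schedule

If I remember the documentation correctly, _no_schedule only stops cephadm
from scheduling new daemons on that host; existing OSDs are left in place,
though other daemon types may be moved to other hosts while the label is set.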

Best regards,
Frédéric.


Frédéric Nass

Senior Ceph Engineer

Ceph Ambassador, France

  +49 89 215252-751

  frederic.n...@clyso.com

  www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306




On Wed, Aug 20, 2025 at 10:42, Eugen Block <ebl...@nde.ag> wrote:

> Hi,
>
> I think I found the right place [0]:
>
> ---snip---
>              if any_replace_params:
>                  # mark destroyed in osdmap
>                  if not osd.destroy():
>                      raise orchestrator.OrchestratorError(
>                          f"Could not destroy {osd}")
>                  logger.info(
>                      f"Successfully destroyed old {osd} on {osd.hostname}; ready for replacement")
>                  if any_replace_params:
>                      osd.zap = True
> ...
>
>              if osd.zap:
>                  # throws an exception if the zap fails
>                  logger.info(f"Zapping devices for {osd} on {osd.hostname}")
>                  osd.do_zap()
> ---snip---
>
> So if the replace flag is set, Ceph will zap the device(s). I compared
> the versions; the change was introduced between 19.2.0 and 19.2.1.
>
> On the one hand, I agree with the OP: if Ceph immediately zaps the
> drive(s), it will redeploy the destroyed OSD onto the faulty disk.
> On the other hand, if you don't let cephadm zap the drives, you'll
> need manual intervention during the actual disk replacement: the OSD
> will be purged, leaving the DB/WAL LVs behind on the disk.
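>
> In that case the cleanup would be something along these lines once the
> new disk is in place (hostname and device path below are just examples):
>
> ---snip---
> # have the orchestrator wipe the replacement data disk
> ceph orch device zap ceph-osd15 /dev/sdX --force
> # or, from a cephadm shell on the host, clean the leftover DB/WAL LV
> ceph-volume lvm zap --destroy /dev/ceph-xxx/osd-db-xxx
> ---snip---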
>
> It would be interesting to learn what led to the decision to implement
> it like this, but I also don't see an "optimal" way of doing this. I
> wonder if it could make sense to zap only the DB/WAL devices, not the
> data device, in case of a replacement. Then, when the faulty disk gets
> replaced, the orchestrator could redeploy an OSD since the data drive
> should be clean and there should be space for DB/WAL.
>
> Regards,
> Eugen
>
>
> [0]
>
> https://github.com/ceph/ceph/blob/v19.2.3/src/pybind/mgr/cephadm/services/osd.py#L909
>
> Zitat von Dmitrijs Demidovs <dmitrijs.demid...@carminered.eu>:
>
> > Hi List!
> >
> > We have Squid 19.2.2. It is a cephadm/Docker-based deployment
> > (recently upgraded from Pacific 16.2.15).
> > We are using 8 SAS drives for Block and 2 SSD drives for DB on every
> > OSD Host.
> >
> >
> > Problem:
> >
> > One of the SAS Block drives failed on an OSD Host and we need to replace it.
> > When our Ceph cluster was running on Pacific, we usually performed
> > drive replacement using these steps (rough CLI equivalents are
> > sketched right after the list):
> >
> > 1) Edit -> MARK osd.xx as OUT [re-balancing starts, wait until it is
> > completed]
> > 2) Edit -> MARK osd.xx as DOWN
> > 3) Edit -> DELETE osd.xx [put check on "Preserve OSD ID" and "Yes, I
> > am sure"]
> > 4) Edit -> DESTROY osd.xx
> > 5) Edit -> PURGE osd.xx [re-balancing starts, wait until it is completed]
> > 6) Set "noout" and "norebalance" flags. Put OSD Host in maintenance
> > mode. Shutdown OSD Host. Replace failed drive. Start OSD Host.
> > 7) Wipe old DB Logical Volume (LV) [dd if=/dev/zero
> > of=/dev/ceph-xxx/osd-db-xxx bs=1M count=10 conv=fsync].
> > 8) Wipe new Block disk. Destroy old DB LV. Wait for automatic
> > discovery and creation of new osd.xx instance.
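> >
> > For reference, rough CLI equivalents for most of these dashboard steps
> > (osd.33 and ceph-osd15 below are just examples taken from the logs
> > further down):
> >
> > ceph osd out 33                                  # step 1
> > ceph osd down 33                                 # step 2
> > ceph osd destroy 33 --yes-i-really-mean-it       # step 4
> > ceph osd purge 33 --yes-i-really-mean-it         # step 5
> > ceph osd set noout; ceph osd set norebalance     # step 6
> > ceph orch host maintenance enter ceph-osd15      # step 6
> > # ... replace the failed drive, start the host ...
> > ceph orch host maintenance exit ceph-osd15
> > ceph osd unset noout; ceph osd unset norebalance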
> >
> > In Pacific 16.2.15, after execution of the PURGE command, Ceph just
> > removed the old osd.xx instance from the cluster without deleting or
> > zapping the DB and Block LVs.
> >
> > Now in Squid 19.2.2 we see that Ceph behaves differently.
> > Execution of step 3 (Edit -> DELETE osd.xx) automatically executes
> > DESTROY and PURGE, and after that Ceph automatically zaps and
> > deletes the DB LV and the Block LV!
> > After that, its automatic discovery finds a "clean" SAS disk plus
> > free space on the SSD drive and happily forms a new osd.xx instance
> > from the failed drive that we need to replace :)
> >
> >
> > Questions:
> >
> > 1) What is the correct procedure to replace a failed Block drive in
> > Ceph Squid 19.2.2?
> > 2) Is it possible to disable zapping?
> > 3) Is it possible to temporarily disable automatic discovery of new
> > drives for the OSD service?
> >
> >
> >
> >
> > P.S.
> >
> > Here is our placement specification for the OSD service:
> >
> > [ceph: root@ceph-mon12 /]# ceph orch ls osd
> > osd.dashboard-admin-1633624229976 --export
> > service_type: osd
> > service_id: dashboard-admin-1633624229976
> > service_name: osd.dashboard-admin-1633624229976
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     rotational: true
> >   db_devices:
> >     rotational: false
> >   db_slots: 4
> >   filter_logic: AND
> >   objectstore: bluestore
> >
> >
> >
> >
> > Logs from Ceph:
> >
> > 12/8/25 08:54 AM [INF] Cluster is now healthy
> > 12/8/25 08:54 AM [INF] Health check cleared: PG_DEGRADED (was:
> > Degraded data redundancy: 11/121226847 objects degraded (0.000%), 1
> > pg degraded)
> > 12/8/25 08:54 AM [WRN] Health check update: Degraded data
> > redundancy: 11/121226847 objects degraded (0.000%), 1 pg degraded
> > (PG_DEGRADED)
> > 12/8/25 08:54 AM [INF] Health check cleared: PG_AVAILABILITY (was:
> > Reduced data availability: 1 pg peering)
> > 12/8/25 08:54 AM [WRN] Health check failed: Degraded data
> > redundancy: 12/121226847 objects degraded (0.000%), 2 pgs degraded
> > (PG_DEGRADED)
> > 12/8/25 08:54 AM [WRN] Health check failed: Reduced data
> > availability: 1 pg peering (PG_AVAILABILITY)
> > 12/8/25 08:54 AM [INF] osd.33
> > [v2:10.10.10.105:6824/297218036,v1:10.10.10.105:6825/297218036] boot
> > 12/8/25 08:54 AM [WRN] OSD bench result of 1909.305644 IOPS is not
> > within the threshold limit range of 50.000000 IOPS and 500.000000
> > IOPS for osd.33. IOPS capacity is unchanged at 315.000000 IOPS. The
> > recommendation is to establish the osd's IOPS capacity using other
> > benchmark tools (e.g. Fio) and then override
> > osd_mclock_max_capacity_iops_[hdd|ssd].
> > 12/8/25 08:53 AM [INF] Deploying daemon osd.33 on ceph-osd15
> > 12/8/25 08:53 AM [INF] Found osd claims for drivegroup
> > dashboard-admin-1633624229976 -> {'ceph-osd15': ['33']}
> > 12/8/25 08:53 AM [INF] Found osd claims -> {'ceph-osd15': ['33']}
> > 12/8/25 08:53 AM [INF] Detected new or changed devices on ceph-osd15
> > 12/8/25 08:52 AM [INF] Successfully zapped devices for osd.33 on
> ceph-osd15
> > 12/8/25 08:52 AM [INF] Zapping devices for osd.33 on ceph-osd15
> > 12/8/25 08:52 AM [INF] Successfully destroyed old osd.33 on
> > ceph-osd15; ready for replacement
> > 12/8/25 08:52 AM [INF] Successfully removed osd.33 on ceph-osd15
> > 12/8/25 08:52 AM [INF] Removing key for osd.33
> > 12/8/25 08:52 AM [INF] Removing daemon osd.33 from ceph-osd15 -- ports []
> >
> >
> >
> >
> >
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
