[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-04-03 Thread Eugen Block
Hi and sorry for the delay, I was on vacation last week. :-) I just read your responses. I have no idea how to modify the default timeout for cephadm, maybe Adam or someone else can comment on that. But every time I've been watching cephadm (ceph-volume) create new OSDs, they were created sequentially, not in parallel. That would probably explain why it takes that long and eventually runs into a timeout. I can't really confirm it, but it would make sense to me if that were the reason for the "offline" hosts during the operation. Does it resolve after the process has finished?



 (Excerpt below. Is there any preferred method to provide bigger logs?).


It was just an example; I would focus on one OSD host and inspect the cephadm.log. The command I mentioned collects logs from all hosts, and it can sometimes be misleading because of missing timestamps.
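
For example, something along these lines (just a sketch, the host-local path is the one mentioned below):

```
# on the affected OSD host: the host-local cephadm log, with timestamps
less /var/log/ceph/cephadm.log

# from a node with an admin keyring: the aggregated cephadm log across all hosts
ceph log last 200 debug cephadm
```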

Did you collect more info in the meantime?

Thanks,
Eugen

Quoting "Kuhring, Mathias":

I'm afraid the parameter mgr/cephadm/default_cephadm_command_timeout is buggy.
Once it is not on the default anymore, the MGR prepares the parameter a bit (e.g. subtracting 5 secs)
and thereby turns it into a float, but cephadm is not having it (not even if I try the default 900 myself):


[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s):  
osd.all-available-devices
osd.all-available-devices: cephadm exited with an error code: 2,  
stderr:usage:  
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b

   [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
   [--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
   [--sysctl-dir SYSCTL_DIR] [--unit-dir UNIT_DIR] [--verbose]
   [--timeout TIMEOUT] [--retry RETRY] [--env ENV] [--no-container-init]
   [--no-cgroups-split]

{version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,zap-osds,unit,logs,bootstrap,deploy,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,host-maintenance,agent,disk-rescan}

   ...
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b:  
error: argument --timeout: invalid int value: '895.0'


This also led to a status panic spiral reporting plenty of hosts and services as missing or failing (I assume orch was failing due to cephadm complaining about the parameter).
I got it under control by removing the parameter again from the config (ceph config rm mgr mgr/cephadm/default_cephadm_command_timeout)
and then restarting all MGRs manually (systemctl restart ..., again since orch was kinda useless at this stage).


Anyhow, is there any other way I can adapt this parameter?
Or maybe look into speeding up LV creation (if this is the bottleneck)?

Thanks a lot,
Mathias

-Original Message-
From: Kuhring, Mathias 
Sent: Friday, March 22, 2024 5:38 PM
To: Eugen Block ; ceph-users@ceph.io
Subject: [ceph-users] Re: [ext] Re: cephadm auto disk preparation  
and OSD installation incomplete


Hey Eugen,

Thank you for the quick reply.

The 5 missing disks on the one host were completely installed after  
I fully cleaned them up as I described.

So, it seems a smaller number of disks can make it.

Regarding the other host with 40 disks:
Failing the MGR didn't have any effect.
There are no errors in `/var/log/ceph/cephadm.log`.
But there are a bunch of repeating image listings like:
cephadm --image quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09 --timeout 895 ls


But `ceph log last 200 debug cephadm` gave me a bunch of interesting  
errors (Excerpt below. Is there any preferred method to provide  
bigger logs?).


So, there are some timeouts, which might play into the assumption that ceph-volume is a bit overwhelmed by the number of disks.
It's only a guess, but maybe LV creation is taking way too long (is cephadm waiting for all of them in bulk?) and times out with the default 900 secs.
However, the LVs are created anyway, and cephadm will not consider them in the next round ("has a filesystem").


I'm testing this theory right now by bumping up the limit to 2 hours (and restarting with "fresh" disks again):

ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 7200

However, there are also mentions of the host not being reachable: "Unable to reach remote host ceph-3-11".
But this seems to be limited to cephadm / ceph orch, so basically the MGR, but not the rest of the cluster (i.e. MONs, OSDs, etc. are communicating happily, as far as I can tell).


During my fresh run, I do notice more hosts being apparently down:
0|0[root@ceph-3-10 ~]# ceph orch host ls | grep Offline
ceph-3-7  172.16.62.38  rgw,osd,_admin Offline
ceph-3-10 172.16.62.41  rgw,osd,_admin,prometheus  Offline
ceph-3-11 172.16.62.43  rgw,osd,_admin Offline
osd-mirror-2  172.16.62.23  rgw,osd,_admin Offline
osd-mirror-3  172.16.62.24  rgw,osd,_ad

[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-03-22 Thread Kuhring, Mathias
I'm afraid the parameter mgr/cephadm/default_cephadm_command_timeout is buggy.
Once it is not on the default anymore, the MGR prepares the parameter a bit (e.g. subtracting 5 secs)
and thereby turns it into a float, but cephadm is not having it (not even if I try the default 900 myself):

[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): 
osd.all-available-devices
osd.all-available-devices: cephadm exited with an error code: 2, 
stderr:usage: 
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b
   [-h] [--image IMAGE] [--docker] [--data-dir DATA_DIR]
   [--log-dir LOG_DIR] [--logrotate-dir LOGROTATE_DIR]
   [--sysctl-dir SYSCTL_DIR] [--unit-dir UNIT_DIR] [--verbose]
   [--timeout TIMEOUT] [--retry RETRY] [--env ENV] [--no-container-init]
   [--no-cgroups-split]
   
{version,pull,inspect-image,ls,list-networks,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,zap-osds,unit,logs,bootstrap,deploy,check-host,prepare-host,add-repo,rm-repo,install,registry-login,gather-facts,host-maintenance,agent,disk-rescan}
   ...
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b: 
error: argument --timeout: invalid int value: '895.0'
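
For reference, this is roughly how it surfaces (values taken from the error above; the exact flow on the MGR side is my assumption):

```
# any explicit value gets reduced by 5s and handed to cephadm as a float,
# e.g. 900 ends up as "--timeout 895.0", which cephadm's integer-only option rejects
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 900

# the failed apply then shows up as a health warning
ceph health detail | grep -A2 CEPHADM_APPLY_SPEC_FAIL
```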

This also led to a status panic spiral reporting plenty of hosts and services as missing or failing (I assume orch was failing due to cephadm complaining about the parameter).
I got it under control by removing the parameter again from the config (ceph config rm mgr mgr/cephadm/default_cephadm_command_timeout)
and then restarting all MGRs manually (systemctl restart ..., again since orch was kinda useless at this stage).
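
In commands, roughly (the MGR systemd unit name depends on the cluster FSID and hostname, so the one below is only a placeholder):

```
# drop the custom timeout so cephadm falls back to its built-in default
ceph config rm mgr mgr/cephadm/default_cephadm_command_timeout

# restart each MGR daemon by hand on its host, since 'ceph orch' was unusable at this point
systemctl restart ceph-<fsid>@mgr.<host>.<id>.service
```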

Anyhow, is there any other way I can adapt this parameter?
Or maybe look into speeding up LV creation (if this is the bottleneck)?

Thanks a lot,
Mathias

-Original Message-
From: Kuhring, Mathias  
Sent: Friday, March 22, 2024 5:38 PM
To: Eugen Block ; ceph-users@ceph.io
Subject: [ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD 
installation incomplete

Hey Eugen,

Thank you for the quick reply.

The 5 missing disks on the one host were completely installed after I fully 
cleaned them up as I described.
So, it seems a smaller number of disks can make it.

Regarding the other host with 40 disks:
Failing the MGR didn't have any effect.
There are no errors in `/var/log/ceph/cephadm.log`.
But there are a bunch of repeating image listings like:
cephadm --image quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09 --timeout 895 ls

But `ceph log last 200 debug cephadm` gave me a bunch of interesting errors 
(Excerpt below. Is there any preferred method to provide bigger logs?).

So, there are some timeouts, which might play into the assumption that ceph-volume is a bit overwhelmed by the number of disks.
It's only a guess, but maybe LV creation is taking way too long (is cephadm waiting for all of them in bulk?) and times out with the default 900 secs.
However, the LVs are created anyway, and cephadm will not consider them in the next round ("has a filesystem").

I'm testing this theory right now by bumping up the limit to 2 hours (and restarting with "fresh" disks again):
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 7200
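
For completeness, the value the active MGR actually sees can be checked with (a quick sketch):

```
ceph config get mgr mgr/cephadm/default_cephadm_command_timeout
```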

However, there are also mentions of the host not being reachable: "Unable to reach remote host ceph-3-11".
But this seems to be limited to cephadm / ceph orch, so basically the MGR, but not the rest of the cluster (i.e. MONs, OSDs, etc. are communicating happily, as far as I can tell).

During my fresh run, I do notice more hosts being apparently down:
0|0[root@ceph-3-10 ~]# ceph orch host ls | grep Offline
ceph-3-7  172.16.62.38  rgw,osd,_admin Offline
ceph-3-10 172.16.62.41  rgw,osd,_admin,prometheus  Offline
ceph-3-11 172.16.62.43  rgw,osd,_admin Offline
osd-mirror-2  172.16.62.23  rgw,osd,_admin Offline
osd-mirror-3  172.16.62.24  rgw,osd,_admin Offline

But I wonder if this is just a side effect of the MGR (cephadm/orch) being too busy/overwhelmed with, e.g., deploying the new OSDs.
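
One way to double-check that from the orchestrator side (hostname taken from the list above; assuming the Offline flag is purely a cephadm/SSH-level state):

```
# let the cephadm module re-test SSH connectivity and host prerequisites
ceph cephadm check-host ceph-3-11

# then see whether the host drops the Offline flag again
ceph orch host ls | grep ceph-3-11
```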

I will update you once the next round is done or failed.

Best Wishes,
Mathias


ceph log last 200 debug cephadm
...
2024-03-20T09:19:24.917834+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 82122 : 
cephadm [INF] Detected new or changed devices on ceph-3-11
2024-03-20T09:34:28.877718+ mgr.osd-mirror-4.dkzbkw (mgr.339518816) 83339 : 
cephadm [ERR] Failed to apply osd.all-available-devices spec 
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
all: true
  filter_logic: AND
  objectstore: bluestore
''')): Command timed out on host cephadm deploy (osd daemon) (default 900 
second timeout) ...

raise TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another excepti

[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-03-22 Thread Kuhring, Mathias
pjq5uxhj1:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:1fb108217b110c01c480e32d0cfea0e19955733537af7bb8cbae165222496e09 lvm batch --no-auto /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz --yes --no-systemd
...

/usr/bin/docker: stderr raise RuntimeError("Device {} has a filesystem.".format(self.dev_path))
/usr/bin/docker: stderr RuntimeError: Device /dev/sdm has a filesystem.
...

RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/7efa00f9-182f-40f4-9136-d51895db1f0b/config/ceph.conf
...
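
To see which left-over signature ceph-volume is tripping over on such a device, read-only checks like these could help (a sketch; the device name is the one from the error above):

```
# list filesystem/LVM signatures on the device without touching it
lsblk -f /dev/sdm
wipefs --no-act /dev/sdm
```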

-Original Message-
From: Eugen Block  
Sent: Thursday, March 21, 2024 3:28 PM
To: ceph-users@ceph.io
Subject: [ext] [ceph-users] Re: cephadm auto disk preparation and OSD 
installation incomplete

Hi,

Before getting into that, the first thing I would do is fail the mgr. There have been many issues where simply failing over the mgr resolved them.
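
I.e. simply this, which bounces the orchestrator over to a standby MGR:

```
ceph mgr fail
```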
If that doesn't help, the cephadm.log should show something useful 
(/var/log/ceph/cephadm.log on the OSD hosts, I'm still not too familiar with 
the whole 'ceph log last 200 debug cephadm' thing).
I remember reports in earlier versions of ceph-volume (probably
pre-cephadm) where not all OSDs were created if the host had many disks to 
deploy. But I can't find those threads right now.
And it's strange that on the second cluster no OSD is created at all, but 
again, maybe fail the mgr first before looking deeper into it.

Regards,
Eugen

Quoting "Kuhring, Mathias":

> Dear ceph community,
>
> We have trouble with new disks not being properly prepared and OSDs not being fully installed by cephadm.
> We just added one new node with ~40 HDDs to each of two of our ceph clusters.
> In one cluster all but 5 disks got installed automatically.
> In the other none got installed.
>
> We are on ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) on both 
> clusters.
> (I haven't added new disks since the last upgrade if I recall correctly).
>
> This is our OSD service definition:
> ```
> 0|0[root@ceph-3-10 ~]# ceph orch ls osd --export
> service_type: osd
> service_id: all-available-devices
> service_name: osd.all-available-devices
> placement:
>   host_pattern: '*'
> spec:
>   data_devices:
> all: true
>   filter_logic: AND
>   objectstore: bluestore
> ---
> service_type: osd
> service_id: unmanaged
> service_name: osd.unmanaged
> unmanaged: true
> spec:
>   filter_logic: AND
>   objectstore: bluestore
> ```
>
> Usually, new disks are installed properly (as expected due to 
> all-available-devices).
> This time, I can see that LVs were created (via `lsblk`, `lvs`, 
> `cephadm ceph-volume lvm list`).
> And OSDs are entered into the crushmap.
> However, they are not assigned to a host yet, nor do they have a type 
> or weight, e.g.:
> ```
> 0|0[root@ceph-2-10 ~]# ceph osd tree | grep "0  osd"
> 518  0  osd.518   down 0  1.0
> 519  0  osd.519   down 0  1.0
> 520  0  osd.520   down 0  1.0
> 521  0  osd.521   down 0  1.0
> 522  0  osd.522   down 0  1.0
> ```
>
> And there is also no OSD daemon created (no docker container).
> So, OSD creation is somehow stuck halfway.
>
> I thought of fully cleaning up the OSD/disks.
> Hoping cephadm might pick them up properly next time.
> Just zapping was not possible, e.g. `cephadm ceph-volume lvm zap 
> --destroy /dev/sdab` results in these errors:
> ```
> /usr/bin/docker: stderr  stderr: wipefs: error: /dev/sdab: probing 
> initialization failed: Device or resource busy
> /usr/bin/docker: stderr --> failed to wipefs device, will try again to workaround probable race condition
> ```
>
> So, I cleaned up more manually by purging them from crush and "resetting" disk and LV with dd and dmsetup, respectively:
> ```
> ceph osd purge 480 --force
> dd if=/dev/zero of=/dev/sdab bs=1M count=1
> dmsetup remove ceph--e10e0f08--8705--441a--8caa--4590de22a611-osd--block--d464211c--f513--4513--86c1--c7ad63e6c142
> ```
>
> ceph-volume still reported the old volumes, but then zapping actually 
> got rid of them (only cleaned out the left-over entries, I guess).
>
> Now, cephadm was able to get one OSD up when I did this cleanup for only one disk.
> When I did it in bulk for the rest, they all got stuck again the same way.
>
> Looking into ceph-volume logs (here for osd.522 as representative):
> ```
> 0|0[root@ceph-2-11 /var/log/ceph/55633ec3-6c0c-4a02-990c-0f87e0f7a01f