[ceph-users] Re: CEPHADM_HOST_CHECK_FAILED

2024-04-04 Thread Adam King
First, I guess I would make sure that peon7 and peon12 can actually pass
the host check (you can run "cephadm check-host" on the host directly if
you have a copy of the cephadm binary there). Then I'd try a mgr failover
(ceph mgr fail) to clear out any in-memory host values cephadm might have
and restart the module. If it still reproduces after that, you might have
to set mgr/cephadm/log_to_cluster_level to debug, do another mgr failover,
wait until the module crashes, and see if "ceph log last 100 debug
cephadm" gives more info on where the crash occurred (it might include an
actual traceback).
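
Roughly, that sequence would look like this (just a sketch; the config
command below is the usual way to flip that setting, not something taken
from your output):

On the affected host, with a copy of the cephadm binary there:

  cephadm check-host

Then from a node with the admin keyring:

  ceph mgr fail
  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph mgr fail
  ceph log last 100 debug cephadm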

On Thu, Apr 4, 2024 at 4:51 AM  wrote:

> Hi,
>
> I've added some new nodes to our Ceph cluster. I only did the host add and
> had not added the OSDs yet.
> Due to a configuration error I had to reinstall some of them, but I forgot
> to remove the nodes from Ceph first. I did a "ceph orch host rm peon7
> --offline --force" before re-adding them to the cluster.
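>
> Per host, that was roughly the following (the address is the one from the
> host list below; the add is the implied re-add step):
>
>   ceph orch host rm peon7 --offline --force
>   ceph orch host add peon7 10.103.0.47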
>
> All the nodes are showing up in the host list (all the peons are the new
> ones):
>
> # ceph orch host ls
> HOST         ADDR         LABELS  STATUS
> ceph1        10.103.0.71
> ceph2        10.103.0.72
> ceph3        10.103.0.73
> ceph4        10.103.0.74
> compute1     10.103.0.11
> compute2     10.103.0.12
> compute3     10.103.0.13
> compute4     10.103.0.14
> controller1  10.103.0.8
> controller2  10.103.0.9
> controller3  10.103.0.10
> peon1        10.103.0.41
> peon2        10.103.0.42
> peon3        10.103.0.43
> peon4        10.103.0.44
> peon5        10.103.0.45
> peon6        10.103.0.46
> peon7        10.103.0.47
> peon8        10.103.0.48
> peon9        10.103.0.49
> peon10       10.103.0.50
> peon12       10.103.0.52
> peon13       10.103.0.53
> peon14       10.103.0.54
> peon15       10.103.0.55
> peon16       10.103.0.56
>
> But Ceph status still shows an error, which I can't seem to get rid of.
>
> [WRN] CEPHADM_HOST_CHECK_FAILED: 2 hosts fail cephadm check
> host peon7 (10.103.0.47) failed check: Can't communicate with remote
> host `10.103.0.47`, possibly because python3 is not installed there or you
> are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.47', 22)
> host peon12 (10.103.0.52) failed check: Can't communicate with remote
> host `10.103.0.52`, possibly because python3 is not installed there or you
> are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.52', 22)
> [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'peon7'
> Module 'cephadm' has failed: 'peon7'
>
> From the mgr log:
>
> Apr 04 08:33:46 controller2 bash[4031857]: debug
> 2024-04-04T08:33:46.876+ 7f2bb5710700 -1 mgr.server reply reply (5)
> Input/output error Module 'cephadm' has experienced an error and cannot
> handle commands: 'peon7'
>
> Any idea how to clear this error?
>
> # ceph --version
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
> (stable)
>
>
> Regards,
> Arnoud de Jonge.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPHADM_HOST_CHECK_FAILED after reboot of nodes

2021-07-12 Thread mabi
I have now opened a tracker issue, as this must be a bug in cephadm:

https://tracker.ceph.com/issues/51629

Hopefully someone has time to look into that.

Thank you in advance.

‐‐‐ Original Message ‐‐‐

On Friday, July 9th, 2021 at 8:11 AM, mabi  wrote:

> Hello,
>
> I rebooted all 8 nodes of my Octopus 15.2.13 cluster, which runs on Ubuntu
> 20.04 LTS with cephadm, and since then cephadm sees 7 nodes as unreachable,
> as you can see below:
>
> [WRN] CEPHADM_HOST_CHECK_FAILED: 7 hosts fail cephadm check
>
> host ceph1d failed check: Can't communicate with remote host `ceph1d`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1g failed check: Can't communicate with remote host `ceph1g`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1c failed check: Can't communicate with remote host `ceph1c`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1e failed check: Can't communicate with remote host `ceph1e`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1f failed check: Can't communicate with remote host `ceph1f`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1b failed check: Can't communicate with remote host `ceph1b`, 
> possibly because python3 is not installed there: [Errno 32] Broken pipe
>
> host ceph1h failed check: Failed to connect to ceph1h (ceph1h).
>
> Please make sure that the host is reachable and accepts connections using the 
> cephadm SSH key
>
> To add the cephadm SSH key to the host:
>
> > ceph cephadm get-pub-key > ~/ceph.pub
> >
> > ssh-copy-id -f -i ~/ceph.pub root@ceph1h
>
> To check that the host is reachable:
>
> > ceph cephadm get-ssh-config > ssh_config
> >
> > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> >
> > chmod 0600 ~/cephadm_private_key
> >
> > ssh -F ssh_config -i ~/cephadm_private_key root@ceph1h
>
> I checked and SSH is working and python3 is installed on all nodes.
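>
> For example, a loop along these lines (reusing the ssh_config and key
> extracted as above) succeeds on every node:
>
> for h in ceph1b ceph1c ceph1d ceph1e ceph1f ceph1g ceph1h; do
>   ssh -F ssh_config -i ~/cephadm_private_key root@$h python3 --version
> done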
>
> As you can see here "ceph orch host ls" also shows nodes as offline:
>
> ceph orch host ls
>
> HOST    ADDR    LABELS      STATUS
> ceph1a  ceph1a  _admin mon
> ceph1b  ceph1b  _admin mon  Offline
> ceph1c  ceph1c  _admin mon  Offline
> ceph1d  ceph1d              Offline
> ceph1e  ceph1e              Offline
> ceph1f  ceph1f              Offline
> ceph1g  ceph1g  mds         Offline
> ceph1h  ceph1h  mds         Offline
>
> Does anyone have a clue how I can fix that? cephadm seems to be broken...
>
> Thank you for your help.
>
> Regards,
>
> Mabi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io