Hi Chris

Having also recently started exploring Ceph, I happened upon this problem too.
I found that terminating the command with Ctrl-C seemed to stop the looping,
which, incidentally, also happens on all the other mgr instances in the cluster.
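
In case it helps others, this is roughly how I convinced myself the looping had
actually stopped after the Ctrl-C. This is only an illustration, not a fix; the
daemon names are of course specific to each cluster, and if the --daemon-type
flag differs on your release, plain "ceph orch ps" works as well:

  # watch the status/log stream and check the active mgr uptime keeps growing
  $ ceph -w
  $ ceph -s | grep mgr
  # list the mgr daemons cephadm knows about and their current status
  $ ceph orch ps --daemon-type mgr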

Regards

Jens

-----Original Message-----
From: Chris Read <chris.r...@gmail.com> 
Sent: 11 January 2021 21:54
To: ceph-users@ceph.io
Subject: [ceph-users] "ceph orch restart mgr" command creates mgr restart loop

Greetings all...

I'm busy testing out Ceph and have hit this troublesome bug while following the 
steps outlined here:

https://docs.ceph.com/en/octopus/cephadm/monitoring/#configuring-ssl-tls-for-grafana
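
For reference, the steps on that page amount to roughly the following. This is
reproduced from memory, so treat the exact config-key names as an approximation
and check the page above for the authoritative version; the final restart is the
command that triggers the problem described below:

  # store the TLS key/cert for Grafana in the cluster's config-key store
  $ ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
  $ ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem
  # restart the mgr so the new certificate gets picked up
  $ ceph orch restart mgr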

When I issue the "ceph orch restart mgr" command, it appears the command is not 
cleared from a message queue somewhere (I'm still very unclear on many Ceph 
internals), so each time the mgr process comes back up after the restart it picks 
up the same message again and restarts itself, forever (so far it's been stuck 
in this state for 45 minutes).
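
The repeated dispatch is also visible in the audit channel; something along
these lines shows the same "orch restart mgr" command arriving again after every
restart (the argument order for "ceph log last" may vary between releases):

  # dump the last 50 audit-channel entries at info level
  $ ceph log last 50 info audit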

Watching the logs we see this going on:


root@ceph-poc-000:~# ceph log last cephadm -w
  cluster:
    id:     d23bc326-543a-11eb-bfe0-b324db228b6c
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum
ceph-poc-000,ceph-poc-003,ceph-poc-004,ceph-poc-002,ceph-poc-001 (age 2h)
    mgr: ceph-poc-000.himivo(active, since 4s), standbys:
ceph-poc-001.unjulx
    osd: 10 osds: 10 up (since 2h), 10 in (since 2h)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   10 GiB used, 5.4 TiB / 5.5 TiB avail
    pgs:     1 active+clean


2021-01-11T20:46:32.976606+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:46:32.980749+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:46:33.061519+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:46:39.156420+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:46:39.160618+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:46:39.242603+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:46:45.299953+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:46:45.304006+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:46:45.733495+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:46:51.871903+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:46:51.877107+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:46:51.976190+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:46:58.000720+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:46:58.006843+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:46:58.097163+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:47:04.188630+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:47:04.193501+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:47:04.285509+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available
2021-01-11T20:47:10.348099+0000 mon.ceph-poc-000 [INF] Active manager daemon 
ceph-poc-000.himivo restarted
2021-01-11T20:47:10.352340+0000 mon.ceph-poc-000 [INF] Activating manager 
daemon ceph-poc-000.himivo
2021-01-11T20:47:10.752243+0000 mon.ceph-poc-000 [INF] Manager daemon 
ceph-poc-000.himivo is now available

And in the logs for the mgr instance itself, we can see it replaying the message 
over and over:

$ docker logs -f
ceph-d23bc326-543a-11eb-bfe0-b324db228b6c-mgr.ceph-poc-000.himivo
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 set uid:gid to 167:167
(ceph:ceph)
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process ceph-mgr, pid 1
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 pidfile_write: ignore empty --pid-file
debug 2021-01-11T20:47:31.414+0000 7f48b0d0d200  1 mgr[py] Loading python module 'alerts'
debug 2021-01-11T20:47:31.486+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'balancer'
debug 2021-01-11T20:47:31.542+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'cephadm'
debug 2021-01-11T20:47:31.742+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'crash'
debug 2021-01-11T20:47:31.798+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'dashboard'
debug 2021-01-11T20:47:32.258+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'devicehealth'
debug 2021-01-11T20:47:32.306+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'diskprediction_local'
debug 2021-01-11T20:47:32.498+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'influx'
debug 2021-01-11T20:47:32.550+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'insights'
debug 2021-01-11T20:47:32.598+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'iostat'
debug 2021-01-11T20:47:32.642+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'k8sevents'
debug 2021-01-11T20:47:33.034+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'localpool'
debug 2021-01-11T20:47:33.082+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'orchestrator'
debug 2021-01-11T20:47:33.266+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'osd_support'
debug 2021-01-11T20:47:33.310+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'pg_autoscaler'
debug 2021-01-11T20:47:33.382+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'progress'
debug 2021-01-11T20:47:33.442+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'prometheus'
debug 2021-01-11T20:47:33.794+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'rbd_support'
debug 2021-01-11T20:47:33.870+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'restful'
debug 2021-01-11T20:47:34.086+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'rook'
debug 2021-01-11T20:47:34.606+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'selftest'
debug 2021-01-11T20:47:34.654+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'status'
debug 2021-01-11T20:47:34.710+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'telegraf'
debug 2021-01-11T20:47:34.762+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'telemetry'
debug 2021-01-11T20:47:35.034+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'test_orchestrator'
debug 2021-01-11T20:47:35.146+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'volumes'
debug 2021-01-11T20:47:35.294+0000 7f48b0d0d200  1 mgr[py] Loading python 
module 'zabbix'
debug 2021-01-11T20:47:35.366+0000 7f489df86700  0 ms_deliver_dispatch:
unhandled message 0x55d57b71a420 mon_map magic: 0 v1 from mon.0 v2:
10.128.18.16:3300/0
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map 
Activating!
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map I am now activating
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  0 [balancer DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  1 mgr load Constructed class from module: balancer
debug 2021-01-11T20:47:35.774+0000 7f484ca14700  0 [cephadm DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed class from module: cephadm
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  0 [crash DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed class from module: crash
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: dashboard
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: devicehealth
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [iostat DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: iostat
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  1 mgr load Constructed class from module: orchestrator
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed class from module: pg_autoscaler
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [progress DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed class from module: progress
debug 2021-01-11T20:47:35.898+0000 7f484ca14700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.910+0000 7f484ca14700  1 mgr load Constructed class from module: prometheus
debug 2021-01-11T20:47:35.914+0000 7f484ca14700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
[11/Jan/2021:20:47:35] ENGINE Bus STARTING
Warning: Permanently added 'ceph-poc-000,10.128.18.16' (ECDSA) to the list of 
known hosts.
Warning: Permanently added 'ceph-poc-002,10.128.19.248' (ECDSA) to the list of 
known hosts.
debug 2021-01-11T20:47:35.974+0000 7f484ca14700  1 mgr load Constructed class 
from module: rbd_support
Warning: Permanently added 'ceph-poc-001,10.128.28.223' (ECDSA) to the list of 
known hosts.
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [restful DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: restful
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [status DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: status
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [telemetry DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: telemetry
debug 2021-01-11T20:47:35.982+0000 7f486f3cc700  0 [restful WARNING root] server not running: no certificate configured
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [volumes DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
Warning: Permanently added 'ceph-poc-004,10.128.31.60' (ECDSA) to the list of 
known hosts.
debug 2021-01-11T20:47:35.994+0000 7f484ca14700  1 mgr load Constructed class 
from module: volumes
Warning: Permanently added 'ceph-poc-003,10.128.30.127' (ECDSA) to the list of 
known hosts.
CherryPy Checker:
The Application mounted at '' has an empty config.

[11/Jan/2021:20:47:36] ENGINE Serving on http://:::9283
[11/Jan/2021:20:47:36] ENGINE Bus STARTED
debug 2021-01-11T20:47:36.778+0000 7f484d215700  0 log_channel(audit) log [DBG] : from='client.19140 -' entity='client.admin' cmd=[{"prefix": "orch", "action": "restart", "service_name": "mgr", "target": ["mon-mgr", ""]}]: dispatch
debug 2021-01-11T20:47:36.778+0000 7f484ca14700  0 log_channel(cephadm) log [INF] : Restart service mgr
debug 2021-01-11T20:47:36.806+0000 7f484d215700  0 log_channel(cluster) log [DBG] : pgmap v3: 1 pgs: 1 active+clean; 0 B data, 208 MiB used, 5.4 TiB / 5.5 TiB avail
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 mgr handle_mgr_signal *** Got signal Terminated ***


Is there any way to manually clear this stuck message? Will it eventually time 
out by itself?
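
(For completeness, two commands that do exist and that one might reach for here,
included purely as an untested sketch rather than a known fix: pausing the
orchestrator so it stops acting on pending operations, and failing over the
active mgr.)

  # stop the cephadm orchestrator from scheduling further actions, then resume
  $ ceph orch pause
  $ ceph orch resume
  # force a mgr failover to a standby (daemon name is cluster-specific)
  $ ceph mgr fail ceph-poc-000.himivo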

Either way, it's a pretty unpleasant surprise to hit so early in the process 
while following the docs.

Thanks,

Chris
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
