On Sat, Apr 15, 2023 at 4:58 PM Max Boone <[email protected]> wrote:
>
>
> After a critical node failure on my lab cluster (the node won't come
> back up and is still down), the RBD objects are still being watched /
> mounted according to Ceph. I can't shell into the node to unmap them
> with rbd as the node is down. I am absolutely certain that nothing is
> using these images, and they don't have snapshots either (and this IP
> is not even remotely close to those of the monitors in the cluster).
> I blocked the IP using ceph osd blocklist add, but after 30 minutes
> they are still being watched. The watches (they are RWO ceph-csi
> volumes) prevent me from re-using them in the cluster. As far as I'm
> aware, Ceph should remove the watchers after 30 minutes, and they've
> been blocklisted for hours now.
Hi Max,
A couple of general points:
- watch timeout is 30 seconds, not 30 minutes
- watcher IP doesn't have to match that of any of the monitors
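For reference, the sequence that would normally clear a stale watcher
looks roughly like this (the address, pool and header object are taken
from your output below; "0/0" blocklists any client instance coming
from that address):

  ceph osd blocklist add 10.0.0.103:0/0
  ceph osd blocklist ls
  # the watch should lapse within osd_client_watch_timeout (30 seconds)
  rados -p kubernetes listwatchers rbd_header.4ff5353b865e1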
> root@node0:~# rbd status kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> Watchers:
>         watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# rbd snap list kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> root@node0:~# rbd info kubernetes/csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff
> rbd image 'csi-vol-e6a07ccd-93f6-4c47-a948-201501440fff':
>         size 10 GiB in 2560 objects
>         order 22 (4 MiB objects)
>         snapshot_count: 0
>         id: 4ff5353b865e1
>         block_name_prefix: rbd_data.4ff5353b865e1
>         format: 2
>         features: layering
>         op_features:
>         flags:
>         create_timestamp: Fri Mar 31 14:46:51 2023
>         access_timestamp: Fri Mar 31 14:46:51 2023
>         modify_timestamp: Fri Mar 31 14:46:51 2023
> root@node0:~# rados -p kubernetes listwatchers rbd_header.4ff5353b865e1
> watcher=10.0.0.103:0/992994811 client.1634081 cookie=139772597209280
> root@node0:~# ceph osd blocklist ls
> 10.0.0.103:0/0 2023-04-16T13:58:34.854232+0200
> listed 1 entries
> root@node0:~# ceph daemon osd.0 config get osd_client_watch_timeout
> {
>     "osd_client_watch_timeout": "30"
> }
>
> Is it possible to kick a watcher out manually, or is there not much
> I can do here besides shutting down the entire cluster (or OSDs) and
> bringing them back up? If it is a bug, I'm happy to help figure out
> its root cause and see if I can help write a fix. Cheers, Max.
You may have hit https://tracker.ceph.com/issues/58120.
Try restarting the OSD that is holding the header object. To determine
the OSD, run "ceph osd map kubernetes rbd_header.4ff5353b865e1". The
output should end with something like "acting ([X, Y, Z], pX)", where X,
Y and Z are numbers. X is the OSD you want to restart.
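For example (osd id 3 here is purely illustrative; substitute whatever
"ceph osd map" reports for your cluster, and restart the daemon however
your OSDs are deployed):

  ceph osd map kubernetes rbd_header.4ff5353b865e1
  # -> ... acting ([3, 1, 4], p3)   => the primary is osd.3
  systemctl restart ceph-osd@3       # traditional / packaged deployment
  ceph orch daemon restart osd.3     # cephadm-managed deployment

Once that OSD has restarted, the stale watch should be gone and
"rbd status" should show no watchers.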
Thanks,
Ilya
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]