Dear All,

Unfortunately the MDS has crashed on our Mimic cluster...

The first symptom was rsync failing with:
"No space left on device (28)"
when trying to rename or delete files.

This prompted me to try restarting the MDS, as it was reported as laggy.

On restart, the MDS logs this error just before it crashes:

elist.h: 39: FAILED assert(!is_on_list())

A full MDS log showing the crash is here:

I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...

The cluster has 10 nodes and 254 OSDs, and uses EC for the data and 3x
replication for MDS. We have a single active MDS, with two failover MDSs.

We have ~2PB of CephFS data here, all of which is currently
inaccessible. Any and all advice gratefully received :)

best regards,

