Hi Alessandro,

What is the state of your PGs? Inactive PGs have blocked CephFS
recovery on our cluster before. I'd try to clear any blocked ops and
see if the MDSes recover.
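
Roughly what I'd look at first -- just a sketch, adjust the osd IDs and
mds names to your cluster (I'm guessing mds1/mds2 from your log):

  ceph health detail                       # any inactive/stuck PGs or slow requests?
  ceph pg dump_stuck inactive              # PGs that aren't active block metadata I/O
  ceph osd dump | head -1                  # current osdmap epoch, should be >= 21467 from your log
  ceph osd blacklist ls                    # confirm the prior MDS instances got blacklisted
  ceph daemon osd.<id> dump_blocked_ops    # run on an OSD that reports slow requests
  ceph daemon mds.mds1 dump_ops_in_flight  # what the MDS itself is waiting on

If the PGs backing the cephfs metadata pool aren't active, the MDSes
can't read their journals and will sit in up:replay until that clears.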

--Lincoln

On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:
> Hi,
> 
> I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded.
> 
> I have 2 active mds instances and 1 standby. All the active instances
> are now in replay state and show the same error in the logs:
> 
> 
> ---- mds1 ----
> 
> 2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 164
> starting mds.mds1 at -
> 2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore empty --pid-file
> 2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map standby
> 2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map i am now mds.1.20635
> 2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635 handle_mds_map state change up:boot --> up:replay
> 2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
> 2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635  recovery set is 0
> 2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635  waiting for osdmap 21467 (which blacklists prior instance)
> 
> 
> ---- mds2 ----
> 
> 2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 295
> starting mds.mds2 at -
> 2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore empty --pid-file
> 2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map standby
> 2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map i am now mds.0.20637
> 2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637 handle_mds_map state change up:boot --> up:replay
> 2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
> 2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637  recovery set is 1
> 2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637  waiting for osdmap 21467 (which blacklists prior instance)
> 2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating system inode with ino:0x100
> 2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating system inode with ino:0x1
> 
> The cluster is recovering as we are replacing some of the osds, and
> there are a few slow/stuck requests, but I'm not sure if this is the
> cause, as there is apparently no data loss (so far).
> 
> How can I force the MDSes to quit the replay state?
> 
> Thanks for any help,
> 
> 
>      Alessandro
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
