Thanks John, I got this in the mds log too:

2017-07-11 07:10:06.293219 7f1836837700 1 mds.beacon.b _send skipping beacon, heartbeat map not healthy
2017-07-11 07:10:08.330979 7f183b942700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

but that respawn happened 2 minutes after I got this:

2017-07-11 07:10:10.948237 7f183993e700 0 mds.beacon.b handle_mds_beacon no longer laggy

which confuses me. Could it be a network issue? Local network communication was fine at that time. It might be a bug.

While it was recovering, it was stuck in the rejoin_joint_start state for almost 50 minutes.

2017-07-11 07:13:36.587188 7f264a112700 1 mds.0.890528 rejoin_joint_start
[...]
2017-07-11 07:56:21.521006 7f0f78917700 1 mds.0.890537 recovery_done -- successful recovery!
2017-07-11 07:56:21.522570 7f0f78917700 1 mds.0.890537 active_start
2017-07-11 07:56:21.533507 7f0f78917700 1 mds.0.890537 cluster recovered.

Using "ceph daemon mds.b perf dump mds" I could see that it was scanning the inodes. But when this happens (which is quite often) I have no way to tell when it will finish.

Many of the other times this happened, it was because of a crash (http://tracker.ceph.com/issues/20535), but that was not the case today.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Jul 11, 2017 at 11:36 AM, John Spray <[email protected]> wrote:

> On Tue, Jul 11, 2017 at 3:23 PM, Webert de Souza Lima
> <[email protected]> wrote:
> > Hello,
> >
> > today I got a MDS respawn with the following message:
> >
> > 2017-07-11 07:07:55.397645 7ffb7a1d7700 1 mds.b handle_mds_map i
> > (10.0.1.2:6822/28190) dne in the mdsmap, respawning myself
>
> "dne in the mdsmap" is what an MDS says when the monitors have
> concluded that the MDS is dead, but the MDS is really alive. "dne"
> stands for "does not exist", so the MDS is complaining that it has
> been removed from the mdsmap.
>
> The message could definitely be better worded!
>
> You can see this happen in certain buggy cases where the MDS is
> failing to send beacon messages to the mons, even though it is really
> alive -- if you're stuck in rejoin, then that is probably related: try
> increasing the log verbosity to work out where the MDS is stuck while
> it's sitting in the rejoin state.
>
> John
>
> >
> > it happened 3 times within 5 minutes. After so, the MDS took 50 minutes to
> > recover.
> > I can't find what exactly that message means and how to avoid it.
> >
> > I'll be glad to provide any further information. Thanks!
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
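
P.S. To put a number on how long the MDS sits between two states, something like the rough sketch below could be run against the MDS log. It assumes every line starts with a "YYYY-MM-DD HH:MM:SS.ffffff" timestamp, as in the excerpts in this message. The helper names (parse_timestamp, first_time_of, elapsed) are made up for illustration and are not part of Ceph, and the two sample lines are copied from the log above.

```python
from datetime import datetime, timedelta

# Two lines copied from the MDS log excerpt in this message.
LOG_LINES = [
    "2017-07-11 07:13:36.587188 7f264a112700 1 mds.0.890528 rejoin_joint_start",
    "2017-07-11 07:56:21.521006 7f0f78917700 1 mds.0.890537 recovery_done -- successful recovery!",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H:%M:%S.%f")


def first_time_of(lines, marker):
    """Return the timestamp of the first line containing marker, or None."""
    for line in lines:
        if marker in line:
            return parse_timestamp(line)
    return None


def elapsed(lines, start_marker, end_marker):
    """Return the time between the first start_marker and first end_marker line."""
    start = first_time_of(lines, start_marker)
    end = first_time_of(lines, end_marker)
    if start is None or end is None:
        return None  # one of the markers never appeared in the log
    return end - start


if __name__ == "__main__":
    print(elapsed(LOG_LINES, "rejoin_joint_start", "recovery_done"))
```

For the two lines above this gives a little under 43 minutes between rejoin_joint_start and recovery_done. Reading from a real log file instead of LOG_LINES only needs the list replaced by the open file object.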
