On 22/05/2015 15:33, Adam Tygart wrote:
> Hello all,
>
> The ceph-mds servers in our cluster are stuck in a constant
> boot->replay->crash cycle.
>
> I have enabled debug logging for the mds for a restart cycle on one of
> the nodes[1].

You found a bug, or more correctly, you probably found multiple bugs...

It looks like your journal contains an EOpen event that lists 5307092 open files. Because the MDS only drops its lock between events, not while processing a single one, replaying that one huge event is causing the heartbeat map to think the MDS has locked up, so it's getting killed.

So firstly, we need to fix the MDS to make appropriate calls into MDS::heartbeat_reset while iterating over lists of unbounded length in EMetaBlob::replay. That would fix the false death of the MDS resulting from the heartbeat expiry.

Secondly, this EOpen was a 2.6GB log event. Something has almost certainly gone wrong when we see that data structure grow so large, so we should really be imposing an artificial cap there and catching the situation earlier, rather than journalling this monster event and only hitting issues during replay.

Thirdly, something is apparently leading the MDS to think that 5 million files were open in this particular log segment. That seems improbable, given that I can only see a single client in action here, so more investigation is needed to work out how this happened. Can you describe the client workload that was going on in the run-up to the system breaking?

Anyway, actions:

1. I'm assuming your metadata is not sensitive, as you have shared this debug log. Please could you use "cephfs-journal-tool journal export ~/journal.bin" to grab an offline copy of the raw journal, in case we need to look at it later (this might take a while since your journal seems so large, but the resulting file should compress reasonably well with "tar cSzf").
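Something along these lines should do it (the file names here are just examples, put it wherever you have space):

    cephfs-journal-tool journal export ~/journal.bin
    tar cSzf ~/journal.bin.tgz -C ~ journal.bin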

2. Optimistically, you may be able to get out of this situation by setting the mds_beacon_grace config option on the MDS to something very high. This will cause the MDS to continue sending beacons to the mons even when a thread is failing to yield promptly (as in this case), thereby preventing the mons from regarding the MDS as failed. Hopefully that will buy the MDS enough time to complete replay and come up, assuming it doesn't run out of memory in the process of dealing with whatever strangeness is in the journal.
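For example, something like this in ceph.conf on the MDS host before restarting it (the value is arbitrary, just pick something comfortably longer than you expect replay to take):

    [mds]
        mds beacon grace = 600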

3. If your MDS eventually makes it through recovery, unmount your client and use "ceph daemon mds.<id> flush journal" to flush and trim the journal. The next time the MDS starts, the oversized journal entries should no longer be present and startup should go smoothly.
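For example, on the MDS host (assuming the admin socket is in its default location, and substituting your MDS name for <id>):

    ceph daemon mds.<id> flush journal

If you want a quick sanity check afterwards, "cephfs-journal-tool journal inspect" should report on the journal's integrity.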

Cheers,
John