On a hunch, I shut down the compute nodes for our HPC cluster and, 10
minutes later, restarted the MDS daemon. It replayed the journal,
evicted the dead compute nodes, and is working again.

This leads me to believe there was a broken transaction of some kind
coming from the compute nodes (also all running CentOS 7.6 and using
the kernel cephfs mount). I hope there is enough logging from before
to try to track this issue down.

We are back up and running for the moment.
--
Adam



On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <[email protected]> wrote:
>
> Hello all,
>
> I've got a 31-machine Ceph cluster running Ceph 12.2.10 and CentOS 7.6.
>
> We're using cephfs and rbd.
>
> Last night, one of our two active/active MDS servers went laggy, and
> after a restart it goes laggy again as soon as it becomes active.
>
> I've got a log available here (debug_mds 20, debug_objecter 20):
> https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>
> It looks like I might not have the right log levels. Thoughts on debugging 
> this?
>
> --
> Adam
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com