Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the
27th of December.

--
Adam

On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart <mo...@ksu.edu> wrote:
>
> Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't 
> happen before then.
>
> --
> Adam
>
> On Sat, Jan 19, 2019, 20:17 Paul Emmerich <paul.emmer...@croit.io> wrote:
>>
>> Did this only start to happen after upgrading to 12.2.10?
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart <mo...@ksu.edu> wrote:
>> >
>> > It worked for about a week, and then seems to have locked up again.
>> >
>> > Here is the back trace from the threads on the mds:
>> > http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>> >
>> > --
>> > Adam
>> >
>> > On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng <uker...@gmail.com> wrote:
>> > >
>> > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart <mo...@ksu.edu> wrote:
>> > > >
>> > > > Restarting the nodes causes the hang again, which means this is
>> > > > workload-dependent and not a transient state.
>> > > >
>> > > > I believe I've tracked down what is happening. One user was running
>> > > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
>> > > > wondering whether, as the cluster was getting ready to fragment the
>> > > > directory, something freaked out, perhaps because it could not get all
>> > > > the caps back from the nodes (if that is even required).
>> > > >
>> > > > I've stopped that user's jobs for the time being, and will probably
>> > > > address it with them Monday. If it is the issue, can I tell the mds to
>> > > > pre-fragment the directory before I re-enable their jobs?
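>> > > >
>> > > > In the meantime, this is roughly what I plan to check on the active
>> > > > mds (the mds id is a placeholder, and whether 12.2.x exposes a
>> > > > 'dirfrag ls' admin-socket command is an assumption on my part, not
>> > > > something I've verified):
>> > > >
>> > > >   # split/merge thresholds the mds consults when fragmenting
>> > > >   ceph daemon mds.<id> config get mds_bal_split_size
>> > > >   ceph daemon mds.<id> config get mds_bal_fragment_size_max
>> > > >   # if available in this release, list current fragments of the
>> > > >   # busy directory
>> > > >   ceph daemon mds.<id> dirfrag ls /path/to/that/directory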
>> > > >
>> > >
>> > > The log shows the mds is in a busy loop, but doesn't show where it is.
>> > > If it happens again, please use gdb to attach to ceph-mds, then type
>> > > 'set logging on' and 'thread apply all bt' inside gdb, and send the
>> > > output to us.
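>> > >
>> > > Roughly like this (the pid lookup and the log file path are just
>> > > examples):
>> > >
>> > >   # attach to the running mds
>> > >   gdb -p $(pidof ceph-mds)
>> > >   # inside gdb: send output to a file, dump every thread's stack
>> > >   (gdb) set logging file /tmp/mds-backtrace.txt
>> > >   (gdb) set logging on
>> > >   (gdb) thread apply all bt
>> > >   (gdb) set logging off
>> > >   (gdb) detach
>> > >   (gdb) quit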
>> > >
>> > > Yan, Zheng
>> > > > --
>> > > > Adam
>> > > >
>> > > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart <mo...@ksu.edu> wrote:
>> > > > >
>> > > > > On a hunch, I shut down the compute nodes for our HPC cluster, and 10
>> > > > > minutes after that restarted the mds daemon. It replayed the journal,
>> > > > > evicted the dead compute nodes and is working again.
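>> > > > >
>> > > > > For the archives, this is roughly how the stale sessions could have
>> > > > > been inspected and evicted by hand (mds id and session id are
>> > > > > placeholders):
>> > > > >
>> > > > >   # list client sessions known to the mds, including stale ones
>> > > > >   ceph daemon mds.<id> session ls
>> > > > >   # evict a single stale client by its session id
>> > > > >   ceph tell mds.<id> client evict id=<session_id>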
>> > > > >
>> > > > > This leads me to believe there was a broken transaction of some kind
>> > > > > coming from the compute nodes (also all running CentOS 7.6 and using
>> > > > > the kernel cephfs mount). I hope there is enough logging from before
>> > > > > to try to track this issue down.
>> > > > >
>> > > > > We are back up and running for the moment.
>> > > > > --
>> > > > > Adam
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mo...@ksu.edu> wrote:
>> > > > > >
>> > > > > > Hello all,
>> > > > > >
>> > > > > > I've got a 31-machine Ceph cluster running Ceph 12.2.10 on CentOS
>> > > > > > 7.6.
>> > > > > >
>> > > > > > We're using cephfs and rbd.
>> > > > > >
>> > > > > > Last night, one of our two active/active mds servers went laggy;
>> > > > > > after a restart, as soon as it goes active it goes laggy again.
>> > > > > >
>> > > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
>> > > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>> > > > > >
>> > > > > > It looks like I might not have the right log levels. Thoughts on 
>> > > > > > debugging this?
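>> > > > > >
>> > > > > > For reference, this is roughly how I raised the levels at runtime
>> > > > > > (mds id is a placeholder; debug_ms 1 is an extra subsystem I'm
>> > > > > > guessing might help, not something anyone has asked for yet):
>> > > > > >
>> > > > > >   ceph daemon mds.<id> config set debug_mds 20
>> > > > > >   ceph daemon mds.<id> config set debug_objecter 20
>> > > > > >   ceph daemon mds.<id> config set debug_ms 1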
>> > > > > >
>> > > > > > --
>> > > > > > Adam
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
