Yes, this was a package install, ceph-debuginfo was used, and hopefully the output of the backtrace is useful.
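For anyone following along, a minimal sketch of how a symbolized backtrace can be pulled from a ceph-mds core once ceph-debuginfo is installed; the core file path below is only a placeholder (adjust for wherever your core_pattern writes dumps):

  # install debug symbols, then open the core against the ceph-mds binary
  sudo yum install -y ceph-debuginfo
  gdb /usr/bin/ceph-mds /var/core/core.ceph-mds.<pid>
  (gdb) thread apply all bt full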
I thought it was interesting that you mentioned reproducing with an ls, because aside from the large dd I ran before this issue surfaced, your post made me recall that around the same time I also ran ls a few times to drill down and eventually list the files located two subdirectories down. I also remember briefly finding it strange that I got results back so quickly, because our NetApp takes forever to do this; it was so quick that, in retrospect, the list of files may not have been complete. I regret not following up on that thought.

On Tue, Aug 11, 2015 at 1:52 AM, John Spray <[email protected]> wrote:
> On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko <[email protected]> wrote:
> > I had a dual MDS server configuration and have been copying data via the
> > cephfs kernel module to my cluster for the past 3 weeks, and just had an
> > MDS crash halting all IO. Leading up to the crash, I ran a test dd that
> > increased the throughput by about 2x and stopped it, but about 10 minutes
> > later the MDS server crashed and did not fail over to the standby
> > properly. I have been using an active/standby MDS configuration, but
> > neither of the MDS servers will stay running at this point; both crash
> > after starting them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >      health HEALTH_WARN
> >             mds cluster is degraded
> >             mds cephmds02 is laggy
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >      osdmap e324: 30 osds: 30 up, 30 in
> >             flags noscrub,nodeep-scrub
> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> >             14051 GB used, 13880 GB / 27931 GB avail
> >                 2112 active+clean
> >
> > I am not sure what information is relevant, so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objecter = 10':
> >
> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
> Ouch!  Unfortunately all we can tell from this is that we're hitting
> an assertion somewhere while loading a directory fragment from disk.
>
> As Zheng says, you'll need to drill a bit deeper.  If you were
> installing from packages you may find ceph-debuginfo useful.  In
> addition to getting us a clearer stack trace with debug symbols,
> please also crank "debug mds" up to 20 (this is massively verbose, so
> hopefully it doesn't take too long to reproduce the issue).
>
> Hopefully this is fairly straightforward to reproduce.  If it's
> something fundamentally malformed on disk then just doing a recursive
> ls on the filesystem would trigger it, at least.
>
> Cheers,
> John
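For reference, a minimal sketch of how the "debug mds = 20" logging John asked for could be enabled before restarting the MDS and reproducing with a recursive ls; the daemon name cephmds02 is taken from the thread, the mount point /mnt/cephfs is only a placeholder, and the injectargs form only applies while the daemon is actually running:

  # persistent setting, picked up when ceph-mds is restarted (in /etc/ceph/ceph.conf)
  [mds]
      debug mds = 20

  # or, on a running daemon, without a restart
  ceph tell mds.cephmds02 injectargs '--debug-mds 20'

  # then restart the MDS via your init system and try to reproduce, e.g.
  ls -lR /mnt/cephfs > /dev/null

Remember to drop the level back down afterwards, since debug mds 20 generates a very large log.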
