Yes, this was a package install, ceph-debuginfo was used, and hopefully the output of the backtrace is useful.
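For anyone following along, a minimal sketch of how a symbolized backtrace can be pulled from a ceph-mds core once ceph-debuginfo is installed; the core file path below is only a placeholder (adjust for wherever your core_pattern writes dumps):

  # install debug symbols, then open the core against the ceph-mds binary
  sudo yum install -y ceph-debuginfo
  gdb /usr/bin/ceph-mds /var/core/core.ceph-mds.<pid>
  (gdb) thread apply all bt full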
I thought it was interesting that you mentioned reproducing with an ls, because aside from the large dd I ran before this issue surfaced, your post made me recall that around the same time I also ran ls a few times to drill down and eventually list the files located two subdirectories down. I also remember briefly finding it strange that I got results back so quickly, because our NetApp takes forever to do this; it was so quick that, in retrospect, the list of files may not have been complete. I regret not following up on that thought.

On Tue, Aug 11, 2015 at 1:52 AM, John Spray <[email protected]> wrote:
> On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko <[email protected]> wrote:
> > I had a dual MDS server configuration and have been copying data via the
> > cephfs kernel module to my cluster for the past 3 weeks, and just had an
> > MDS crash halting all IO. Leading up to the crash, I ran a test dd that
> > increased the throughput by about 2x and stopped it, but about 10 minutes
> > later the MDS server crashed and did not fail over to the standby
> > properly. I have been using an active/standby MDS configuration, but
> > neither of the MDS servers will stay running at this point; both crash
> > after starting them.
> >
> > [bababurko@cephmon01 ~]$ sudo ceph -s
> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> >      health HEALTH_WARN
> >             mds cluster is degraded
> >             mds cephmds02 is laggy
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
> >      osdmap e324: 30 osds: 30 up, 30 in
> >             flags noscrub,nodeep-scrub
> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
> >             14051 GB used, 13880 GB / 27931 GB avail
> >                 2112 active+clean
> >
> > I am not sure what information is relevant, so I will try to cover what I
> > think is relevant based on posts I have read through:
> >
> > Cluster:
> > running ceph-0.94.1 on CentOS 7.1
> > [root@mdstest02 bababurko]$ uname -r
> > 3.10.0-229.el7.x86_64
> >
> > Here is my ceph-mds log with 'debug objecter = 10':
> >
> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>
> Ouch!  Unfortunately all we can tell from this is that we're hitting
> an assertion somewhere while loading a directory fragment from disk.
>
> As Zheng says, you'll need to drill a bit deeper.  If you were
> installing from packages you may find ceph-debuginfo useful.  In
> addition to getting us a clearer stack trace with debug symbols,
> please also crank "debug mds" up to 20 (this is massively verbose, so
> hopefully it doesn't take too long to reproduce the issue).
>
> Hopefully this is fairly straightforward to reproduce.  If it's
> something fundamentally malformed on disk then just doing a recursive
> ls on the filesystem would trigger it, at least.
>
> Cheers,
> John
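For reference, a minimal sketch of how the "debug mds = 20" logging John asked for could be enabled before restarting the MDS and reproducing with a recursive ls; the daemon name cephmds02 is taken from the thread, the mount point /mnt/cephfs is only a placeholder, and the injectargs form only applies while the daemon is actually running:

  # persistent setting, picked up when ceph-mds is restarted (in /etc/ceph/ceph.conf)
  [mds]
      debug mds = 20

  # or, on a running daemon, without a restart
  ceph tell mds.cephmds02 injectargs '--debug-mds 20'

  # then restart the MDS via your init system and try to reproduce, e.g.
  ls -lR /mnt/cephfs > /dev/null

Remember to drop the level back down afterwards, since debug mds 20 generates a very large log.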
