Hi.
On Mon, Mar 10, 2014 at 12:54 PM, Gregory Farnum <[email protected]> wrote:
> Hmm, at first glance it looks like you're using multiple active MDSes
> and you've created some snapshots and part of that state got corrupted
> somehow. The log files should have a slightly more helpful (including
> line numbers) stack trace at the end, and might have more context for
> what's gone wrong.
> Also, what's the output of "ceph -s"?
>
This is from ceph on the system that is OK (i.e., the one whose mds has been running fine):
[root@ip-10-16-20-11 ~]# ceph -s
   health HEALTH_WARN mds d is laggy
   monmap e9: 2 mons at {c=10.16.20.11:6789/0,d=10.16.43.12:6789/0}, election epoch 222, quorum 0,1 c,d
   osdmap e8575: 2 osds: 2 up, 2 in
    pgmap v326972: 676 pgs: 676 active+clean; 40623 MB data, 87077 MB used, 914 GB / 999 GB avail
   mdsmap e283: 1/1/1 up {0=d=up:rejoin(laggy or crashed)}
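If more detail on the mds state helps, these should be safe, read-only monitor queries (standard commands, as far as I know):

  # spell out what the HEALTH_WARN actually refers to
  ceph health detail
  # dump the full mdsmap, including which rank is stuck in rejoin
  ceph mds dump
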
I don't see anything more useful in the logs. I've pasted them here:
http://pastebin.com/n3fEaSfz
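In case it helps pin the crash down, this is roughly how I would map the faulting frame (have_past_parents_open+0x5a, i.e. 0x6be9da in the trace below) to a source line; I'm assuming a matching debuginfo package exists for 0.56.7-0.el6:

  # pull debug symbols for the exact installed ceph build (debuginfo-install comes with yum-utils)
  debuginfo-install ceph
  # ask gdb which source line the faulting address falls on
  gdb -batch -ex 'info line *0x6be9da' /usr/bin/ceph-mds
  # or the same thing with binutils
  addr2line -C -f -e /usr/bin/ceph-mds 0x6be9da
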
@Sage: argh. I just realized that we were still pinned to the bobtail-specific repo. I'll try updating now.
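Before I do, these should confirm which repo the installed package actually came from and what an update would pull in (plain yum, nothing ceph-specific):

  # exact version/release and, for the installed package, the repo it came from
  yum info ceph
  # which ceph repos are enabled on this box
  yum repolist enabled | grep -i ceph
  # every ceph version the enabled repos could offer
  yum --showduplicates list available ceph
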
> But I think you might be in some trouble from using two unstable
> features at the same time. :(
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Mar 10, 2014 at 12:24 PM, Pawel Veselov <[email protected]>
> wrote:
> > Hi.
> >
> > All of a sudden, the MDS started crashing, causing havoc on our deployment.
> > Any help would be greatly appreciated.
> >
> > ceph.x86_64 0.56.7-0.el6 @ceph
> >
> >   -1> 2014-03-10 19:16:35.956323 7f9681cb3700  1 mds.0.12 rejoin_joint_start
> >    0> 2014-03-10 19:16:35.982031 7f9681cb3700 -1 *** Caught signal (Segmentation fault) **
> > in thread 7f9681cb3700
> >
> > ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
> > 1: /usr/bin/ceph-mds() [0x813a91]
> > 2: (()+0xf8e0) [0x7f96863748e0]
> > 3: (SnapRealm::have_past_parents_open(snapid_t, snapid_t)+0x5a) [0x6be9da]
> > 4: (MDCache::check_realm_past_parents(SnapRealm*)+0x2b) [0x55fe7b]
> > 5: (MDCache::choose_lock_states_and_reconnect_caps()+0x29d) [0x567ddd]
> > 6: (MDCache::rejoin_gather_finish()+0x91) [0x59da91]
> > 7: (MDCache::rejoin_send_rejoins()+0x1b4f) [0x5a50bf]
> > 8: (MDS::rejoin_joint_start()+0x13e) [0x4a718e]
> > 9: (MDS::handle_mds_map(MMDSMap*)+0x2cda) [0x4bbf8a]
> > 10: (MDS::handle_core_message(Message*)+0x93b) [0x4bdfeb]
> > 11: (MDS::_dispatch(Message*)+0x2f) [0x4be0bf]
> > 12: (MDS::ms_dispatch(Message*)+0x19b) [0x4bfc9b]
> > 13: (DispatchQueue::entry()+0x309) [0x7e5cf9]
> > 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d607d]
> > 15: (()+0x7c6b) [0x7f968636cc6b]
> > 16: (clone()+0x6d) [0x7f968550e5ed]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> > interpret this.
> >
> > We are using stock executables from the repo, but just in case, here is what
> > I believe is the point where it crashes:
> >
> > 6be9b5:  48 29 d0              sub    %rdx,%rax
> > 6be9b8:  48 c1 f8 04           sar    $0x4,%rax
> > 6be9bc:  48 83 f8 04           cmp    $0x4,%rax
> > 6be9c0:  0f 86 81 02 00 00     jbe    6bec47 <_ZN9SnapRealm22have_past_parents_openE8snapid_tS0_+0x2c7>
> > 6be9c6:  83 7a 44 09           cmpl   $0x9,0x44(%rdx)
> > 6be9ca:  0f 8f 83 04 00 00     jg     6bee53 <_ZN9SnapRealm22have_past_parents_openE8snapid_tS0_+0x4d3>
> > 6be9d0:  83 7a 40 09           cmpl   $0x9,0x40(%rdx)
> > 6be9d4:  0f 8f 79 04 00 00     jg     6bee53 <_ZN9SnapRealm22have_past_parents_openE8snapid_tS0_+0x4d3>
> > 6be9da:  41 80 bc 24 98 00 00  cmpb   $0x0,0x98(%r12)
> > 6be9e1:  00 00
> > 6be9e3:  b8 01 00 00 00        mov    $0x1,%eax
> > 6be9e8:  0f 85 51 01 00 00     jne    6beb3f <_ZN9SnapRealm22have_past_parents_openE8snapid_tS0_+0x1bf>
> >
> >
>
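One more observation, with the caveat that this is just my reading of it: the faulting instruction is a one-byte compare against offset 0x98 of whatever %r12 points at, which inside this member function is presumably the SnapRealm itself, so it looks more like a null or dangling realm pointer than bad data inside the realm. With the debuginfo package installed, something like this should show which member actually sits at that offset (pahole is in the dwarves package; the .debug path is the usual el6 layout, adjust if yours differs):

  # print SnapRealm's layout, with per-member byte offsets, from the DWARF info
  pahole -C SnapRealm /usr/lib/debug/usr/bin/ceph-mds.debug
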
--
With best of best regards
Pawel S. Veselov
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com