Hello,

We have recently had some failures with our MDS processes. We are running
Jewel 10.2.1. The two MDS services are on dedicated hosts running in
active/standby on Ubuntu 14.04.3 with kernel 3.19.0-56-generic. I have
searched the mailing list and open tickets without much luck so far.

The first indication of a problem is:

mds/Locker.cc: In function 'bool Locker::check_inode_max_size(CInode*,
bool, bool, uint64_t, bool, uint64_t, utime_t)' thread 7fc305b83700 time
2016-08-09 18:51:50.626630
mds/Locker.cc: 2190: FAILED assert(in->is_file())

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x563d1e0a2d3b]
 2: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long, bool,
unsigned long, utime_t)+0x15e3) [0x563d1de506a3]
 3: (Server::handle_client_open(std::shared_ptr<MDRequestImpl>&)+0x1061)
[0x563d1dd386a1]
 4:
(Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xa0b)
[0x563d1dd5709b]
 5: (Server::handle_client_request(MClientRequest*)+0x47f) [0x563d1dd5768f]
 6: (Server::dispatch(Message*)+0x3bb) [0x563d1dd5b8db]
 7: (MDSRank::handle_deferrable_message(Message*)+0x80c) [0x563d1dce1f8c]
 8: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x563d1dceb081]
 9: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x563d1dcec1d5]
 10: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x563d1dcd3f83]
 11: (DispatchQueue::entry()+0x78b) [0x563d1e1996cb]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563d1e08862d]
 13: (()+0x8184) [0x7fc30bd7c184]
 14: (clone()+0x6d) [0x7fc30a2d337d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
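
For context, the failed assertion is an entry guard in Locker::check_inode_max_size(): that path only makes sense for regular-file inodes, so it asserts in->is_file() before doing anything else. A minimal sketch of the pattern, with hypothetical stand-in types (not the actual Ceph source):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Ceph's CInode, just enough to show the guard.
struct CInode {
    enum class Type { File, Dir, Symlink };
    Type type;
    bool is_file() const { return type == Type::File; }
};

// Sketch of the guard that fires at mds/Locker.cc:2190: if a client open
// request reaches this path with an inode the MDS does not consider a
// regular file, the assert aborts the daemon, as seen in the trace above.
bool check_inode_max_size(CInode* in, uint64_t /*new_max_size*/) {
    assert(in->is_file());  // FAILED assert(in->is_file()) in our logs
    return true;            // the real function would update max_size here
}
```

So, as far as I can tell, handle_client_open() handed this function an inode the MDS believed was not a regular file, which (given that a reboot cleared it) may point at stale or corrupt in-memory state rather than on-disk damage.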

...

I snipped the dump of recent events, but can certainly include them if it
would help in debugging.

...

Upstart then attempts to restart the process; the logs from this are here:
https://gist.github.com/anonymous/256bd6e886421840d151890e0205766d

It looks to me like it goes through the replay -> reconnect -> rejoin ->
active sequence successfully and then crashes with the same assertion
failure immediately after becoming active. Upstart keeps trying to restart
it until it hits the maximum number of respawn attempts. At that point the
standby takes over and goes through the same loop. Restarting manually
produced the same crash on both hosts. This continued for several cycles
until I rebooted the physical host running the active MDS, after which the
daemon started without issue. After rebooting the standby host, it too was
able to start successfully.

Looking at metrics for the MDS hosts and the Ceph cluster in general,
nothing was out of place or abnormal: CPU, memory, network, and disk were
all within normal bounds. Other than the failing MDS processes, the cluster
was healthy, with no slow requests or failed OSDs.

Any thoughts on what might be causing this issue? Is there any further
information I can provide to help debug this?

Thanks in advance.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com