Hi all,

I am a new user on this list. I have a legacy production system running ceph 
version 0.94.7

Ceph itself appears to be functioning well; ceph -s is reporting good health.
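For reference, this is roughly what I am checking (ceph -s is the one I mentioned above; I can post full output from any of these if it would help):

ceph -s
ceph health detail
ceph mds stat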

I am connecting to the filesystem via an HDFS client. Upon connection I see the 
client receiving messages like the following (I've snipped, since this goes on for a while):


hadoop fs -ls /


2019-03-06 17:30:02.444021 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> 
YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=2 pgs=25 cs=1 l=0 
c=0x7fc3a8e78ac0).fault, initiating reconnect

2019-03-06 17:30:02.444224 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> 
YY.YY.YY.YY:6801/1597 pipe(0x7fc3a8e74820 sd=191 :47936 s=1 pgs=25 cs=2 l=0 
c=0x7fc3a8e78ac0).fault

2019-03-06 17:34:54.283031 7fc38c1b3700  0 -- XX.XX.XX.XX:0/3968222009 >> 
YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8125f30 sd=191 :51405 s=1 pgs=18 cs=2 l=0 
c=0x7fc3a812a1d0).connect got RESETSESSION

2019-03-06 17:34:54.283053 7fc383157700  0 client.2155885101 
ms_handle_remote_reset on YY.YY.YY.YY:6800/2651

2019-03-06 17:34:54.412070 7fc381153700  0 -- XX.XX.XX.XX:0/3968222009 >> 
YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=192 :51406 s=2 pgs=19 cs=1 l=0 
c=0x7fc3a8e747c0).fault, initiating reconnect

2019-03-06 17:34:54.412363 7fc381052700  0 -- XX.XX.XX.XX:0/3968222009 >> 
YY.YY.YY.YY:6800/2651 pipe(0x7fc3a8a83780 sd=191 :51406 s=1 pgs=19 cs=2 l=0 
c=0x7fc3a8e747c0).fault

ls: Connection timed out


Which makes sense, because the MDS crashes like this (it goes from active to 
reconnect state, which I guess explains the change in PIDs that the client is 
seeing):


 *** Caught signal (Segmentation fault) **

 in thread 7f04b6b12700



 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)

 1: ceph_mds() [0x89982a]

 2: (()+0x10350) [0x7f04baecc350]

 3: (CInode::get_caps_allowed_for_client(client_t) const+0x130) [0x7a19f0]

 4: (CInode::encode_inodestat(ceph::buffer::list&, Session*, SnapRealm*, 
snapid_t, unsigned int, int)+0x132d) [0x7b383d]

 5: (Server::set_trace_dist(Session*, MClientReply*, CInode*, CDentry*, 
snapid_t, int, std::tr1::shared_ptr<MDRequestImpl>&)+0x471) [0x5f26e1]

 6: (Server::reply_client_request(std::tr1::shared_ptr<MDRequestImpl>&, 
MClientReply*)+0x846) [0x611056]

 7: (Server::respond_to_request(std::tr1::shared_ptr<MDRequestImpl>&, 
int)+0x4d9) [0x611759]

 8: (Server::handle_client_getattr(std::tr1::shared_ptr<MDRequestImpl>&, 
bool)+0x47b) [0x613eab]

 9: 
(Server::dispatch_client_request(std::tr1::shared_ptr<MDRequestImpl>&)+0xa38) 
[0x633da8]

 10: (Server::handle_client_request(MClientRequest*)+0x3df) [0x63435f]

 11: (Server::dispatch(Message*)+0x3f3) [0x63b8b3]

 12: (MDS::handle_deferrable_message(Message*)+0x847) [0x5b6c27]

 13: (MDS::_dispatch(Message*)+0x6d) [0x5d2bed]

 14: (MDS::ms_dispatch(Message*)+0xa2) [0x5d3f72]

 15: (DispatchQueue::entry()+0x63a) [0xa7482a]

 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x97403d]

 17: (()+0x8192) [0x7f04baec4192]

 18: (clone()+0x6d) [0x7f04ba3d126d]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
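In case a symbolised trace is more useful, I can run the suggested objdump against the binary, or resolve the individual frame addresses, roughly like this (the path is just where ceph-mds lives on my hosts, and I'm assuming the matching debug symbols are installed):

objdump -rdS /usr/bin/ceph-mds > ceph-mds.dis
addr2line -Cfe /usr/bin/ceph-mds 0x7a19f0 0x7b383d 0x5f26e1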


So far I have gone through the whole recovery procedure here:
http://docs.ceph.com/docs/hammer/cephfs/disaster-recovery/

I've reset the journal, the session table, and the fs; everything looks good (the 
journal export core-dumps, but all other status checks report healthy).
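To be precise about what I ran from that page, it was essentially this sequence (the fs name is a placeholder here):

cephfs-journal-tool journal export backup.bin    # this is the export that core-dumps
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph fs reset <fs name> --yes-i-really-mean-it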

I'm hoping for suggestions on what else could be causing this, or what else I can 
try resetting. The only next step left to me would be to remove the filesystem 
entirely, so I'm willing to try any suggestion.



