I’m running a production Ceph 0.94.7 cluster, and have been seeing a periodic
issue in which all of my MDS clients become stuck. The only fix so far
has been to restart the active MDS (sometimes I need to restart the subsequent
active MDS as well).
These clients are using the cephfs-hadoop API, so there is no kernel client or
FUSE client involved. When clients get stuck, messages like the following are
printed to stderr:
2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >>
192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
I’m at somewhat of a loss on where to begin debugging this issue, and wanted to
ping the list for ideas.
I managed to dump the mds cache during one of the stalled moments, which
hopefully is a useful starting point: