Perhaps related, I was watching the active mds with debug_mds set to 5/5, when
I saw this in the log:
2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e8420).fault with nothing to send, going to standby
2016-09-21 15:13:26.067725 7fbb0098e700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.204:0/2963585795 pipe(0x3bf1000 sd=55 :6802 s=2 pgs=2 cs=1 l=0 c=0x15c29020).fault with nothing to send, going to standby
2016-09-21 15:13:26.067743 7fbb026ab700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.192:0/4235516229 pipe(0x562b000 sd=83 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e91e0).fault, server, going to standby
2016-09-21 15:13:26.067749 7fbae840a700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.214:0/3290400005 pipe(0x2a38a000 sd=74 :6802 s=2 pgs=2 cs=1 l=0 c=0x13b6c160).fault with nothing to send, going to standby
2016-09-21 15:13:26.067783 7fbadb239700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.141:0/229472938 pipe(0x268d2000 sd=87 :6802 s=2 pgs=2 cs=1 l=0 c=0x28e24f20).fault with nothing to send, going to standby
2016-09-21 15:13:26.067803 7fbafe66b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.193:0/2637929639 pipe(0x29582000 sd=80 :6802 s=2 pgs=2 cs=1 l=0 c=0x237e9760).fault with nothing to send, going to standby
2016-09-21 15:13:26.067876 7fbb01a9f700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.228:0/581679898 pipe(0x2384f000 sd=103 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92f5a0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067886 7fbb01ca1700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.145:0/586636299 pipe(0x25806000 sd=101 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cc60).fault with nothing to send, going to standby
2016-09-21 15:13:26.067865 7fbaf43c9700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.234:0/3131612847 pipe(0x2fbe5000 sd=120 :6802 s=2 pgs=2 cs=1 l=0 c=0x37c902c0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067910 7fbaf4ed4700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.236:0/650394434 pipe(0x2fbe0000 sd=116 :6802 s=2 pgs=2 cs=1 l=0 c=0x56a5440).fault with nothing to send, going to standby
2016-09-21 15:13:26.067911 7fbb01196700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0 c=0x2f92cf20).fault with nothing to send, going to standby
2016-09-21 15:13:26.068076 7fbafc64b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f7a100).fault with nothing to send, going to standby
2016-09-21 15:13:26.068095 7fbafff84700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0 c=0x41bc7e0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068108 7fbb0de0e700 5 mds.0.953 handle_mds_map epoch 8471 from mon.3
2016-09-21 15:13:26.068114 7fbaf890e700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/1422203298 pipe(0x29630000 sd=44 :6802 s=2 pgs=2 cs=1 l=0 c=0x3a740dc0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068143 7fbae860c700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0 c=0x31f79e40).fault with nothing to send, going to standby
2016-09-21 15:13:26.068190 7fbb040c5700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.218:0/3945638891 pipe(0x50c0000 sd=53 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f4420).fault with nothing to send, going to standby
2016-09-21 15:13:26.068200 7fbaf961b700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fa840).fault with nothing to send, going to standby
2016-09-21 15:13:26.068232 7fbaf981d700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0 c=0x56f6940).fault with nothing to send, going to standby
2016-09-21 15:13:26.068253 7fbaeac32700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0 c=0x286fb760).fault with nothing to send, going to standby
2016-09-21 15:13:26.068275 7fbb0de0e700 1 mds.-1.-1 handle_mds_map i (192.168.1.196:6802/13581) dne in the mdsmap, respawning myself
2016-09-21 15:13:26.068289 7fbb0de0e700 1 mds.-1.-1 respawn
2016-09-21 15:13:26.068294 7fbb0de0e700 1 mds.-1.-1 e: 'ceph-mds'
2016-09-21 15:13:26.173095 7f689baa8780 0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581
2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors {default=true}
2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby
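
For reference, the debug level can be raised on a running MDS without a restart; a minimal sketch (the daemon id mds.0 is a placeholder for whichever MDS is active, and exact syntax can differ slightly between releases):

# raise MDS logging to 5/5 through the monitors
ceph tell mds.0 injectargs '--debug-mds 5/5'
# or locally on the MDS host, via its admin socket
ceph daemon mds.0 config set debug_mds "5/5"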
On 9/21/16, 10:48 AM, "Heller, Chris" <[email protected]> wrote:
I’ll see if I can capture the output the next time this issue arises, but
in general the output looks as if nothing is wrong. No OSDs are down, ‘ceph
health detail’ reports HEALTH_OK, and the MDS is in the up:active state; it’s
as if nothing is wrong server side (at least from the summary).
-Chris
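
One way to grab that summary in a single pass the next time the hang occurs; a minimal sketch assuming the standard ceph CLI on an admin node (the output file name is only an example):

# snapshot cluster health and MDS state while clients are stuck
{ date; ceph -s; ceph health detail; ceph mds stat; } > mds-stuck-$(date +%s).txt 2>&1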
On 9/21/16, 10:46 AM, "Gregory Farnum" <[email protected]> wrote:
On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <[email protected]>
wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
> These clients are using the cephfs-hadoop API, so there is no kernel client
> or FUSE API involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 c=0x7feaa0a0c500).fault
>
> I’m at somewhat of a loss on where to begin debugging this issue, and wanted
> to ping the list for ideas.
What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?
http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some
as well.
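
For the admin socket queries mentioned above, something along these lines should work when run on the host carrying the active MDS (the daemon id mds.0 is a placeholder, and command availability can vary a little by release):

# operations the MDS currently has in flight
ceph daemon mds.0 dump_ops_in_flight
# requests the MDS itself has outstanding against the OSDs
ceph daemon mds.0 objecter_requests

If the cephfs-hadoop clients are started with an "admin socket = /path/to/$name.$pid.asok" entry in their ceph.conf, a similar query (e.g. ceph --admin-daemon <path> mds_requests) can be pointed at a stuck client as well; worth verifying on your build.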
>
> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
> -Chris
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com