OK, I just ran into this issue again. The MDS rolled after many clients were
failing to relieve cache pressure.
Now here is the result of `ceph -s`:
# ceph -s
cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
health HEALTH_OK
monmap e1: 5 mons at
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3
up:standby
osdmap e118886: 192 osds: 192 up, 192 in
pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
69601 GB used, 37656 GB / 104 TB avail
11309 active+clean
13 active+clean+scrubbing
6 active+clean+scrubbing+deep
And here are the ops in flight:
# ceph daemon mds.a190 dump_ops_in_flight
{
"ops": [],
"num_ops": 0
}
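Nothing in flight on the MDS side. If it were stuck waiting on RADOS I would
expect that to show up via the objecter, so next time I'll also check that
(objecter_requests is another admin socket command, which I'm assuming hammer
still ships):
# ceph daemon mds.a190 objecter_requests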
And a tail of the active mds log at debug_mds 5/5
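(I raised the level on the running daemon with injectargs; on hammer,
something like
# ceph tell mds.a190 injectargs '--debug_mds 5/5'
should do it.)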
2016-09-21 20:15:53.354226 7fce3b626700 4 mds.0.server handle_client_request
client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21
20:15:53.352390) v2
2016-09-21 20:15:53.354234 7fce3b626700 5 mds.0.server session
closed|closing|killing, dropping
2016-09-21 20:15:54.867108 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 235) v1 from client.507429717
2016-09-21 20:15:54.980907 7fce3851f700 2 mds.0.cache check_memory_usage total
1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers
0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
2016-09-21 20:15:54.980960 7fce3851f700 5 mds.0.bal mds.0 epoch 38 load
mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
2016-09-21 20:15:55.247885 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 233) v1 from client.538555196
2016-09-21 20:15:55.455566 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 365) v1 from client.507390467
2016-09-21 20:15:55.807704 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 367) v1 from client.538485341
2016-09-21 20:15:56.243462 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 189) v1 from client.538577596
2016-09-21 20:15:56.986901 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 232) v1 from client.507430372
2016-09-21 20:15:57.026206 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.491885158
2016-09-21 20:15:57.369281 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390682
2016-09-21 20:15:57.445687 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.538485996
2016-09-21 20:15:57.579268 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.538486021
2016-09-21 20:15:57.595568 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390702
2016-09-21 20:15:57.604356 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390712
2016-09-21 20:15:57.693546 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390717
2016-09-21 20:15:57.819536 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.491885168
2016-09-21 20:15:57.894058 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390732
2016-09-21 20:15:57.983329 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.507390742
2016-09-21 20:15:58.077915 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.538486031
2016-09-21 20:15:58.141710 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.491885178
2016-09-21 20:15:58.159134 7fce3b626700 3 mds.0.server handle_client_session
client_session(request_renewcaps seq 364) v1 from client.491885188
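Note the "session closed|closing|killing, dropping" line near the top: the MDS
is throwing away the lookup from client.585124080 because it considers that
session dead, while the clients keep sending renewcaps, and check_memory_usage
reports 0 / 93392 inodes have caps. So it looks as if the clients were evicted
over the cache-pressure problem and never re-established their sessions, which
would explain the hang. Next time I'll dump the session table to confirm
(session ls should exist on the hammer MDS admin socket):
# ceph daemon mds.a190 session ls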
-Chris
On 9/21/16, 11:23 AM, "Heller, Chris" <[email protected]> wrote:
Perhaps related, I was watching the active mds with debug_mds set to 5/5,
when I saw this in the log:
2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0
c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0
c=0x237e8420).fault with nothing to send, going to standby
2016-09-21 15:13:26.067725 7fbb0098e700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.204:0/2963585795 pipe(0x3bf1000 sd=55 :6802 s=2 pgs=2 cs=1 l=0
c=0x15c29020).fault with nothing to send, going to standby
2016-09-21 15:13:26.067743 7fbb026ab700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.192:0/4235516229 pipe(0x562b000 sd=83 :6802 s=2 pgs=2 cs=1 l=0
c=0x237e91e0).fault, server, going to standby
2016-09-21 15:13:26.067749 7fbae840a700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.214:0/3290400005 pipe(0x2a38a000 sd=74 :6802 s=2 pgs=2 cs=1 l=0
c=0x13b6c160).fault with nothing to send, going to standby
2016-09-21 15:13:26.067783 7fbadb239700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.141:0/229472938 pipe(0x268d2000 sd=87 :6802 s=2 pgs=2 cs=1 l=0
c=0x28e24f20).fault with nothing to send, going to standby
2016-09-21 15:13:26.067803 7fbafe66b700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.193:0/2637929639 pipe(0x29582000 sd=80 :6802 s=2 pgs=2 cs=1 l=0
c=0x237e9760).fault with nothing to send, going to standby
2016-09-21 15:13:26.067876 7fbb01a9f700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.228:0/581679898 pipe(0x2384f000 sd=103 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92f5a0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067886 7fbb01ca1700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.145:0/586636299 pipe(0x25806000 sd=101 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92cc60).fault with nothing to send, going to standby
2016-09-21 15:13:26.067865 7fbaf43c9700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.234:0/3131612847 pipe(0x2fbe5000 sd=120 :6802 s=2 pgs=2 cs=1 l=0
c=0x37c902c0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067910 7fbaf4ed4700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.236:0/650394434 pipe(0x2fbe0000 sd=116 :6802 s=2 pgs=2 cs=1 l=0
c=0x56a5440).fault with nothing to send, going to standby
2016-09-21 15:13:26.067911 7fbb01196700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92cf20).fault with nothing to send, going to standby
2016-09-21 15:13:26.068076 7fbafc64b700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0
c=0x31f7a100).fault with nothing to send, going to standby
2016-09-21 15:13:26.068095 7fbafff84700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.140:0/1112150414 pipe(0x5679000 sd=125 :6802 s=2 pgs=2 cs=1 l=0
c=0x41bc7e0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068108 7fbb0de0e700 5 mds.0.953 handle_mds_map epoch
8471 from mon.3
2016-09-21 15:13:26.068114 7fbaf890e700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.238:0/1422203298 pipe(0x29630000 sd=44 :6802 s=2 pgs=2 cs=1 l=0
c=0x3a740dc0).fault with nothing to send, going to standby
2016-09-21 15:13:26.068143 7fbae860c700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.217:0/1120082018 pipe(0x2a724000 sd=121 :6802 s=2 pgs=2 cs=1 l=0
c=0x31f79e40).fault with nothing to send, going to standby
2016-09-21 15:13:26.068190 7fbb040c5700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.218:0/3945638891 pipe(0x50c0000 sd=53 :6802 s=2 pgs=2 cs=1 l=0
c=0x56f4420).fault with nothing to send, going to standby
2016-09-21 15:13:26.068200 7fbaf961b700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.144:0/2952053583 pipe(0x318dc000 sd=81 :6802 s=2 pgs=2 cs=1 l=0
c=0x286fa840).fault with nothing to send, going to standby
2016-09-21 15:13:26.068232 7fbaf981d700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.159:0/1872775873 pipe(0x268d7000 sd=38 :6802 s=2 pgs=2 cs=1 l=0
c=0x56f6940).fault with nothing to send, going to standby
2016-09-21 15:13:26.068253 7fbaeac32700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.186:0/4141441999 pipe(0x54e7000 sd=86 :6802 s=2 pgs=2 cs=1 l=0
c=0x286fb760).fault with nothing to send, going to standby
2016-09-21 15:13:26.068275 7fbb0de0e700 1 mds.-1.-1 handle_mds_map i
(192.168.1.196:6802/13581) dne in the mdsmap, respawning myself
2016-09-21 15:13:26.068289 7fbb0de0e700 1 mds.-1.-1 respawn
2016-09-21 15:13:26.068294 7fbb0de0e700 1 mds.-1.-1 e: 'ceph-mds'
2016-09-21 15:13:26.173095 7f689baa8780 0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mds, pid 13581
2016-09-21 15:13:26.175664 7f689baa8780 -1 mds.-1.0 log_to_monitors
{default=true}
2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby
2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby
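So the active MDS discovered it was no longer in the mdsmap ("dne in the
mdsmap") and respawned itself. As I understand it, that normally means the
monitors stopped receiving its beacons for longer than the grace period, so
the knobs worth checking are mds_beacon_interval and mds_beacon_grace, e.g.
(daemon name is a placeholder):
# ceph daemon mds.<name> config get mds_beacon_grace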
On 9/21/16, 10:48 AM, "Heller, Chris" <[email protected]> wrote:
I’ll see if I can capture the output the next time this issue arises,
but in general the output looks as if nothing is wrong: no OSDs are down,
‘ceph health detail’ reports HEALTH_OK, and the MDS is in the up:active
state; it’s as if nothing is wrong server side (at least from the
summary).
-Chris
On 9/21/16, 10:46 AM, "Gregory Farnum" <[email protected]> wrote:
On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <[email protected]>
wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
>
>
> These clients are using the cephfs-hadoop API, so there is no kernel client
> or FUSE API involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
>
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700 0 -- 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
>
>
> I’m at somewhat of a loss on where to begin debugging this issue, and wanted
> to ping the list for ideas.
What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?
http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help
some as well.
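For the client side there's no kernel client to poke here, but if your
cephfs-hadoop processes were started with an admin socket configured (e.g.
"admin socket = /var/run/ceph/$name.$pid.asok" in their ceph.conf) you should
be able to dump their in-flight MDS requests with something like (the socket
path is just an example):
# ceph --admin-daemon /var/run/ceph/client.hadoop.12345.asok mds_requests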
>
>
>
> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
>
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
>
>
>
>
> -Chris
>
>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com