Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-06-12 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> Hi Stefan,
> 
> Sorry I couldn't get back to you sooner.

NP.

> Looks like you hit the infinite loop bug in OpTracker. It was fixed in
> 12.2.11: https://tracker.ceph.com/issues/37977
> 
> The problem was introduced in 12.2.8.

We stayed on 12.2.8 for quite a while (because of the issues with 12.2.9 and
the uncertainty around 12.2.10) ... We upgraded to 12.2.11 at the end of
February, after which we stopped seeing crashes ... so it does correlate with
the upgrade. So yeah, probably this bug then.

Thanks,

Stefan


-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-05-31 Thread Patrick Donnelly
Hi Stefan,

Sorry I couldn't get back to you sooner.

On Mon, May 27, 2019 at 5:02 AM Stefan Kooman  wrote:
>
> Quoting Stefan Kooman (ste...@bit.nl):
> > Hi Patrick,
> >
> > Quoting Stefan Kooman (ste...@bit.nl):
> > > Quoting Stefan Kooman (ste...@bit.nl):
> > > > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > >
> > > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > > the standby MDS could take over and would not get into a deadlock
> > > itself. We made gdb traces again, which you can find over here:
> > >
> > > https://8n1.org/14011/d444
> >
> > We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> > found the time to look into the backtraces / gdb dumps?
>
> We have not seen this issue anymore for the past three months. We have
> updated the cluster to 12.2.11 in the meantime, but not sure if that is
> related. Hopefully it stays away.

Looks like you hit the infinite loop bug in OpTracker. It was fixed in
12.2.11: https://tracker.ceph.com/issues/37977

The problem was introduced in 12.2.8.
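
If you want to double-check which daemons are still on an affected
release, something like this works (a sketch, assuming Luminous, where
`ceph versions` is available; the MDS id is just an example):

  # Summarize the running versions of all daemons in the cluster:
  ceph versions

  # Or ask a specific MDS directly via its admin socket (run on the MDS host):
  ceph daemon mds.mds1 version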

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-05-27 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Hi Patrick,
> 
> Quoting Stefan Kooman (ste...@bit.nl):
> > Quoting Stefan Kooman (ste...@bit.nl):
> > > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > 
> > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > the standby MDS could take over and would not get into a deadlock
> > itself. We made gdb traces again, which you can find over here:
> > 
> > https://8n1.org/14011/d444
> 
> We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> found the time to look into the backtraces / gdb dumps?

We have not seen this issue anymore for the past three months. We have
updated the cluster to 12.2.11 in the meantime, but not sure if that is
related. Hopefully it stays away.

FYI,

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-01-16 Thread Stefan Kooman
Hi Patrick,

Quoting Stefan Kooman (ste...@bit.nl):
> Quoting Stefan Kooman (ste...@bit.nl):
> > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > somewhere it's not even outputting any log messages. If possible, it'd
> > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > if you're comfortable with gdb, a backtrace of any threads that look
> > > suspicious (e.g. not waiting on a futex) including `info threads`.
> 
> Today the issue reappeared (after being absent for ~ 3 weeks). This time
> the standby MDS could take over and would not get into a deadlock
> itself. We made gdb traces again, which you can find over here:
> 
> https://8n1.org/14011/d444

We are still seeing these crashes occur ~ every 3 weeks or so. Have you
found the time to look into the backtraces / gdb dumps?

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-11-15 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Quoting Patrick Donnelly (pdonn...@redhat.com):
> > Thanks for the detailed notes. It looks like the MDS is stuck
> > somewhere it's not even outputting any log messages. If possible, it'd
> > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > if you're comfortable with gdb, a backtrace of any threads that look
> > suspicious (e.g. not waiting on a futex) including `info threads`.

Today the issue reappeared (after being absent for ~ 3 weeks). This time
the standby MDS could take over and would not get into a deadlock
itself. We made gdb traces again, which you can find over here:

https://8n1.org/14011/d444

It would be great if someone could figure out what's causing this issue.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-23 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> Thanks for the detailed notes. It looks like the MDS is stuck
> somewhere it's not even outputting any log messages. If possible, it'd
> be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> if you're comfortable with gdb, a backtrace of any threads that look
> suspicious (e.g. not waiting on a futex) including `info threads`.

It took a while before the same issue reappeared ... but we managed to
catch gdb backtraces and strace output. See the pastebin links below.
Note: we had difficulty getting the MDSs working again, so we had to
restart them a couple of times, capturing as much debug output as we
could. Hopefully you can squeeze some useful information out of this data.

MDS1:
https://8n1.org/13869/bc3b - A few minutes after it first started
acting up
https://8n1.org/13870/caf4 - Probably made when I tried to stop the
process and it took too long (the process had already received SIGKILL)
https://8n1.org/13871/2f22 - After restarting the same issue returned
https://8n1.org/13872/2246 - After restarting the same issue returned

MDS2:
https://8n1.org/13873/f861 - After it went craycray when it became
active
https://8n1.org/13874/c567 - After restarting the same issue returned
https://8n1.org/13875/133a - After restarting the same issue returned

STRACES:
MDS1: https://8n1.org/mds1-strace.zip
MDS2: https://8n1.org/mds2-strace.zip
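
For anyone who wants to capture something similar: a rough sketch of the
kind of invocation that produces such traces (it assumes a single
ceph-mds process per host; adjust ids and paths to your setup):

  # Attach to the running MDS, follow all threads, timestamp every syscall
  # and record how long each call took; stop with Ctrl-C when done:
  strace -f -tt -T -p "$(pidof ceph-mds)" -o mds1-strace.txt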

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-08 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman  wrote:
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging. I skipped to the end of the log file, just before the
> "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active 
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping 
> beacon, heartbeat map not healthy
>
> As far as I can see, just normal behaviour.
>
> The big question is: what is happening when the MDS starts logging the
> heartbeat_map messages?
> Why does it log "heartbeat_map is_healthy", just to log 0.04 seconds later
> that it's not healthy?
>
> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU, RAM
> available. Metrics all look normal up to the moment things go into a deadlock
> (so it seems).

Thanks for the detailed notes. It looks like the MDS is stuck
somewhere it's not even outputting any log messages. If possible, it'd
be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
if you're comfortable with gdb, a backtrace of any threads that look
suspicious (e.g. not waiting on a futex) including `info threads`.
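
For example, a rough sketch (assuming gdb and the ceph debug symbols are
installed, and a single ceph-mds process on the host; adjust to your setup):

  # Grab 'info threads' plus a backtrace of every thread from the running
  # MDS; the process is only paused briefly and resumes when gdb detaches:
  gdb -batch -p "$(pidof ceph-mds)" \
      -ex 'set pagination off' \
      -ex 'info threads' \
      -ex 'thread apply all bt' > mds-backtraces.txt 2>&1

  # Or force a coredump instead (note: this terminates the daemon):
  kill -QUIT "$(pidof ceph-mds)"

The resulting core can then be inspected offline with gdb against the
ceph-mds binary.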
-- 
Patrick Donnelly


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-08 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
 
> > From what you've described here, it's most likely that the MDS is trying to
> > read something out of RADOS which is taking a long time, and which we
> > didn't expect to cause a slow down. You can check via the admin socket to
> > see if there are outstanding Objecter requests or ops_in_flight to get a
> > clue.

I double-checked the load of the OSDs, MONs and MDSs, but that all looks
normal. I would expect "slow requests" first, before hitting a timeout of
some sort.
What would cause an MDS to spin and consume 100% CPU if a request were
merely slow?
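
In the meantime I can at least try to pin down which thread is busy;
a sketch of what I have in mind (assuming a single ceph-mds process on
the host):

  # Per-thread CPU usage of the MDS; the PID column is the LWP id of each
  # thread, so a thread pegged at ~100% stands out immediately:
  top -H -p "$(pidof ceph-mds)"

  # The same information, non-interactively:
  ps -L -o lwp,pcpu,comm -p "$(pidof ceph-mds)"

The LWP of the busy thread can then be matched against gdb's `info threads`
output to see where it is spinning.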

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-05 Thread Stefan Kooman
Quoting Gregory Farnum (gfar...@redhat.com):
> 
> Ah, there's a misunderstanding here — the output isn't terribly clear.
> "is_healthy" is the name of a *function* in the source code. The line
> 
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 
> is telling you that the heartbeat_map's is_healthy function is running, and
> it finds that "'MDSRank' had timed out after 15 [seconds]". So the thread
> MDSRank is *not* healthy, it didn't check in for 15 seconds! Therefore the
> MDS beacon code decides not to send a beacon, because it thinks the MDS
> might be stuck.

Thanks for the explanation.

> From what you've described here, it's most likely that the MDS is trying to
> read something out of RADOS which is taking a long time, and which we
> didn't expect to cause a slow down. You can check via the admin socket to
> see if there are outstanding Objecter requests or ops_in_flight to get a
> clue.

Hmm, I avoided that because of this issue [1]. Killing the MDS while
debugging why it's hanging would defeat the purpose ;-).

I might check for "Objecter requests".
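
Something along these lines, via the admin socket of the active MDS
(a sketch; it assumes the id is mds2 and the admin socket is in its
default location):

  # Outstanding RADOS operations issued by the MDS's Objecter:
  ceph daemon mds.mds2 objecter_requests

  # MDS operations currently in flight:
  ceph daemon mds.mds2 dump_ops_in_flight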

Thanks!

Gr. Stefan

[1]: http://tracker.ceph.com/issues/26894

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-05 Thread Gregory Farnum
On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman  wrote:

> Dear list,
>
> Today we hit our first Ceph MDS issue. Out of the blue the active MDS
> stopped working:
>
> mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0
> with standby
> daemon mds.mds2.
>
> Logging of ceph-mds1:
>
> 2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status
> (starting...)
> 2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status
> (complete)
>
> ^^ one of our monitoring health checks performing a "ceph daemon mds.mds1
> version", business as usual.
>
> 2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping
> beacon, heartbeat map not healthy
>
> ^^ the unresponsive mds1 consumes 100% CPU and keeps on logging the above
> heartbeat_map messages.
>
> In the meantime ceph-mds2 has transitioned from "standby-replay" to
> "active":
>
> mon.mon1 [INF] daemon mds.mds2 is now active in filesystem
> BITED-153874-cephfs as rank 0
>
> Logging:
>
> replays, final replay as standby, reopen log
>
> 2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
> 2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 143 from mon.3
> 2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
> 2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
> 2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino
> 0x1cd95e9 err -5/0
> 2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
> 2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 144 from mon.3
> 2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:rejoin --> up:clientreplay
> 2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done --
> successful recovery!
> 2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
> 2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
> 2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 145 from mon.3
> 2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:clientreplay --> up:active
> 2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
> 2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.
>
> And then it _also_ starts to log heartbeat_map messages (and consumes 100%
> CPU):
>
> (and then these messages keep repeating themselves at 100% CPU)
> 2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping
> beacon, heartbeat map not healthy
>
> At that point in time there is one active MDS according to ceph, but in
> reality it is not functioning correctly (not serving clients, at least).
>
> ... we stopped both daemons. Restarted one ... recovery ...
> works for half a minute ... then starts logging heartbeat_map messages.
> Restart again ... works for a little while ... starts logging
> heartbeat_map messages again. We restart the mds with debug_mds=20 ...
> it keeps working fine. The other mds gets restarted and keeps on
> working. We do a couple of failover tests ... works flawlessly
> (failover in < 1 second, clients reconnect instantly).
>
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging. I skipped to the end of the log file, just before the
> "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 23:24:11.088871