Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-27 Thread thoralf schulze
hi Zheng,

On 8/26/19 3:31 PM, Yan, Zheng wrote:

[…]
> change the code to:
[…]

we can happily confirm that this resolves the issue.

thank you _very_ much & with kind regards,
t.





Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-26 Thread Yan, Zheng
On Mon, Aug 26, 2019 at 9:25 PM thoralf schulze  wrote:
>
> hi Zheng -
>
> On 8/26/19 2:55 PM, Yan, Zheng wrote:
> > I tracked down the bug
> > https://tracker.ceph.com/issues/41434
>
> wow, that was quick - thank you for investigating. we are looking
> forward to the fix :-)
>
> in the meantime, is there anything we can do to prevent q ==
> p->second.end() from happening? something along the lines of
>
> --- a/src/mds/Locker.cc
> +++ b/src/mds/Locker.cc
> @@ -1867,6 +1867,8 @@ void Locker::file_update_finish(CInode *in, MutationRef& mut, unsigned flags,
>    auto p = in->client_snap_caps.begin();
>    while (p != in->client_snap_caps.end()) {
>      auto q = p->second.find(client);
> +    // horrible hack for zecm
> +    std::this_thread::sleep_for(std::chrono::milliseconds(1));
>      if (q != p->second.end()) {
>        SimpleLock *lock = in->get_lock(p->first);
>        ceph_assert(lock);
>
> thank you very much & with kind regards,
> t.
>
change the code to:

if (q != p->second.end()) {
  // ... keep the existing body unchanged ...
} else {
  ++p;
}
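
applied to the hunk quoted above, the loop ends up roughly like the sketch
below (the per-lock handling inside the if is elided; this only illustrates
the control flow, not the exact upstream patch - see
https://tracker.ceph.com/issues/41434 for the actual fix):

auto p = in->client_snap_caps.begin();
while (p != in->client_snap_caps.end()) {
  auto q = p->second.find(client);
  if (q != p->second.end()) {
    // ... existing handling: look up the SimpleLock, drop this client's
    // snap cap and erase/advance p as before ...
  } else {
    // previously p was never advanced when this client had no entry,
    // so the loop spun forever and starved the MDS heartbeat
    ++p;
  }
}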


Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-26 Thread thoralf schulze
hi Zheng -

On 8/26/19 2:55 PM, Yan, Zheng wrote:
> I tracked down the bug
> https://tracker.ceph.com/issues/41434

wow, that was quick - thank you for investigating. we are looking
forward to the fix :-)

in the meantime, is there anything we can do to prevent q ==
p->second.end() from happening? something along the lines of

--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -1867,6 +1867,8 @@ void Locker::file_update_finish(CInode *in, MutationRef& mut, unsigned flags,
   auto p = in->client_snap_caps.begin();
   while (p != in->client_snap_caps.end()) {
     auto q = p->second.find(client);
+    // horrible hack for zecm
+    std::this_thread::sleep_for(std::chrono::milliseconds(1));
     if (q != p->second.end()) {
       SimpleLock *lock = in->get_lock(p->first);
       ceph_assert(lock);

thank you very much & with kind regards,
t.





Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-26 Thread Yan, Zheng
On Mon, Aug 26, 2019 at 6:57 PM thoralf schulze  wrote:
>
> hi Zheng,
>
> On 8/21/19 4:32 AM, Yan, Zheng wrote:
> > Please enable debug mds (debug_mds=10), and try reproducing it again.
>
> please find the logs at
> https://www.user.tu-berlin.de/thoralf.schulze/ceph-debug.tar.xz .
>
> we managed to reproduce the issue as a worst-case scenario: before
> snapshotting, juju-d0f708-5-lxd-1 and juju-d0f708-10-lxd-1 were the
> active mds's and juju-d0f708-3-lxd-1 and juju-d0f708-9-lxd-1 the
> standbys. we created the snapshot at ~08:11:50; a little later the
> failover happened and juju-d0f708-5-lxd-1 and juju-d0f708-10-lxd-1 went
> MIA. a little later still, the now-active juju-d0f708-3-lxd-1 and
> juju-d0f708-9-lxd-1 mds's dropped out of the cluster as well. we started
> restarting all mds daemons at ~08:16.
>
> thank you very much & with kind regards,
> t.
>

I tracked down the bug
https://tracker.ceph.com/issues/41434

Thanks
Yan, Zheng


Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-26 Thread thoralf schulze
hi Zheng,

On 8/21/19 4:32 AM, Yan, Zheng wrote:
> Please enable debug mds (debug_mds=10), and try reproducing it again.

please find the logs at
https://www.user.tu-berlin.de/thoralf.schulze/ceph-debug.tar.xz .

we managed to reproduce the issue as a worst-case scenario: before
snapshotting, juju-d0f708-5-lxd-1 and juju-d0f708-10-lxd-1 were the
active mds's and juju-d0f708-3-lxd-1 and juju-d0f708-9-lxd-1 the
standbys. we created the snapshot at ~08:11:50; a little later the
failover happened and juju-d0f708-5-lxd-1 and juju-d0f708-10-lxd-1 went
MIA. a little later still, the now-active juju-d0f708-3-lxd-1 and
juju-d0f708-9-lxd-1 mds's dropped out of the cluster as well. we started
restarting all mds daemons at ~08:16.

thank you very much & with kind regards,
t.





Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-21 Thread thoralf schulze
hi Zheng,

On 8/21/19 4:32 AM, Yan, Zheng wrote:
> Please enable debug mds (debug_mds=10), and try reproducing it again.

we will get back to you with the logs on Monday.

thank you & with kind regards,
t.





Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-20 Thread Yan, Zheng
On Tue, Aug 20, 2019 at 9:43 PM thoralf schulze  wrote:
>
> hi there,
>
> we are struggling with the creation of cephfs-snapshots: doing so
> reproducibly causes a failover of our metadata servers. afterwards, the
> demoted mds servers won't be available as standby servers, and the mds
> daemons on these machines have to be restarted manually. more often than
> we'd wish, the failover fails altogether, resulting in an unresponsive cephfs.
>


Please enable debug mds (debug_mds=10), and try reproducing it again.
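
for example, one way to raise the verbosity at runtime (use the mds names
from your cluster; this is just a sketch):

# on each mds host, via the admin socket:
ceph daemon mds.<name> config set debug_mds 10
# or add "debug_mds = 10" to the [mds] section of ceph.conf and restart.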

Regards
Yan, Zheng

> this is with mimic 13.2.6 and a single cephfs. we are running 4 mds
> servers with plenty of cpu and ram resources in a multi-active setup
> with 2 active and 2 standby mds's:
>
> mds: ceph-fs-2/2/2 up
> {0=juju-d0f708-3-lxd-1=up:active,1=juju-d0f708-5-lxd-1=up:active}, 2
> up:standby-replay
>
> is the transition from the active to the standby mds servers intended?
> and if not, how can we prevent it?
> we could live with the failover if the ex-active mds's would still be a
> part of the cluster afterwards, but this is not the case. on top of
> that, the failover is not 100% reliable - if it fails, the newly active
> mds's exhibit the same symptoms as the failed ones: they just sit around
> complaining about "MDS internal heartbeat is not healthy!".
>
> strace'ing the mds processes on the ex-active mds hosts shows that they
> are mostly waiting for futexes to become available. we also found that
> the issue is alleviated a bit by raising mds_cache_memory_limit from
> its default of 1 GB to 32 GB - in this case, the failover has a higher
> chance of succeeding.
>
> below are some logs from a successful failover - juju-d0f708-9-lxd-1 and
> juju-d0f708-10-lxd-1 were the active mds's and juju-d0f708-3-lxd-1 and
> juju-d0f708-5-lxd-1 the standbys. sorry for being very verbose, i don't
> want to withhold any information that might be necessary to debug this
> issue … if it helps, i can also provide the logs for
> juju-d0f708-10-lxd-1 and juju-d0f708-5-lxd-1 and the output of ceph
> daemon perf dump on all mds's before and after the issue occurs.
>
> thank you very much & with kind regards,
> t.
>
> --- logs ---
>
> ceph-mgr.log on a mon:
>
> 2019-08-20 09:18:23.642 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614f2184000 mgrreport(mds.juju-d0f708-9-lxd-1 +0-0 packed
> 1374) v6 from mds.0 172.28.9.20:6800/2237168008
> 2019-08-20 09:18:23.646 7f5684835700  1 mgr finish mon failed to return
> metadata for mds.juju-d0f708-9-lxd-1: (22) Invalid argument
> 2019-08-20 09:18:55.781 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614f256e700 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed
> 1374) v6 from mds.1 172.28.9.21:6800/2008779502
> 2019-08-20 09:18:55.781 7f5684835700  1 mgr finish mon failed to return
> metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
> 2019-08-20 09:21:26.562 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614f8780a00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
> v6 from mds.? 172.28.9.21:6800/1319885328
> 2019-08-20 09:21:26.562 7f5684835700  1 mgr finish mon failed to return
> metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
> 2019-08-20 09:21:27.558 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614f5666a00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
> v6 from mds.? 172.28.9.21:6800/1319885328
> 2019-08-20 09:21:27.562 7f5684835700  1 mgr finish mon failed to return
> metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
> 2019-08-20 09:21:28.558 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614f2e3ee00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
> v6 from mds.? 172.28.9.21:6800/1319885328
> 2019-08-20 09:21:28.562 7f5684835700  1 mgr finish mon failed to return
> metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
> 2019-08-20 09:21:29.558 7f5679639700  0 ms_deliver_dispatch: unhandled
> message 0x5614efa71880 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
> v6 from mds.? 172.28.9.21:6800/1319885328
> [… more of these]
>
> ceph-mds.log on juju-d0f708-9-lxd-1 (ex-active rank 0):
>
> 2019-08-20 09:17:44.824 7f7d3a138700  5 mds.beacon.juju-d0f708-9-lxd-1
> Sending beacon up:active seq 59
> 2019-08-20 09:17:44.824 7f7d3fb2b700  5 mds.beacon.juju-d0f708-9-lxd-1
> received beacon reply up:active seq 59 rtt 0
> 2019-08-20 09:17:45.020 7f7d3db27700  4 mds.0.server
> handle_client_request client_request(client.89193:898963 getattr
> pAsLsXsFs #0x100221e 2019-08-20 09:17:45
> .021386 caller_uid=0, caller_gid=0{}) v2
> [… lots of these …]
> 2019-08-20 09:17:45.092 7f7d36130700  5 mds.0.log _submit_thread
> 109124009784~1190 : EUpdate cap update [metablob 0x14c, 1 dirs]
> 2019-08-20 09:17:45.092 7f7d3db27700  4 mds.0.server
> handle_client_request client_request(client.49306:901053 getattr
> pAsLsXsFs #0x100221f 2019-08-20 09:17:45
> .093085 caller_uid=0, caller_gid=0{}) v2
> [… lots of these …]
> 2019-08-20 09:17:45.260 7f7d36130700  5 mds

[ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-20 Thread thoralf schulze
hi there,

we are struggling with the creation of cephfs-snapshots: doing so
reproducibly causes a failover of our metadata servers. afterwards, the
demoted mds servers won't be available as standby servers, and the mds
daemons on these machines have to be restarted manually. more often than
we'd wish, the failover fails altogether, resulting in an unresponsive cephfs.
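
for anyone trying to reproduce this: a cephfs snapshot is simply a
directory created under the hidden .snap directory. the fs name and paths
below are placeholders:

# allow snapshots on the filesystem, if they are not already enabled
ceph fs set <fsname> allow_new_snaps true
# snapshot a directory from a client mount
mkdir /mnt/cephfs/some/dir/.snap/snap-2019-08-20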

this is with mimic 13.2.6 and a single cephfs. we are running 4 mds
servers with plenty of cpu and ram resources in a multi-active setup
with 2 active and 2 standby mds's:

mds: ceph-fs-2/2/2 up
{0=juju-d0f708-3-lxd-1=up:active,1=juju-d0f708-5-lxd-1=up:active}, 2
up:standby-replay

is the transition from the active to the standby mds servers intended?
and if not, how can we prevent it?
we could live with the failover if the ex-active mds's would still be a
part of the cluster afterwards, but this is not the case. on top of
that, the failover is not 100% reliable - if it fails, the newly active
mds's exhibit the same symptoms as the failed ones: they just sit around
complaining about "MDS internal heartbeat is not healthy!".

strace'ing the mds processes on the ex-active mds hosts shows that they
are mostly waiting for futexes to become available. we also found that
the issue is alleviated a bit by raising mds_cache_memory_limit from
its default of 1 GB to 32 GB - in this case, the failover has a higher
chance of succeeding.
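
for reference, one way to bump the limit (a sketch - the value is in
bytes, and the daemon name is a placeholder):

# at runtime, on each mds host via the admin socket:
ceph daemon mds.<name> config set mds_cache_memory_limit 34359738368
# or persistently in ceph.conf, [mds] section:
# mds_cache_memory_limit = 34359738368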

below are some logs from a successful failover - juju-d0f708-9-lxd-1 and
juju-d0f708-10-lxd-1 were the active mds's and juju-d0f708-3-lxd-1 and
juju-d0f708-5-lxd-1 the standbys. sorry for being very verbose, i don't
want to withhold any information that might be necessary to debug this
issue … if it helps, i can also provide the logs for
juju-d0f708-10-lxd-1 and juju-d0f708-5-lxd-1 and the output of ceph
daemon perf dump on all mds's before and after the issue occurs.

thank you very much & with kind regards,
t.

--- logs ---

ceph-mgr.log on a mon:

2019-08-20 09:18:23.642 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614f2184000 mgrreport(mds.juju-d0f708-9-lxd-1 +0-0 packed
1374) v6 from mds.0 172.28.9.20:6800/2237168008
2019-08-20 09:18:23.646 7f5684835700  1 mgr finish mon failed to return
metadata for mds.juju-d0f708-9-lxd-1: (22) Invalid argument
2019-08-20 09:18:55.781 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614f256e700 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed
1374) v6 from mds.1 172.28.9.21:6800/2008779502
2019-08-20 09:18:55.781 7f5684835700  1 mgr finish mon failed to return
metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
2019-08-20 09:21:26.562 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614f8780a00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
v6 from mds.? 172.28.9.21:6800/1319885328
2019-08-20 09:21:26.562 7f5684835700  1 mgr finish mon failed to return
metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
2019-08-20 09:21:27.558 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614f5666a00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
v6 from mds.? 172.28.9.21:6800/1319885328
2019-08-20 09:21:27.562 7f5684835700  1 mgr finish mon failed to return
metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
2019-08-20 09:21:28.558 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614f2e3ee00 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
v6 from mds.? 172.28.9.21:6800/1319885328
2019-08-20 09:21:28.562 7f5684835700  1 mgr finish mon failed to return
metadata for mds.juju-d0f708-10-lxd-1: (22) Invalid argument
2019-08-20 09:21:29.558 7f5679639700  0 ms_deliver_dispatch: unhandled
message 0x5614efa71880 mgrreport(mds.juju-d0f708-10-lxd-1 +0-0 packed 6)
v6 from mds.? 172.28.9.21:6800/1319885328
[… more of these]

ceph-mds.log on juju-d0f708-9-lxd-1 (ex-active rank 0):

2019-08-20 09:17:44.824 7f7d3a138700  5 mds.beacon.juju-d0f708-9-lxd-1
Sending beacon up:active seq 59
2019-08-20 09:17:44.824 7f7d3fb2b700  5 mds.beacon.juju-d0f708-9-lxd-1
received beacon reply up:active seq 59 rtt 0
2019-08-20 09:17:45.020 7f7d3db27700  4 mds.0.server
handle_client_request client_request(client.89193:898963 getattr
pAsLsXsFs #0x100221e 2019-08-20 09:17:45
.021386 caller_uid=0, caller_gid=0{}) v2
[… lots of these …]
2019-08-20 09:17:45.092 7f7d36130700  5 mds.0.log _submit_thread
109124009784~1190 : EUpdate cap update [metablob 0x14c, 1 dirs]
2019-08-20 09:17:45.092 7f7d3db27700  4 mds.0.server
handle_client_request client_request(client.49306:901053 getattr
pAsLsXsFs #0x100221f 2019-08-20 09:17:45
.093085 caller_uid=0, caller_gid=0{}) v2
[… lots of these …]
2019-08-20 09:17:45.260 7f7d36130700  5 mds.0.log _submit_thread
109124010994~107 : ETableServer snaptable prepare reqid 2 mds.0 tid 98
version 98 mutation=43 bytes
2019-08-20 09:17:45.264 7f7d36130700  5 mds.0.log _submit_thread
109124011121~11940 : EUpdate mksnap [metablob 0x1, 1 dirs table_tids=^A,98]
2019-08-20 09:17:45.272 7f7d36130700  5 mds.0.log _submit_thread