Am 16.04.2018 um 08:58 schrieb Oliver Freyermuth:
> Am 16.04.2018 um 02:43 schrieb Oliver Freyermuth:
>> Am 15.04.2018 um 23:04 schrieb John Spray:
>>> On Fri, Apr 13, 2018 at 5:16 PM, Oliver Freyermuth
>>> <freyerm...@physik.uni-bonn.de> wrote:
>>>> Dear Cephalopodians,
>>>>
>>>> in our cluster (CentOS 7.4, EC Pool, Snappy compression, Luminous 12.2.4),
>>>> we often have all (~40) clients accessing one file in readonly mode, even 
>>>> with multiple processes per client doing that.
>>>>
>>>> Sometimes (I do not yet know when, nor why!) the MDS ends up in a 
>>>> situation like:
>>>> -----------------------------------------------------------
>>>> 2018-04-13 18:08:34.378888 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : 292 slow requests, 5 included below; oldest blocked for > 1745.864417 
>>>> secs
>>>> 2018-04-13 18:08:34.378900 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : slow request 960.563534 seconds old, received at 2018-04-13 
>>>> 17:52:33.815273: client_request(client.34720:16487379 getattr pAsLsXsFs 
>>>> #0x1000009ff6d 2018-04-13 17:52:33.814904 caller_uid=94894, 
>>>> caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378904 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : slow request 30.636678 seconds old, received at 2018-04-13 
>>>> 18:08:03.742128: client_request(client.34302:16453640 getattr pAsLsXsFs 
>>>> #0x1000009ff6d 2018-04-13 18:08:03.741630 caller_uid=94894, 
>>>> caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378908 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : slow request 972.648926 seconds old, received at 2018-04-13 
>>>> 17:52:21.729881: client_request(client.34720:16487334 lookup 
>>>> #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:52:21.729450 
>>>> caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378913 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : slow request 1685.953657 seconds old, received at 2018-04-13 
>>>> 17:40:28.425149: client_request(client.34810:16564864 lookup 
>>>> #0x1000001fcab/sometarball.tar.gz 2018-04-13 17:40:28.424961 
>>>> caller_uid=94894, caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> 2018-04-13 18:08:34.378918 7f1ce4472700  0 log_channel(cluster) log [WRN] 
>>>> : slow request 1552.743795 seconds old, received at 2018-04-13 
>>>> 17:42:41.635012: client_request(client.34302:16453566 getattr pAsLsXsFs 
>>>> #0x1000009ff6d 2018-04-13 17:42:41.634726 caller_uid=94894, 
>>>> caller_gid=513{513,}) currently failed to rdlock, waiting
>>>> -----------------------------------------------------------
>>>> As you can see (oldest blocked for > 1745.864417 secs) it stays in that 
>>>> situation for quite a while.
>>>> The number of blocked requests is also not decreasing, but instead slowly 
>>>> increasing whenever a new request is added to the queue.
>>>>
>>>> We have a setup of one active MDS, a standby-replay, and a standby.
>>>> Triggering a failover does not help, it only resets the "oldest blocked" 
>>>> time.
>>>
>>> Sounds like a client issue (a client is holding a lock on a file but
>>> failing to relinquish it for another client's request to be
>>> processed).
>>>
>>> Are these kernel (and what version?) or fuse clients?
>>
>> The full cluster is running with ceph-fuse clients, all on 12.2.4. 
>> Additionally, there are 6 NFS Ganesha servers using FSAL_CEPH, i.e. 
>> libcephfs. 
>>
>> Cheers,
>>      Oliver
> 
> Related to that: In case it happens again, which it surely will sooner or 
> later, how can I diagnose
> which client is holding a lock and not relinquishing it? 
> Is there a way to dump all held locks, ideally with the time period how long 
> they were held? 
> 
> ceph daemon mds.XXX help
> does not yield anything obvious. 
> 
> Cheers and thanks,
>       Oliver

It happened again - and I managed to track down the client causing it: 
I dumped all operations in progress on the MDS and looked for machines 
that had no operations waiting for an rdlock. One NFS Ganesha server 
stood out as the only machine unaffected. 
Unmounting the NFS share on the clients had no effect - everything stayed 
in the "stuck" state. 
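
The filtering I did can be sketched roughly like this (the admin-socket 
command is the real one shown further down in this thread; the sample 
lines and client IDs are just taken from the log excerpt above):

```shell
# Rough sketch of the diagnosis. On the active MDS host, dump all
# in-flight operations via the admin socket:
#   ceph daemon mds.mon001 ops > /tmp/mds-ops.json
# Then count the waiting requests per client; a session that holds caps
# but never shows up among the waiters is a candidate for the lock holder.
# Sample lines (from the log excerpt above) stand in for the real dump:
sample='client_request(client.34720:16487379 getattr ... currently failed to rdlock, waiting
client_request(client.34302:16453640 getattr ... currently failed to rdlock, waiting
client_request(client.34720:16487334 lookup ... currently failed to rdlock, waiting'
counts=$(printf '%s\n' "$sample" | grep -o 'client\.[0-9]*' | sort | uniq -c | sort -rn)
echo "$counts"
```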

I then stopped the NFS Ganesha server. This took quite a while, until 
systemd decided it was taking too long and force-killed the process. 
Shortly after, even before the NFS Ganesha libcephfs client was evicted, 
the MDS got unstuck, processed all pending requests within about 2 
seconds, and everything was fine again. 

After that, the dead NFS Ganesha client was evicted, and I just restarted the 
NFS Ganesha server. 
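
For reference (a hedged note, not something I tried in this instance): 
instead of killing the whole Ganesha process, the stuck libcephfs session 
could presumably also be inspected and evicted on its own. The MDS name 
and client id below are just examples from this thread:

```shell
# List sessions on the active MDS to find the Ganesha client's id:
ceph daemon mds.mon001 session ls

# Evict only that session (id taken from the client.34810 example above):
ceph tell mds.mon001 client evict id=34810
```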

Since this appears to be reproducible (even though we still do not know 
exactly how), this is a rather ugly issue. It seems NFS Ganesha with 
libcephfs is holding locks and not returning them. We are using the 
packages from:
https://eu.ceph.com/nfs-ganesha/
Newer versions are available upstream, but I don't know whether they fix 
this critical issue. 

Is there somebody else with experience on that? 

As far as I can reproduce it, the I/O pattern is:
1. A machine with the NFS mount checks whether 
datafolder/sometarball.tar.gz exists. 
2. Machines in the cluster (with ceph-fuse) all access 
datafolder/sometarball.tar.gz (and extract it to a local filesystem). 

(1) may happen several times while (2) is going on, and it seems this somehow 
causes the NFS-server to (sometimes?) keep a persistent lock which is never 
relinquished. 
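
A minimal local sketch of that access pattern (temporary directories 
stand in for the NFS and CephFS mounts; this only mimics the pattern, it 
does not reproduce the lock-up by itself):

```shell
# A temp dir stands in for the filesystem shared via NFS and ceph-fuse.
mnt=$(mktemp -d)
mkdir "$mnt/datafolder"
echo payload > "$mnt/payload.txt"
tar -czf "$mnt/datafolder/sometarball.tar.gz" -C "$mnt" payload.txt

# (1) The NFS-mounted machine repeatedly checks that the tarball exists:
( for i in 1 2 3; do
    test -e "$mnt/datafolder/sometarball.tar.gz" && echo exists
  done ) &

# (2) The ceph-fuse clients meanwhile extract it to a local filesystem:
out=$(mktemp -d)
tar -xzf "$mnt/datafolder/sometarball.tar.gz" -C "$out"
wait
cat "$out/payload.txt"
```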

Any help appreciated!
Oliver


> 
>>
>>>
>>> John
>>>
>>>>
>>>> I checked the following things on the active MDS:
>>>> -----------------------------------------------------------
>>>> # ceph daemon mds.mon001 objecter_requests
>>>> {
>>>>     "ops": [],
>>>>     "linger_ops": [],
>>>>     "pool_ops": [],
>>>>     "pool_stat_ops": [],
>>>>     "statfs_ops": [],
>>>>     "command_ops": []
>>>> }
>>>> # ceph daemon mds.mon001 ops | grep event | grep -v "initiated" | grep -v 
>>>> "failed to rdlock" | grep -v events
>>>> => no output, only "initiated" and "rdlock" are in the queue.
>>>> -----------------------------------------------------------
>>>>
>>>> There's also almost no CPU load, almost no other I/O, and ceph is 
>>>> deep-scrubbing one pg (this also finishes and the next pg is scrubbed fine),
>>>> and the scrubbing is not even happening in the metadata pool (easy to see 
>>>> in the Luminous dashboard):
>>>> -----------------------------------------------------------
>>>> # ceph -s
>>>>   cluster:
>>>>     id:     some_funny_hash
>>>>     health: HEALTH_WARN
>>>>             1 MDSs report slow requests
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum mon003,mon001,mon002
>>>>     mgr: mon001(active), standbys: mon002, mon003
>>>>     mds: cephfs_baf-1/1/1 up  {0=mon001=up:active}, 1 up:standby-replay, 1 
>>>> up:standby
>>>>     osd: 196 osds: 196 up, 196 in
>>>>
>>>>   data:
>>>>     pools:   2 pools, 4224 pgs
>>>>     objects: 15649k objects, 61761 GB
>>>>     usage:   114 TB used, 586 TB / 700 TB avail
>>>>     pgs:     4223 active+clean
>>>>              1    active+clean+scrubbing+deep
>>>>
>>>>   io:
>>>>     client:   175 kB/s rd, 3 op/s rd, 0 op/s wr
>>>> -----------------------------------------------------------
>>>>
>>>> Does anybody have any idea what's going on here?
>>>>
>>>> Yesterday, this also happened, but resolved itself after about 1 hour.
>>>> Right now, it's going on for about half an hour...
>>>>
>>>> Cheers,
>>>>         Oliver
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>
>>
>>
> 


