Hi all,

Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors 
that we struggle to handle in a non-intrusive manner. Restarting the MDS and 
the hypervisor that accessed the object in question seems to be the only way 
we can clear the error so that we can repair the PG and recover access. Any 
pointers on how to handle this issue more gently than rebooting the hypervisor 
and failing the MDS would be welcome!


The problem seems to affect only one specific pool (id 42) that is used for 
cephfs_data. This pool is our second CephFS data pool in this cluster. The 
data in the pool is exposed via Samba from an LXC container that has the 
CephFS filesystem bind-mounted from the hypervisor.

Ceph was recently updated to version 16.2.11 (Pacific) -- the kernel version 
is 5.13.19-6-pve on the OSD hosts/Samba containers and 5.19.17-2-pve on the 
MDS hosts.


The following warnings are issued:
$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; Too many 
repaired reads on 1 OSDs; Degraded data redundancy: 
1/2648430090 objects degraded (0.000%), 1 pg degraded; 1 slow ops, oldest one 
blocked for 608 sec, osd.34 has slow ops
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
release
    mds.hk-cephnode-65(mds.0): Client hk-cephnode-56 failing to respond to 
capability release client_id: 9534859837
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
    osd.34 had 9936 reads repaired
[WRN] PG_DEGRADED: Degraded data redundancy: 1/2648430090 objects degraded 
(0.000%), 1 pg degraded
    pg 42.e2 is active+recovering+degraded+repair, acting [34,275,284]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops



The log for osd.34 is flooded with these messages:
root@hk-cephnode-53:~# tail /var/log/ceph/ceph-osd.34.log
2023-04-26T11:41:00.760+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try 
copies on 275,284
2023-04-26T11:41:00.784+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try 
copies on 275,284
2023-04-26T11:41:00.824+0200 7f03a821f700 -1 osd.34 1352563 get_health_metrics 
reporting 1 slow ops, oldest is osd_op(client.9534859837.0:20412906 42.e2 
42:4703efac:::10003d86a99.00000001:head [read 0~1048576 [307@0] out=1048576b] 
snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e1352553)
2023-04-26T11:41:00.824+0200 7f03a821f700  0 log_channel(cluster) log [WRN] : 1 
slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'qa-cephfs_data' 
: 1 ])
2023-04-26T11:41:00.840+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try 
copies on 275,284
2023-04-26T11:41:00.888+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on 
42:4703efac:::10003d86a99.00000001:head
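
For what it's worth, a quick way to see which objects trigger the crc errors 
is to count them per object name in the OSD log. This is just a sketch that 
assumes the log format shown above:

```shell
# Count "full-object read crc" errors per object in an OSD log
# (sketch; the regex matches the error format in the excerpt above).
summarize_crc_errors() {
    grep 'full-object read crc' "$1" \
        | grep -oE '[0-9a-f]+:[0-9a-f]+:::[^:]+:head' \
        | sort | uniq -c | sort -rn
}
```

E.g. `summarize_crc_errors /var/log/ceph/ceph-osd.34.log`.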



We have tried the following:
 - Restarting the OSD in question clears the error for a few seconds, but 
then we also get OSD_TOO_MANY_REPAIRS on the other OSDs whose PGs hold the 
object with blocked I/O.

 - Repairing the PG seems to restart every 10 seconds without actually making 
any progress. (Is there a way to check repair progress?)

 - Restarting the MDS and the hypervisor clears the error (the hypervisor 
hangs for several minutes before timing out). However, if the object is 
requested again, the error reoccurs. If we don't access the object we are 
eventually able to repair the PG.

 - Occasionally, setting the primary affinity to 0 for the primary OSD in the 
PG and restarting all affected OSDs clears the error; we are then able to 
repair the PG (unless the object is accessed during recovery), and access to 
the object is fine afterwards.

 - Finding and deleting the file pointing to the object (10003d86a99) and 
restarting the OSDs clears the error.

 - Killing the Samba process that accessed the object does not clear the 
SLOW_OPS, and hence the error prevails.

 - Normal scrubs have revealed a handful of other damaged PGs in the same 
pool (id 42); repairing those works without any problems.

 - We believe the MDS_CLIENT_LATE_RELEASE and SLOW_OPS warnings are symptoms 
of the blocked I/O.

 - We have verified that there are no SMART errors of any kind on any of our 
disks in the cluster.

 - If we don't handle this issue rather promptly, we experience a full lockup 
of the Samba container, and rebooting the hypervisor seems to be the only 
cure. Trying to force-unmount and remount CephFS does not help.
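
For anyone wondering how we map an object back to a file: the hex prefix of 
the object name (before the dot) is the file's inode number, so it can be 
converted to decimal and looked up by inode on the CephFS mount. A sketch 
(the mount point below is just an example):

```shell
# Resolve a CephFS data object name to the owning file via its inode.
# The hex prefix of the object name is the inode number.
obj=10003d86a99.00000001
ino=$((16#${obj%%.*}))   # hex -> decimal: 1099576142489
echo "$ino"
# find /mnt/cephfs -inum "$ino"   # example mount point
```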



This has now happened 6-7 times over the last 2 weeks, and we suspect that a 
hardware or memory error on one of our nodes may have caused the objects to 
be written to disk with bad checksums. We have replaced the mainboard in the 
node we think is the culprit and are currently testing its memory. Could 
these random checksum errors be caused by anything else that we should 
investigate? It's a bit suspicious that the errors occur in only one specific 
pool. If the mainboard were to blame, shouldn't we have seen these errors in 
more pools by now?

Regardless, we are stumped by how Ceph handles this error. Checksum errors 
should not leave clients hanging like this, should they? Should this be 
considered a bug? Is there a way to cancel the blocking I/O request to clear 
the error? And why does the PG flap between active+recovering+degraded+repair, 
active+recovering+repair, and active+clean+repair every few seconds?

Any ideas on how to gracefully battle this problem? Thanks!


--thomas


Thomas Hukkelberg
[email protected]


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
