Hi all,
Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors
that we struggle to handle in a non-intrusive manner. Restarting the MDS plus
the hypervisor that accessed the object in question seems to be the only way we
can clear the error so we can repair the PG and recover access. Any pointers on
how to handle this issue more gently than rebooting the hypervisor and failing
the MDS would be welcome!
The problem seems to affect only one specific pool (id 42), which is used for
cephfs_data. This pool is our second cephfs data pool in this cluster. The data
in the pool is accessed via Samba from an LXC container that has the cephfs
filesystem bind-mounted from the hypervisor.
Ceph was recently updated to version 16.2.11 (pacific) -- the kernel version is
5.13.19-6-pve on the OSD hosts/Samba containers and 5.19.17-2-pve on the MDS
hosts.
The following warnings are issued:
$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; Too many
repaired reads on 1 OSDs; Degraded data redundancy: 1/2648430090 objects
degraded (0.000%), 1 pg degraded; 1 slow ops, oldest one blocked for 608 sec,
osd.34 has slow ops
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability
release
mds.hk-cephnode-65(mds.0): Client hk-cephnode-56 failing to respond to
capability release client_id: 9534859837
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
osd.34 had 9936 reads repaired
[WRN] PG_DEGRADED: Degraded data redundancy: 1/2648430090 objects degraded
(0.000%), 1 pg degraded
pg 42.e2 is active+recovering+degraded+repair, acting [34,275,284]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops
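For reference, this is roughly how we inspect the scrub errors on the affected
PG (a sketch; the ceph/rados invocations are guarded so it is copy-paste safe
on hosts without the CLI installed):

```shell
# The pool id is the part of the pg id before the dot, so pg 42.e2 lives
# in pool 42 (our second cephfs_data pool).
pgid="42.e2"
echo "pool id: ${pgid%%.*}"

# List the objects the last scrub flagged in that PG (JSON output), and
# dump the full PG state/recovery detail.
command -v rados >/dev/null && rados list-inconsistent-obj "$pgid" --format=json-pretty || true
command -v ceph >/dev/null && ceph pg "$pgid" query || true
```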
The logs for OSD.34 are flooded with these messages:
root@hk-cephnode-53:~# tail /var/log/ceph/ceph-osd.34.log
2023-04-26T11:41:00.760+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try
copies on 275,284
2023-04-26T11:41:00.784+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try
copies on 275,284
2023-04-26T11:41:00.824+0200 7f03a821f700 -1 osd.34 1352563 get_health_metrics
reporting 1 slow ops, oldest is osd_op(client.9534859837.0:20412906 42.e2
42:4703efac:::10003d86a99.00000001:head [read 0~1048576 [307@0] out=1048576b]
snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e1352553)
2023-04-26T11:41:00.824+0200 7f03a821f700 0 log_channel(cluster) log [WRN] : 1
slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'qa-cephfs_data'
: 1 ])
2023-04-26T11:41:00.840+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on
42:4703efac:::10003d86a99.00000001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 missing primary copy of 42:4703efac:::10003d86a99.00000001:head, will try
copies on 275,284
2023-04-26T11:41:00.888+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] :
42.e2 full-object read crc 0xebd673ed != expected 0xffffffff on
42:4703efac:::10003d86a99.00000001:head
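Note that the slow op in the OSD log embeds the same client id (9534859837)
that MDS_CLIENT_LATE_RELEASE reports, so the blocked request and the stuck
capability belong to one client. If anyone can confirm: would evicting just
that client session be a gentler alternative to failing the MDS? A sketch of
how we correlate the warnings (the eviction command at the end is an untested
suggestion, not something we have verified):

```shell
# Extract the client id from the osd_op line and compare it with the
# client_id reported by MDS_CLIENT_LATE_RELEASE (9534859837).
line='osd_op(client.9534859837.0:20412906 42.e2 ...'
cid=$(echo "$line" | sed -n 's/.*osd_op(client\.\([0-9]*\)\..*/\1/p')
echo "blocking client: $cid"

# On the OSD host, inspect the blocked request via the admin socket
# (guarded so the sketch runs on hosts without ceph installed).
command -v ceph >/dev/null && ceph daemon osd.34 dump_blocked_ops || true

# Untested suggestion: evict only this cephfs client session instead of
# failing the whole MDS and rebooting the hypervisor.
command -v ceph >/dev/null && ceph tell mds.hk-cephnode-65 client evict id="$cid" || true
```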
We have tried the following:
- Restarting the OSD in question clears the error for a few seconds, but then
we also get OSD_TOO_MANY_REPAIRS on other OSDs whose PGs hold the object with
blocked I/O.
- Trying to repair the PG seems to restart every 10 seconds without actually
progressing. (Is there a way to check repair progress?)
- Restarting the MDS and hypervisor clears the error (the hypervisor hangs for
several minutes before timing out). However if the object is requested again
the error reoccurs. If we don't access the object we are able to eventually
repair the PG.
- Occasionally, setting the primary-affinity to 0 for the primary OSD in the PG
clears the error after restarting all affected OSDs; we are then able to repair
the PG (unless the object is accessed during recovery), and access to the
object is OK afterwards.
- Finding and deleting the file pointing to the object (10003d86a99) and
restarting OSDs will clear the error.
- Killing the samba process that accessed the object does not clear the
SLOW_OPS, and hence the error persists.
- Normal scrubs have revealed a handful of other PGs in the same pool (id 42)
that are damaged, and we are repairing those without any problems.
- We believe the MDS_CLIENT_LATE_RELEASE and SLOW_OPS errors are symptoms of
the blocked I/O.
- We have verified that there are no SMART errors of any kind on any of our
disks in the cluster.
- If we don't handle this issue promptly, we experience a full lockup of the
samba container, and rebooting the hypervisor seems to be the only cure. Trying
to force-unmount and remount cephfs does not help.
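For anyone else hitting this: the way we locate the file behind an object name
like 10003d86a99.00000001 is that the prefix is the file's inode number in hex
(a sketch; /mnt/cephfs is a placeholder for wherever the filesystem is
mounted):

```shell
# The object name prefix (10003d86a99) is the CephFS inode number in hex.
obj="10003d86a99.00000001"
hex="${obj%%.*}"
ino=$((16#$hex))
echo "inode: $ino"

# Find the file by inode number on the cephfs mount (guarded; adjust
# /mnt/cephfs to the actual mount point).
[ -d /mnt/cephfs ] && find /mnt/cephfs -xdev -inum "$ino" || true
```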
This has now happened 6-7 times over the last 2 weeks, and we suspect that a
hardware or memory error on one of our nodes may have caused the objects to be
written to disk with bad checksums. We have replaced the mainboard in the node
we suspect is the culprit and are currently testing the memory. Could these
random checksum errors be caused by anything else that we should investigate?
It's a bit suspicious that the error only occurs on one specific pool. If the
mainboard were to blame, shouldn't we be seeing these errors in more pools by
now?
Regardless, we are stumped by how Ceph handles this error. Checksum errors
should not leave clients hanging like this, should they? Should this be
considered a bug? Is there a way to cancel the blocking I/O request to clear
the error? And why does the PG flap between active+recovering+degraded+repair,
active+recovering+repair, and active+clean+repair every few seconds?
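Regarding our own question about checking repair progress: the closest we have
found is polling the PG state from `ceph pg <pgid> query` (a minimal sketch,
guarded so it is copy-paste safe on hosts without a cluster; python is used
instead of jq to avoid an extra dependency):

```shell
# Extract the "state" field from the JSON that `ceph pg <pgid> query` prints.
pg_state() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["state"])'
}

# Poll the PG state to watch it flap between the repair states we see.
if command -v ceph >/dev/null; then
  for _ in 1 2 3; do
    ceph pg 42.e2 query | pg_state
    sleep 5
  done
fi
```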
Any ideas on how to gracefully battle this problem? Thanks!
--thomas
Thomas Hukkelberg
[email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]