I am deploying Rook 1.10.13 with Ceph 17.2.6 on our Kubernetes clusters. We
make heavy use of the Ceph Shared Filesystem and have never faced an issue with
it before. Recently, we deployed it on Oracle Linux 9 VMs (previous/existing
deployments use CentOS/RHEL 7) and we are facing the following issue:
We have 30 worker nodes running a StatefulSet with 30 replicas (one per worker
node). Each pod in that StatefulSet runs a container with a Java process that
waits for jobs to be submitted. When a job arrives, it processes the request
and writes the data into a CephFS shared filesystem. That shared filesystem is
a single PVC used by all the pods in the StatefulSet.
The problem is that, from time to time, some Java processes get stuck
indefinitely when accessing the filesystem. For example, the thread in the
snippet below had been stuck for more than 6 hours:
```
"th-0-data-writer-site" #503 [505] prio=5 os_prio=0 cpu=451.11ms
elapsed=22084.19s tid=0x00007f8c3c04db10 nid=505 runnable [0x00007f8d8fdfc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.fs.UnixNativeDispatcher.lstat0(java.base@22-ea/Native Method)
at
sun.nio.fs.UnixNativeDispatcher.lstat(java.base@22-ea/UnixNativeDispatcher.java:351)
at
sun.nio.fs.UnixFileAttributes.get(java.base@22-ea/UnixFileAttributes.java:72)
at
sun.nio.fs.UnixFileSystemProvider.implDelete(java.base@22-ea/UnixFileSystemProvider.java:274)
at
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(java.base@22-ea/AbstractFileSystemProvider.java:109)
at java.nio.file.Files.deleteIfExists(java.base@22-ea/Files.java:1191)
at
com.x.streams.dataprovider.FileSystemDataProvider.close(FileSystemDataProvider.java:109)
at
com.x.streams.components.XDataWriter.closeWriters(XDataWriter.java:241)
at
com.x.streams.components.XDataWriter.onTerminate(XDataWriter.java:255)
at com.x.streams.core.StreamReader.doOnTerminate(StreamReader.java:136)
at com.x.streams.core.StreamReader.processData(StreamReader.java:112)
at
com.x.streams.core.ExecutionEngine$ProcessingThreadTask.run(ExecutionEngine.java:604)
at java.lang.Thread.runWith(java.base@22-ea/Thread.java:1583)
at java.lang.Thread.run(java.base@22-ea/Thread.java:1570)
```
Once the system reaches that point, it does not recover until we kill (the pod
of) the active MDS replica.
If we look at `ceph health detail`, we see this:
```
[root@rook-ceph-tools-75c947bc9d-ggb7m /]# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 1 MDSs report
slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability
release
mds.ceph-filesystem-a(mds.0): Client worker45:csi-cephfs-node failing to
respond to capability release client_id: 5927564
mds.ceph-filesystem-a(mds.0): Client worker1:csi-cephfs-node failing to
respond to capability release client_id: 7804133
mds.ceph-filesystem-a(mds.0): Client worker39:csi-cephfs-node failing to
respond to capability release client_id: 8391464
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.ceph-filesystem-a(mds.0): 31 slow requests are blocked > 30 secs
```
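In case it helps, these are the kinds of queries I was planning to run next to
dig into the stuck caps. The MDS daemon name `mds.ceph-filesystem-a` is taken
from the health output above; exact pod and daemon names may differ in your
deployment, and the debugfs path assumes the kernel CephFS client with debugfs
mounted on the worker node:

```shell
# From the Rook toolbox pod: list operations currently blocked on the
# active MDS, including which inodes/caps each one is waiting for
ceph tell mds.ceph-filesystem-a dump_blocked_ops

# List all client sessions; the client_ids reported by `ceph health detail`
# (5927564, 7804133, 8391464) can be matched to hostnames and mounts here
ceph tell mds.ceph-filesystem-a session ls

# Show all in-flight MDS ops with their current state and age
ceph tell mds.ceph-filesystem-a dump_ops_in_flight

# On an affected worker node: pending MDS requests as seen by the
# kernel client (requires debugfs mounted at /sys/kernel/debug)
cat /sys/kernel/debug/ceph/*/mdsc
```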
Any hint on how to troubleshoot this? My intuition is that capability releases
from some clients may never reach the MDS, leaving that portion of the shared
filesystem locked for good. But I am completely speculating here. I would
appreciate any pointers on how to troubleshoot this further.
We have some clusters running in production with almost the same configuration
(except for the OS) and everything runs fine there, but we cannot find the
reason why we are getting this behavior here.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]