Hi all,

I’ve experienced a strange issue with my cluster.
The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives each, plus 4 
SSD nodes with 5 SSDs each.
All the nodes are behind 3 monitors and split across 2 different CRUSH maps.
The whole cluster is running 10.2.7.

About 20 days ago I started to notice that long backups hang with "task 
jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD CRUSH map.
A few days ago another VM started to show high iowait without doing any IOPS, also 
on the HDD CRUSH map.
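
I wasn't sure where to look while this was happening; I assume checks along these 
lines are the right way to spot slow requests from the cluster side (osd.12 below is 
only a placeholder id, and the daemon command has to be run on that OSD's host):

    ceph health detail                      # should report any requests blocked > 32 sec
    ceph daemon osd.12 dump_ops_in_flight   # ops currently stuck inside that OSD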

Today about a hundred VMs weren't able to read/write to many volumes, all of 
them on the HDD CRUSH map. Ceph health was OK and no significant log entries were 
found.
Not all the VMs experienced this problem, and in the meantime the IOPS on the 
journals and HDDs were very low, even though I was able to drive significant IOPS 
from the working VMs.
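
During the incident I would have liked to pinpoint the slow or stuck OSD instead of 
guessing. I assume something like the following is the usual way (again, the OSD id 
is just an example):

    ceph osd perf                           # per-OSD commit/apply latency; a slow disk should stand out
    ceph daemon osd.12 dump_historic_ops    # recent slow ops on a given OSD, with their durations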

After two hours of debugging I decided to reboot one of the OSD nodes and the 
cluster started to respond again. Now the OSD node is back in the cluster and the 
problem has disappeared.

Can someone help me understand what happened?
I see strange entries in the log files like:

accept replacing existing (lossy) channel (new one lossy=1)
fault with nothing to send, going to standby
leveldb manual compact 

I can share any logs that might help identify the issue.

Thank you.
Regards,

Matteo

