Hi Everyone, 

We have lately been seeing a pattern: when a disk fails on Ceph, the OSD gets 
marked as down even though the disk itself may not be fully faulty yet, and 
the systemd OSD process is still running. 

Trying to kill the process doesn't work, and if the machine is rebooted, the 
reboot takes a long time, writes to the cluster stall for a good 10-15 
minutes, and the machine eventually just shuts itself down. 

1 - My question is, do you face such conditions? Are there any best practices 
for handling disk maintenance without stalling writes to the Ceph cluster? Do 
you move the OSD out of the production CRUSH area, fix the disk, then push it 
back in and let everything rebalance? 
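To make question 1 concrete, the kind of maintenance sequence I mean is 
roughly the following sketch (osd.12 is a made-up ID; whether to use noout 
vs. a full CRUSH reweight is exactly what I'm asking about):

```shell
# Hypothetical disk-replacement sequence for a failing osd.12
ceph osd set noout                     # optional: suppress rebalancing for a quick swap
ceph osd out 12                        # mark the OSD out so PGs remap away from it
systemctl stop ceph-osd@12             # try to stop the daemon cleanly
systemctl kill -s SIGKILL ceph-osd@12  # last resort when the stop hangs on a dying disk
# ... replace or repair the disk ...
ceph osd in 12                         # bring the OSD back
ceph osd unset noout                   # allow normal recovery again
```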


2 - Lastly, I wanted to know what happens when a machine goes down due to a 
forced shutdown: could some data in the Ceph journal be lost? When data is 
written, does it go to the OSD journal first, then to the OSD, and then get 
replicated from that OSD to the other two replicas? If so, would powering off 
a machine in an unfriendly manner lose the data in the journal partition and 
cause data loss, or is the replication synchronous? 


Detailed answers are welcome and thanks in advance! 

The Ceph version is Jewel (10.2.10). 

Thanks. 

Regards, 
Ossi 

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com