Recently our Ceph cluster has become very unstable: even replacing a
failed disk may trigger a chain reaction that causes large numbers of
OSDs to be wrongly marked down.
I am not sure whether this is because we have nearly 300 PGs on each
SAS OSD, and slightly more than 300 PGs on each SSD OSD.
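For context, the usual guideline is on the order of ~100 PGs per OSD. A quick back-of-envelope sketch (the 3x replication factor is an assumption, not stated in the post; the 720 SAS OSDs follow from 45 machines x 16 disks):

```python
# Rough PG-per-OSD check against the common ~100 PGs/OSD guideline.
# Assumptions: 3x replication, SAS pool spanning all SAS OSDs.
num_hosts = 45
sas_osds = num_hosts * 16    # 720 SAS OSDs
replica = 3                  # assumed replication factor

def pgs_per_osd(pg_num, replica, num_osds):
    """Average number of PG copies each OSD serves."""
    return pg_num * replica / num_osds

# Guideline pg_num for this pool would be roughly:
guideline_pg_num = sas_osds * 100 // replica    # 24000

print(pgs_per_osd(guideline_pg_num, replica, sas_osds))  # 100.0
print(pgs_per_osd(72000, replica, sas_osds))             # 300.0, as observed
```

So a pg_num around 72000 (for 3x replication) would explain the ~300 PGs per OSD seen here, about triple the guideline.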

From the logs, it always starts with osd_op_tp timing out, then OSDs
stop replying to heartbeats, and then large numbers of OSDs are
wrongly marked down.

1. 45 machines, each with 16 SAS and 8 SSD OSDs; all journals are
files in the OSD data directories.
2. The cluster is used for RBD only.
3. 300+ compute nodes hosting VMs.
4. Each OSD node currently has about a hundred thousand threads and
fifty thousand established network connections.
5. The machines are Dell R730xd, and Dell says there are no hardware
error logs.
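The thread and connection counts in point 4 are consistent with the simple messenger, which uses roughly one reader and one writer thread per established connection (a sketch, assuming those reported totals are per node):

```python
# Back-of-envelope check of the numbers in point 4.
# Assumes the pre-async "simple" messenger, which spawns roughly
# two threads (reader + writer) per established connection.
osds_per_node = 16 + 8             # SAS + SSD OSDs per machine
threads = 100_000                  # reported threads per node
connections = 50_000               # reported established connections

threads_per_conn = threads / connections        # 2.0 -> matches simple messenger
conns_per_osd = connections / osds_per_node     # ~2083 connections per OSD

print(threads_per_conn, round(conns_per_osd))
```

Roughly 2,000 connections (and 4,000 threads) per OSD daemon is a lot of scheduler and memory pressure, which could plausibly contribute to op-thread timeouts under load.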

So, has anyone else faced the same instability, or run with 300+ PGs
per OSD?
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com