First of all, don't do a Ceph upgrade while your cluster is in a warning or
error state. An upgrade must be done from a clean cluster.

Don't stay with a replica count of 2. The majority of problems come from
that point: just look at the advice given by experienced users on the list.
You should set a replica count of 3 and min_size to 2. This will prevent
you from losing data because of a double fault, which is frequent.
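A sketch of how to apply those settings, assuming a pool named `rbd` as an
example (substitute your own pool names; repeat for each pool):

```shell
# Sketch: raise replication on an existing pool.
ceph osd pool set rbd size 3      # keep 3 copies of each object
ceph osd pool set rbd min_size 2  # refuse client I/O below 2 copies

# For pools created in the future, set the defaults in ceph.conf:
#   [global]
#   osd pool default size = 3
#   osd pool default min size = 2
```

Note that raising `size` on an existing pool triggers backfill, so expect
recovery traffic until the extra copies are in place.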

For your specific problem, I have no idea of the root cause. If you have
already checked your network (tuning parameters, jumbo frames enabled,
etc.), the software versions on all components, and your hardware (RAID
card, system messages, ...), maybe you should just reinstall your first OSD
server. I had a big problem after an upgrade from Hammer to Jewel, and
nobody else seemed to have encountered it doing the same operation. All
servers were configured the same way, but they did not have the same
history. We found that the problem came from the different versions we had
installed on some OSD servers (Giant -> Hammer -> Jewel). OSD servers that
had never run Giant had no problem at all. On the problematic servers
(running Jewel) we hit bugs that had been fixed years ago in Giant! So we
had to isolate those servers and reinstall them directly with Jewel: that
solved the problem.
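To rule out the version-mismatch issue described above, one way to compare
what is actually running on each daemon is a sketch like this (assuming
Jewel-era CLI; `ceph tell osd.* version` queries every OSD in turn):

```shell
# Sketch: check for version skew across the cluster.
ceph --version            # package version on this admin node
ceph tell osd.* version   # version reported by each running OSD daemon
```

If any OSD reports an older version than the rest, that host is a candidate
for isolation and a clean reinstall.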
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
