Hi,
We have been having a lot of discussions at my workplace about whether to
employ a Ceph cluster in production or not, and if yes, how to set up the
hardware for it. During that discussion, I mentioned that, according to the
documentation, we should see significant speedups from using dedicated SSDs
for the OSDs' journals. Unfortunately, my colleagues did not like this idea at
all: many of them have had bad experiences with failing SSDs, or have at least
read a lot about such failures on the Internet, and the general consensus among
them is that SSDs are just not reliable enough yet for production servers.
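For reference, the setup I had in mind would look roughly like this in ceph.conf — a sketch only, with hypothetical device paths; the option names (osd journal, osd journal size) are from the Ceph documentation:

```
[osd]
    ; journal size in MB (illustrative value)
    osd journal size = 10240

[osd.0]
    ; hypothetical path: a dedicated partition on the shared journal SSD
    osd journal = /dev/disk/by-partlabel/journal-osd0
```

The idea being that the journal's sequential, synchronous writes land on the SSD while the data disk handles the filestore itself.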
This leads me to the question: what exactly can happen if an OSD's journal
device suddenly fails during operation? Can that lead to data loss or
corruption, or to a disruption of the service?
In my experience with the small three-machine test cluster I have here, a
single failed node would usually lead to a fairly severe outage of the entire
cluster, on the order of ten minutes or more (probably much more when a really
big node fails), though so far without any data loss or corruption...
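For what it's worth, I understand part of that outage window is governed by how long the monitors wait before marking a down OSD out and starting recovery — a sketch, with the option name taken from the Ceph docs and the value being the documented default, not something I have tuned:

```
[mon]
    ; seconds to wait before marking a down OSD "out" and
    ; triggering re-replication (300 is the documented default)
    mon osd down out interval = 300
```

So presumably a failed journal SSD taking down several OSDs at once would behave similarly, just multiplied across every OSD that journaled to that device.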
Regards,
Guido
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html