Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

Robert LeBlanc Fri, 09 Jan 2015 03:23:08 -0800

On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer <ch...@gol.com> wrote:
> On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
> Which of course currently means a strongly consistent lockup in these
> scenarios. ^o^


That is one way of putting it.

> Slightly off-topic and snarky, that strong consistency is of course of
> limited use when in the case of a corrupted PG Ceph basically asks you to
> toss a coin.
> As in minor corruption, impossible for a mere human to tell which
> replica is the good one, because one OSD is down and the 2 remaining ones
> differ by one bit or so.

This is where checksumming is supposed to come in. I think Sage has been
leading that initiative. Basically, when an OSD reads an object, it should
be able to detect bit rot by hashing what it just read and comparing the
result with the checksum (MD5SUM, say) that it computed when it first
received the object. If they don't match, it can ask another OSD until it
finds a copy that does match.
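
To make that read path concrete, here is a rough sketch in Python. It is
not how Ceph does it (or will do it); the function names, the in-memory
list standing in for replicas on different OSDs, and the use of MD5 are
all just assumptions for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    # MD5 is only the example named in this thread; any digest would do.
    return hashlib.md5(data).hexdigest()


def read_verified(replicas, expected):
    """Return (data, bad) where data is the first replica whose checksum
    matches `expected` and bad lists the indices that failed the check.

    `replicas` is a hypothetical stand-in for copies held by different OSDs.
    """
    bad = []
    for i, data in enumerate(replicas):
        if checksum(data) == expected:
            return data, bad
        bad.append(i)  # this copy has rotted; try the next OSD
    raise IOError("no replica matches the stored checksum")


original = b"object payload"
stored = checksum(original)  # computed when the object was first written

# Simulate bit rot: flip one bit in the first byte of one replica.
rotted = bytes([original[0] ^ 0x01]) + original[1:]

data, bad = read_verified([rotted, original], stored)
```

With only two copies that differ by one bit, the stored checksum is what
tells you which one is good, so there is no coin toss.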

This provides a number of benefits:

   1. Protect against bit rot. Checked on read and on deep scrub.
   2. Automatically recover the correct version of the object.
   3. If the client computes the MD5SUM before the data is sent over the
   wire, the data can be verified end to end, through the memory of several
   machines/devices/cables/etc.
   4. Getting by with "size" 2 is less risky for those who really want to
   do that.
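
For point 3, the idea would look roughly like this. Again, this is only a
sketch under my own assumptions: `client_prepare` and `osd_accept` are
made-up names, not Ceph functions, and MD5 is used only because it is the
example above.

```python
import hashlib


def client_prepare(payload: bytes):
    # The client hashes the data before it leaves its own memory.
    return payload, hashlib.md5(payload).hexdigest()


def osd_accept(payload: bytes, digest: str) -> bool:
    # The receiver re-hashes what actually arrived and compares.
    return hashlib.md5(payload).hexdigest() == digest


payload, digest = client_prepare(b"hello ceph")
ok_clean = osd_accept(payload, digest)

# Simulate a single bit flipped somewhere between client and OSD.
flipped = bytearray(payload)
flipped[3] ^= 0x10
ok_corrupt = osd_accept(bytes(flipped), digest)
```

The clean payload is accepted and the flipped one is rejected, so the
corruption is caught before a bad copy is ever stored.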

These benefits come with a trade-off, mostly CPU. However, with the
inclusion of AES (and CRC32C, for plain checksums) instructions in silicon,
it may not be a huge issue now. But I'm not a programmer, and I'm not
familiar enough with this aspect of the Ceph code to be authoritative in
any way.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
