Thank you Sage for the thorough answer.

It just occurred to me to also ask about the gateway. The docs explain that one 
can supply content-md5 during an object PUT (which I assume is verified by the 
RGW), but does a GET respond with the ETag md5? (Sorry, I don't have a gateway 
running at the moment to check for myself, and the answer is relevant to this 
discussion anyway).
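
A minimal sketch of the client-side check in question. It assumes the
gateway follows the S3 convention for simple (non-multipart) PUTs, where
the ETag is the hex MD5 of the object body; whether RGW actually does this
is exactly what is being asked. The helper names are made up for
illustration.

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Value for the Content-MD5 request header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def etag_matches(body: bytes, etag: str) -> bool:
    """True if a hex-MD5 ETag (possibly wrapped in quotes) matches the body.

    Assumption: the ETag is a plain hex MD5, as for non-multipart S3 uploads.
    """
    return etag.strip('"').lower() == hashlib.md5(body).hexdigest()
```

On a GET, the client would recompute the MD5 of the bytes it received and
compare it with the returned ETag using etag_matches().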

Cheers,
Dan

Sage Weil <s...@inktank.com> wrote:
>On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Hi all,
>> There has been some confusion the past couple days at the CHEP
>> conference during conversations about Ceph and protection from bit flips
>> or other subtle data corruption. Can someone please summarise the
>> current state of data integrity protection in Ceph, assuming we have an
>> XFS backend filesystem? I.e., don't rely on the protection offered by
>> btrfs. I saw in the docs that wire messages and journal writes are
>> CRC'd, but nothing explicit about the objects themselves.
>
>- Everything that passes over the wire is checksummed (crc32c).  This is
>mainly because the TCP checksum is so weak.
>
>- The journal entries have a crc.
>
>- During deep scrub, we read the objects and metadata, calculate a crc32c,
>and compare across replicas.  This detects missing objects, bitrot,
>failing disks, or any other source of inconsistency.
>
>- Ceph does not calculate and store a per-object checksum.  Doing so is
>difficult because rados allows arbitrary overwrites of parts of an object.
>
>- Ceph *does* have a new opportunistic checksum feature, which is
>currently only enabled in QA.  It calculates and stores checksums on
>whatever block size you configure (e.g., 64k) if/when we write/overwrite a
>complete block, and will verify any complete block read against the stored
>crc, if one happens to be available.  This can help catch some but not all
>sources of corruption.
>
>> We also have some specific questions:
>> 
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._, because it wasn't obvious when looking at the code?
>
>No (except for the new/experimental opportunistic crc I mention above).
>
>> 2. When is the checksum verified? Surely it is checked during the
>> deep scrub, but what about during an object read?
>
>For non-btrfs, no crc to verify.  For btrfs, the fs has its own crc and
>verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit
>> flip but this hasn't yet been found by a deep scrub?
>
>Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the
>> checksum is different for the two objects -- which object wins? (I.e.
>> if you store the checksum locally, this is trivial since the
>> consistency of objects can be evaluated locally. Without the local
>> checksum, you can have conflicts.)
>
>In this case we normally choose the primary.  The repair has to be
>explicitly triggered by the admin, however, and there are some options to
>control that choice.
>
>> 4. If the checksum is already stored per object in the OSD, is this
>> retrievable by librados? We have some applications which also need to
>> know the checksum of the data and this would be handy if it was already
>> calculated by Ceph.
>
>It would!  It may be that the way to get there is to build an API to
>expose the opportunistic checksums, and/or to extend that feature to
>maintain full checksums (by re-reading partially overwritten blocks on
>write).  (Note, however, that even this wouldn't cover xattrs and omap
>content; really this is something that "should" be handled by the backend
>storage/file system.)
>
>sage
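
The opportunistic checksum behaviour described above can be sketched
roughly as follows. This is a toy in-memory model, not Ceph code: the class
and method names are hypothetical, zlib.crc32 stands in for crc32c (which
is not in the Python standard library), and dropping the stored crc on a
partial overwrite is an assumption about how stale checksums are avoided.

```python
import zlib


class OpportunisticChecksums:
    """Toy object store that keeps a crc only for fully written blocks."""

    def __init__(self, block_size=64 * 1024):
        self.block_size = block_size
        self.data = bytearray()
        self.crcs = {}  # block index -> crc of that complete block

    def write(self, offset, buf):
        if not buf:
            return
        end = offset + len(buf)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
        bs = self.block_size
        for idx in range(offset // bs, (end - 1) // bs + 1):
            start = idx * bs
            if offset <= start and start + bs <= end:
                # The write covers the whole block: store a fresh crc.
                self.crcs[idx] = zlib.crc32(self.data[start:start + bs])
            else:
                # Partial overwrite: any stored crc is now stale, drop it.
                self.crcs.pop(idx, None)

    def read(self, offset, length):
        end = min(offset + length, len(self.data))
        bs = self.block_size
        for idx in range(offset // bs, (end + bs - 1) // bs):
            start = idx * bs
            covered = offset <= start and start + bs <= end
            # Verify only complete blocks that happen to have a stored crc.
            if covered and idx in self.crcs:
                if zlib.crc32(self.data[start:start + bs]) != self.crcs[idx]:
                    raise OSError(f"crc mismatch in block {idx}")
        return bytes(self.data[offset:end])
```

In this model, corruption in a block that was only ever partially written,
or that is only partially read, goes unnoticed, which is why the feature
catches some but not all sources of corruption.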
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com