Thank you, Sage, for the thorough answer. It just occurred to me to also ask about the gateway. The docs explain that one can supply content-md5 during an object PUT (which I assume is verified by the RGW), but does a GET respond with the ETag md5? (Sorry, I don't have a gateway running at the moment to check for myself, and the answer is relevant to this discussion anyway.)
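For concreteness, here is a minimal sketch of the client-side check I have in mind. It assumes Amazon S3 semantics: Content-MD5 is the base64-encoded MD5 digest of the body, and the ETag of an object written by a single, non-multipart PUT is the hex MD5 in double quotes. Whether the RGW returns the same thing is exactly the open question, and the helper names are made up for illustration.

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Value for the Content-MD5 request header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def etag_matches_body(etag: str, body: bytes) -> bool:
    """Compare a returned ETag with the hex MD5 of the bytes we received.

    Assumption: the object was stored with a single PUT. Multipart uploads
    produce ETags that are not a plain MD5 of the content.
    """
    return etag.strip().strip('"').lower() == hashlib.md5(body).hexdigest()
```

A client would send `content_md5_header(data)` with the PUT and, after a GET, call `etag_matches_body(response_etag, received_bytes)` to detect corruption end to end.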
Cheers, Dan

Sage Weil <s...@inktank.com> wrote:
>On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Hi all,
>> There has been some confusion the past couple days at the CHEP
>> conference during conversations about Ceph and protection from bit flips
>> or other subtle data corruption. Can someone please summarise the
>> current state of data integrity protection in Ceph, assuming we have an
>> XFS backend filesystem? i.e. don't rely on the protection offered by
>> btrfs. I saw in the docs that wire messages and journal writes are
>> CRC'd, but nothing explicit about the objects themselves.
>
>- Everything that passes over the wire is checksummed (crc32c). This is
>mainly because the TCP checksum is so weak.
>
>- The journal entries have a crc.
>
>- During deep scrub, we read the objects and metadata, calculate a crc32c,
>and compare across replicas. This detects missing objects, bitrot,
>failing disks, or any other source of inconsistency.
>
>- Ceph does not calculate and store a per-object checksum. Doing so is
>difficult because rados allows arbitrary overwrites of parts of an
>object.
>
>- Ceph *does* have a new opportunistic checksum feature, which is
>currently only enabled in QA. It calculates and stores checksums on
>whatever block size you configure (e.g., 64k) if/when we write/overwrite a
>complete block, and will verify any complete block read against the stored
>crc, if one happens to be available. This can help catch some but not all
>sources of corruption.
>
>> We also have some specific questions:
>>
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._, because it wasn't obvious when looking at the code?
>
>No (except for the new/experimental opportunistic crc I mention above).
>
>> 2. When is the checksum verified? Surely it is checked during the
>> deep scrub, but what about during an object read?
>
>For non-btrfs, no crc to verify. For btrfs, the fs has its own crc and
>verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit
>> flip but this hasn't yet been found by a deep scrub?
>
>Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the
>> checksum is different for the two objects -- which object wins? (I.e.
>> if you store the checksum locally, this is trivial since the
>> consistency of objects can be evaluated locally. Without the local
>> checksum, you can have conflicts.)
>
>In this case we normally choose the primary. The repair has to be
>explicitly triggered by the admin, however, and there are some options to
>control that choice.
>
>> 4. If the checksum is already stored per object in the OSD, is this
>> retrievable by librados? We have some applications which also need to
>> know the checksum of the data and this would be handy if it was already
>> calculated by Ceph.
>
>It would! It may be that the way to get there is to build an API to
>expose the opportunistic checksums, and/or to extend that feature to
>maintain full checksums (by re-reading partially overwritten blocks on
>write). (Note, however, that even this wouldn't cover xattrs and omap
>content; really this is something that "should" be handled by the backend
>storage/file system.)
>
>sage
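
To make the opportunistic checksum idea described above concrete, here is a rough sketch in Python. It is not Ceph's implementation: the class and method names are hypothetical, the in-memory `bytearray` stands in for the object on disk, and dropping the stored crc when a block is only partially overwritten is an assumption about how stale checksums would be avoided. The crc32c routine is a plain table-driven version of the Castagnoli CRC.

```python
def _make_crc32c_table():
    # Reflected CRC-32C (Castagnoli) polynomial.
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data, crc=0):
    """Table-driven CRC-32C over a bytes-like object."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


class OpportunisticObject:
    """Hypothetical object store entry with per-block opportunistic crcs."""

    def __init__(self, block_size=64 * 1024):
        self.block_size = block_size
        self.data = bytearray()
        self.crcs = {}  # block index -> crc32c of that complete block

    def write(self, offset, buf):
        if not buf:
            return
        end = offset + len(buf)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
        bs = self.block_size
        for idx in range(offset // bs, (end - 1) // bs + 1):
            start = idx * bs
            if offset <= start and start + bs <= end:
                # The write covers the whole block: store a fresh crc.
                self.crcs[idx] = crc32c(self.data[start:start + bs])
            else:
                # Partial overwrite: any stored crc is stale (assumed policy).
                self.crcs.pop(idx, None)

    def read(self, offset, length):
        end = min(offset + length, len(self.data))
        bs = self.block_size
        first = -(-offset // bs)  # first block starting at or after offset
        for idx in range(first, end // bs):
            stored = self.crcs.get(idx)
            if stored is None:
                continue  # no crc available, nothing to verify
            if crc32c(self.data[idx * bs:(idx + 1) * bs]) != stored:
                raise IOError("crc mismatch in block %d" % idx)
        return bytes(self.data[offset:end])
```

With a small block size, a partial overwrite leaves that block without a crc, so a later bit flip in it is returned to the reader unnoticed, while a flip in a block that still has a crc raises on read. That is the "some but not all" coverage mentioned in the reply.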
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com