Re: [gentoo-user] OT: btrfs raid 5/6

Rich Freeman Thu, 07 Dec 2017 15:10:16 -0800

On Thu, Dec 7, 2017 at 11:04 AM, Frank Steinmetzger <war...@gmx.de> wrote:
> On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote:
>
>> […]  They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based
>> solutions.  Maybe you can get by with less, but finding ARM systems with
>> even 4GB of RAM is tough, and even that means only one hard drive per
>> node, which means a lot of $40+ nodes to go on top of the cost of the
>> drives themselves.
>
> You can't really get ECC on ARM, right? So M-ITX was the next best choice. I
> have a tiny (probably one of the smallest available) M-ITX case for four
> 3,5″ bays and an internal 2.5″ mount:
> https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100
>


I don't think ECC is readily available on ARM (most of those boards
are SoCs where the RAM is integral and can't be expanded).  If CephFS
were designed with end-to-end checksums that wouldn't really matter
much, because the client would detect any error in a storage node and
could obtain a good copy from another node and trigger a resilver.
However, I don't think Ceph is quite there.  Checksums are used at
various points, but I think there are gaps where no checksum is
protecting the data.  That is one of the things I don't like about it.

If I were designing the checksums for it I'd probably have the client
compute the checksum and send it with the data, then at every step the
checksum is checked, and stored in the metadata on permanent storage.
Then when the ack goes back to the client that the data is written the
checksum would be returned to the client from the metadata, and the
client would do a comparison.  Any retrieval would include the client
obtaining the checksum from the metadata and then comparing it to the
data from the storage nodes.  I don't think this approach would really
add any extra overhead (the metadata needs to be recorded when writing
anyway, and read when reading anyway).  It just ensures there is a
checksum on separate storage from the data and that it is the one
captured when the data was first written.  A storage node could be
completely unreliable in this scenario, because the checksum used to
verify it is stored separately from it.  Storage nodes would still do their
own checksum verification anyway since that would allow errors to be
detected sooner and reduce latency, but this is not essential to
reliability.
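
To make the idea concrete, here is a rough Python sketch of that
write/read path.  It is only an illustration of the design I'm
describing, not anything Ceph does: the class and function names
(MetadataStore, StorageNode, client_write, client_read) are made up,
SHA-256 is an arbitrary choice of checksum, and the "nodes" are just
in-memory dicts.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption; any strong checksum would do.
    return hashlib.sha256(data).hexdigest()


class MetadataStore:
    """Hypothetical metadata service holding checksums apart from the data."""

    def __init__(self):
        self.sums = {}

    def record(self, key, digest):
        self.sums[key] = digest
        return self.sums[key]  # echoed back to the client in the write ack

    def lookup(self, key):
        return self.sums[key]


class StorageNode:
    """Hypothetical storage node; it is not trusted to be correct."""

    def __init__(self):
        self.blobs = {}

    def store(self, key, data, digest):
        # Optional early check: catches errors sooner, not needed for safety.
        if checksum(data) != digest:
            raise IOError("data corrupted before reaching the node")
        self.blobs[key] = data

    def fetch(self, key):
        return self.blobs[key]


def client_write(key, data, nodes, meta):
    digest = checksum(data)  # computed once, by the client
    for node in nodes:
        node.store(key, data, digest)
    acked = meta.record(key, digest)
    if acked != digest:
        raise IOError("metadata checksum differs from the client's")


def client_read(key, nodes, meta):
    expected = meta.lookup(key)  # the checksum captured at write time
    for node in nodes:
        data = node.fetch(key)
        if checksum(data) == expected:
            return data  # a real system would also repair the bad copies
    raise IOError("no replica matches the recorded checksum")
```

With two nodes, overwriting one node's copy with garbage after the
write still lets client_read return the original data from the other
node, because the comparison is always against the checksum the client
computed in the first place.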

Instead I think Ceph does not store checksums in the metadata.  The
client checksum is used to verify accurate transfer over the network,
but then the various nodes forget about it, and record the data.  If
the data is backed on ZFS/btrfs/bluestore then the filesystem would
compute its own checksum to detect silent corruption while at rest.
However, if the data were corrupted by faulty software or memory
failure after it was verified upon reception but before it was
re-checksummed prior to storage then you would have a problem.  In
that case a scrub would detect non-matching data between nodes but
with no way to determine which node is correct.
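
The scrub problem can be shown with a small Python sketch.  Again this
is a toy under my own assumptions, not Ceph's scrub logic: scrub() is a
made-up function, replicas are a dict of node name to bytes, and
SHA-256 stands in for whatever checksum would be used.

```python
import hashlib


def scrub(replicas, reference=None):
    """Compare replicas; `reference` is an optional write-time checksum.

    Returns (status, names_of_replicas_known_to_be_good).
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in replicas.items()}

    if reference is not None:
        # With an independent checksum we can tell which copies are good.
        good = sorted(n for n, d in digests.items() if d == reference)
        if len(good) == len(digests):
            return ("consistent", good)
        return ("repairable" if good else "lost", good)

    # Without one we can only tell whether the copies agree.
    if len(set(digests.values())) == 1:
        return ("consistent", sorted(digests))
    return ("mismatch", [])
```

Given two copies that differ, the call without a reference can only
report a mismatch.  Passing the checksum recorded at write time lets it
name the good copy, which is the piece I think is missing.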

If somebody with more knowledge of Ceph knows otherwise I'm all ears,
because this is one of those things that gives me a bit of pause.
Don't get me wrong - most other approaches have the same issues.  I
can reduce some of that risk with ECC, but that isn't practical
when you want many RAM-intensive storage nodes in the solution.

-- 
Rich
