On 03/06/2021 13:50, Eric Robinson wrote:
-----Original Message-----
From: Digimer <[email protected]>
Sent: Wednesday, June 2, 2021 7:23 PM
To: Eric Robinson <[email protected]>; [email protected]
Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
On 2021-06-02 5:17 p.m., Eric Robinson wrote:
Since DRBD lives below the filesystem, if the filesystem gets
corrupted, then DRBD faithfully replicates the corruption to the other
node. Thus the filesystem is the SPOF in an otherwise shared-nothing
architecture.
What is the recommended way (if there is one) to avoid the filesystem
SPOF problem when clusters are based on DRBD?
-Eric
To start, HA, like RAID, is not a replacement for backups. That is the answer
to a situation like this. HA (and other availability systems like RAID)
protects against component failure. If a node fails, the peer recovers
automatically and your services stay online. That's what DRBD and other HA
solutions strive to provide: uptime.
If you want to protect against corruption (accidental or intentional, à la
cryptolockers), you need a robust backup system to _complement_ your HA
solution.
Yes, thanks, I've said for many years that HA is not a replacement for disaster
recovery. Still, it is better to avoid downtime than to recover from it, and
one of the main ways to achieve that is through redundancy, preferably a
shared-nothing approach. If I have a cool 5-node cluster and the whole thing
goes down because the filesystem gets corrupted, I can restore from backup, but
management is going to wonder why a 5-node cluster could not provide
availability. So the question remains: how to eliminate the filesystem as the
SPOF?
Some of the things being discussed here have nothing to do with drbd.
drbd provides a raw block-level device. It neither knows nor cares
what layers you place above it, whether they be filesystems or some
other block layer such as LVM or bcache.
It does a very specific job: ensure the blocks you write to a drbd
device get replicated and stored in real time on one or more other
distributed hosts. If you write a 512-byte block of random garbage
to a drbd device it will (and should) write the exact same garbage to
the other distributed hosts too, so that if you read that same 512-byte
block back from any one of those individual hosts, you'll get the exact
same garbage back.
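
To make that concrete, here is a minimal sketch of the behaviour described
above. It is a toy in-memory model, not drbd itself and not its API; the
class ToyReplicatedDevice and the helper replicas_agree are hypothetical
names invented for illustration. It only shows the principle that the
replication layer copies bytes verbatim and never interprets them.

```python
import os

BLOCK_SIZE = 512  # bytes, matching the block size used in the example above


class ToyReplicatedDevice:
    """Toy in-memory model of synchronous block replication (NOT drbd)."""

    def __init__(self, replica_count, block_count):
        # Each bytearray stands in for the backing disk of one host.
        self.replicas = [
            bytearray(BLOCK_SIZE * block_count) for _ in range(replica_count)
        ]

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        start = index * BLOCK_SIZE
        # The same bytes go to every replica, unmodified and uninterpreted.
        for replica in self.replicas:
            replica[start:start + BLOCK_SIZE] = data

    def read_block(self, replica_id, index):
        start = index * BLOCK_SIZE
        return bytes(self.replicas[replica_id][start:start + BLOCK_SIZE])


def replicas_agree(device, index):
    """Return True if every replica holds identical bytes for this block."""
    first = device.read_block(0, index)
    return all(
        device.read_block(i, index) == first
        for i in range(len(device.replicas))
    )


if __name__ == "__main__":
    device = ToyReplicatedDevice(replica_count=3, block_count=8)
    garbage = os.urandom(BLOCK_SIZE)  # random garbage, as in the text
    device.write_block(2, garbage)
    print(replicas_agree(device, 2))           # every host has the same block
    print(device.read_block(1, 2) == garbage)  # and it is exactly what was written
```

Whether the 512 bytes are valid filesystem metadata or noise makes no
difference to the model, which is the point: deciding what the data means
is the job of whatever sits above the block layer.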
The OP stated "if the filesystem gets corrupted, then DRBD faithfully
replicates the corruption to the other node." Good! That's exactly what
we want it to do. What we definitely do NOT want is for drbd to
manipulate the block data given to it in any way whatsoever; we want it
to replicate that data faithfully.