On 03 Jun 2021, at 21:41, Eric Robinson <[email protected]> wrote:
> 
> It's a good thing that DRBD faithfully replicates whatever is passed to it. 
> However, since that is true, it does tend to enable the problem of filesystem 
> corruption taking down a whole cluster. I'm just asking people for any 
> suggestions they may have for alleviating that problem. If it’s not fixable, 
> then it’s not fixable. 
>  
> Part of the reason I’m asking is because we’re about to build a whole new 
> data center, and after 15 years of using DRBD we are beginning to look at 
> other HA options, mainly because of the filesystem as a weak point. I should 
> mention that it has *never* happened before, but the thought of it is scary.

Oh, you’ve opened that can of worms, one of my favorite topics ;)

I guess I have bad news for you, because you have only just found the entrance 
to that rabbit hole. There are *lots* of things that can take down your entire 
cluster, and the filesystem is probably the least of your concerns, so I think 
you’re looking at the wrong thing. Unfortunately, none of them can be fixed by 
high-availability, because the problem area you are talking about is not 
high-availability, it’s high-reliability.

Let me give you a few examples of why high-reliability is something completely 
different from high-availability:

1. Imagine your application ends up in a corrupted state, but keeps running. 
Pacemaker might not even notice - the monitor operation possibly just sees that 
the process is still running, so the cluster sees no need to do anything, even 
though the application does not work anymore (see the first sketch after this 
list).

2. Imagine your application crashes and leaves its data behind in a corrupted 
state in a file on a perfectly good filesystem - e.g., it crashes after having 
written only 20% of the file’s content. Now Pacemaker restarts the application, 
but due to the corrupted content in its data file, the application cannot 
start. Pacemaker migrates the application to another node, which obviously - 
due to synchronous replication - has the same corrupted data. The application 
cannot start there either. The whole game continues until Pacemaker runs out of 
nodes to try to start the application, because it doesn’t work anywhere (see 
the second sketch after this list).

3. Even worse, there could be a bug hidden in Pacemaker or Corosync that 
crashes the cluster software on all nodes at the same time, so that 
high-availability is lost. Then, your application crashes. Nothing’s there to 
restart it anywhere.

4. Ultimate worst case: there could be a bug in the Linux kernel, especially 
somewhere in the network or I/O stack, that crashes all nodes simultaneously - 
typically on operations where all of the nodes are doing the same thing, which 
is not that atypical for clusters - e.g., replication to all nodes, distributed 
locking, etc. 
It’s not even that unlikely.
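
To make example 1 a bit more concrete: a plain monitor often only answers “is 
the process still there?”. Here is a minimal Python sketch of an 
application-level probe that a monitor could call instead - the health URL and 
the “OK” marker are purely made-up assumptions, your application would have to 
expose something equivalent:

    #!/usr/bin/env python3
    # Hypothetical application-level health probe (assumed /health endpoint):
    # instead of only checking that the process exists, ask the application
    # to prove that it can still do real work.
    import sys
    import urllib.error
    import urllib.request

    HEALTH_URL = "http://127.0.0.1:8080/health"  # assumption - adjust for your app

    def main() -> int:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                status = resp.status
                body = resp.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, OSError):
            return 1  # the process may still exist, but the service does not answer
        # report healthy only if the application says its internal state is sane
        return 0 if status == 200 and "OK" in body else 1

    if __name__ == "__main__":
        sys.exit(main())

Whether the cluster acts on that of course depends on wiring such a check into 
the resource agent’s monitor action; the sketch only shows the idea.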
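
And for example 2: keeping the data file from ever being left half-written is 
something only the application itself can do, for instance by writing a 
temporary file and renaming it into place. A rough Python sketch (the file name 
is just a placeholder):

    #!/usr/bin/env python3
    # Sketch of a crash-safe file update: after a crash, the data file is either
    # the old complete version or the new complete version - never 20% of the
    # new one. Replication and failover cannot add this after the fact.
    import os
    import tempfile

    def atomic_write(path, data):
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())   # make sure the new content is on disk
            os.replace(tmp_path, path)   # then swap it in atomically
            dir_fd = os.open(directory, os.O_RDONLY)
            try:
                os.fsync(dir_fd)         # persist the rename itself
            finally:
                os.close(dir_fd)
        except BaseException:
            try:
                os.unlink(tmp_path)      # clean up if anything went wrong
            except FileNotFoundError:
                pass
            raise

    if __name__ == "__main__":
        atomic_write("app-state.dat", b"complete, consistent state\n")

With that pattern a crash leaves either the old or the new file behind - but of 
course DRBD will still faithfully replicate whatever an application writes if 
it does not take such precautions.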

You might be shocked to hear that cases like 3 and 4 have already happened to 
me - while developing or testing, e.g. with experimental code. I have even 
crashed all nodes of an 8-node cluster simultaneously, and not just once. I 
have also had cases where my cluster fenced all of its nodes.
It’s not impossible - BUT it’s also not common on a well-tested production 
system that doesn’t continuously run tests of crazy corner cases like I do on 
my test systems.

Obviously, adding more nodes does not solve any of those problems. But the real 
question is whether your use case is so critical that you really need to 
prevent any of those from ever occurring (because they don’t seem to happen 
that often, otherwise we would have heard about it).

If it’s really that level of critical, then you’re running the wrong hardware, 
the wrong operating system and the wrong applications, and what you’re really 
looking for is a custom-designed high-reliability (not just high-availability) 
solution, with dissimilar hardware platforms, multiple independent code 
implementations, formally verified software design and implementation, etc. - 
like the ones used for special purpose medical equipment, safety-critical 
industrial equipment, avionics systems, nuclear reactor control, etc. - you get 
the idea. Now you know why those aren’t allowed to run on general-purpose hardware 
and software.
