On 12/22/2011 09:19 PM, Ulrich Windl wrote:
> Hello!
>
> Reading the DRBD Guide for DRBD with OCFS (with pacemaker), it suggests that
> fencing needs to be done whenever there is a problem with one of the nodes
> running DRBD.
>
> I really wonder why: Why shoot the node if one out of several resources has a
> problem? Why not try a disconnect/reconnect first? It should be faster anyway.
>
> Also if you are using different networks for cluster, access, and
> replication, why assume that the cluster communication is dead if one DRBD
> resource has a problem? While it may sound incredibly cool for the
> developers to reset any node in the cluster, this is the most annoying thing
> in practice, especially as you have little chance of debugging the problems.
>
> Would someone explain the rationale behind this?

Kind of depends on exactly what's broken, but, in general, if *any*
filesystem/storage resource fails and cannot be cleanly stopped on some
node, the only safe thing to do is kill the entire node. If you don't do
this, you can't safely restart the resource on another node (non-clustered
filesystem), and/or can't continue writing to a clustered filesystem like
OCFS2 on some surviving node without risking data corruption.

Slightly more specifically regarding dual primary DRBD, from
http://www.linbit.com/en/education/tech-guides/dual-primary-think-twice/

"When having a Dual-Primary resource, we principally have to assume that
as soon as the two nodes sharing that DRBD drive get disconnected from
each other, uncoordinated write attempts can happen on either of them.
Measures need to be taken to make sure that when node is in trouble, that
node can not cause corruption of a set of data anymore - welcome to
fencing."

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
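
To make the rule above concrete, here is a minimal Python sketch of the
recovery decision. It is only an illustration of the reasoning, not how
Pacemaker or DRBD implement it; the names StopResult and recovery_action
are hypothetical and do not come from any cluster API.

```python
from enum import Enum


class StopResult(Enum):
    """Hypothetical outcomes of trying to stop a storage resource on a node."""
    CLEAN = "clean"      # resource stopped and confirmed down
    FAILED = "failed"    # the stop operation returned an error
    UNKNOWN = "unknown"  # node unreachable, state cannot be confirmed


def recovery_action(stop_result: StopResult) -> str:
    """Return what must happen before the storage may be used elsewhere."""
    if stop_result is StopResult.CLEAN:
        # The old node is known to have released the storage.
        return "restart-elsewhere"
    # Failed or unknown: the old node might still be writing, so it has
    # to be fenced first. Only then is recovery on a survivor safe.
    return "fence-node-then-recover"


if __name__ == "__main__":
    for result in StopResult:
        print(f"{result.value:8} -> {recovery_action(result)}")
```

The point of the sketch is that "failed" and "unknown" are treated the
same way: in both cases nobody can prove the node has stopped writing, so
a disconnect/reconnect alone does not make it safe to carry on.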
