Hi Linux-HA-dev and Alan,

We have a contribution to Heartbeat and would like to get it included in Heartbeat's version control system, so that it becomes part of future Heartbeat releases and, later on, of the distributions' packages.
Here is the code: http://svn.drbd.org/drbd/trunk/tools/

Having it in Heartbeat would really ease our lives, because it is currently quite painful to get it compiled for all the distributions on which we deploy DRBD-8.0 & Heartbeat clusters.

Here is an explanation of what it is good for (you need to read this from DRBD's point of view!):

7 Handle split brain situations; Support IO fencing;

  New commands:

    drbdadm outdate r0

  When the device is configured, this works via an ioctl() call. Otherwise it modifies the meta data directly by calling drbdmeta.

  Remove option: on-disconnect

  New meta-data flag: "Outdated"

  Introduce:

    disk {
      fencing [ dont-care | resource-only | resource-and-stonith ];
    }

    handlers {
      outdate-peer "some script";
    }

  If the disk state of the peer is unknown, drbd calls this handler (yes, a call to userspace from kernel space). The handler's return codes are:

    3 -> peer is inconsistent
    4 -> peer is outdated (this handler outdated it) [ resource fencing ]
    5 -> peer was down / unreachable
    6 -> peer is primary
    7 -> peer got stonithed [ node fencing ]

Let us assume that we have two boxes (N1 and N2) and that these two boxes are connected by two networks (net and cnet [ clients' net ]). Net is used by DRBD, while heartbeat uses both net and cnet.

I know that you are talking about fencing by STONITH, but DRBD is not limited to that. Here is my understanding of how resource fencing should work with DRBDv8:

   N1  net  N2
   P/S ---  S/P   everything up and running.
   P/? - -  S/?   network breaks; N1 freezes IO
   P/? - -  S/?   N1 fences N2:
                  In the STONITH case: turn off N2.
                  In the resource fencing case:
                    N1 asks N2 to fence itself from the storage via cnet.
                    HB calls "drbdadm outdate r0" on N2.
                    N2 replies to N1 via cnet that fencing is done.
                    The outdate-peer script on N1 returns success to DRBD.
   P/D - -  S/?   N1 thaws IO

N2 got the "Outdated" flag set in its meta-data by the outdate command.

Setting fencing to resource-only enables this behaviour.
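To make the handler contract concrete, here is a minimal Python sketch of how the return codes listed above could be decoded and checked against the configured fencing policy. Only the numbers 3 to 7 and the policy names come from the text; the enum member names, interpret() and simulated_handler() are hypothetical illustrations, not DRBD's actual implementation (which lives in the kernel module and the drbd tools). Treating 7 as invalid under resource-only follows from 7 being the node-fencing (STONITH) outcome; the dont-care policy is deliberately not modelled.

```python
from enum import IntEnum


class OutdatePeerResult(IntEnum):
    """outdate-peer handler return codes (numbers from the mail, names hypothetical)."""
    PEER_INCONSISTENT = 3
    PEER_OUTDATED = 4      # this handler outdated it [resource fencing]
    PEER_UNREACHABLE = 5   # peer was down / unreachable
    PEER_PRIMARY = 6
    PEER_STONITHED = 7     # peer got stonithed [node fencing]


def allowed_results(fencing: str) -> frozenset:
    """Return codes a handler may legitimately report under a fencing policy."""
    if fencing == "resource-only":
        # Resource fencing never shoots the peer, so 7 is not a valid answer.
        return frozenset(OutdatePeerResult) - {OutdatePeerResult.PEER_STONITHED}
    if fencing == "resource-and-stonith":
        return frozenset(OutdatePeerResult)
    raise ValueError(f"fencing policy {fencing!r} is not modelled in this sketch")


def interpret(fencing: str, exit_code: int) -> OutdatePeerResult:
    """Decode a handler exit code and reject codes the policy does not allow."""
    result = OutdatePeerResult(exit_code)  # raises ValueError for undocumented codes
    if result not in allowed_results(fencing):
        raise ValueError(f"exit code {exit_code} is not valid with fencing={fencing}")
    return result


def simulated_handler(peer_answers_on_cnet: bool) -> int:
    """Hypothetical stand-in for the configured "some script".

    In the walkthrough, N1 asks N2 via cnet to run "drbdadm outdate r0";
    here that round trip is reduced to a boolean.
    """
    if peer_answers_on_cnet:
        return int(OutdatePeerResult.PEER_OUTDATED)
    return int(OutdatePeerResult.PEER_UNREACHABLE)


if __name__ == "__main__":
    code = simulated_handler(peer_answers_on_cnet=True)
    print(interpret("resource-only", code).name)  # prints PEER_OUTDATED
```

Using an IntEnum means an undocumented exit code fails loudly in OutdatePeerResult(exit_code) instead of being silently taken as "fencing done".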
In the resource-only case the outdate-peer handler should return 3, 4, 5 or 6, but never 7.

If "fencing" is set to "resource-and-stonith", all IO operations are frozen immediately upon loss of connection (even currently outstanding IO operations will not complete). Then the "outdate-peer" handler is started. In this configuration the outdate-peer handler may return any of the documented return values. When the outdate-peer handler returns, IO is resumed.

Notes:

 * Why do we need to freeze IO in the "resource-and-stonith" case? STONITH protects you when all communication paths fail. In that case both (isolated) nodes try to stonith each other. If the current primary continued to allow IO, it could accept transactions but could still get stonithed by the current secondary node. Others could then have seen committed transactions that would be gone after the successful stonith operation.

 * The outdate-peer handler also gets called if an unconnected secondary wants to become primary. In other words, it may only become primary when it knows that the peer is outdated/inconsistent.

 * We need to store the fact that the peer is outdated/inconsistent in the meta-data, to allow a stand-alone primary to be rebooted.

-Philipp
--
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/