Hi David,

On Fri, 2009-02-20 at 15:44 -0600, David Teigland wrote:
> Fencing devices that do not reboot a node, but just cut off storage, have
> always required the impractical step of re-enabling storage access after
> the node has been reset. We've never provided a mechanism to automate this
> unfencing.
>
> Below is an outline of how we might automate unfencing with some simple
> extensions to the existing fencing library, config scheme and agents. It
> does not involve the fencing daemon (fenced). Nodes would unfence
> themselves when they start up. We might also consider a scheme where a
> node is unfenced by *other* nodes when it starts up, if that has any
> advantage over self-unfencing.
One use case where we do need remote unfencing is recovering nodes that
boot from the shared storage, and those are not that uncommon.

I personally don't like the idea of exposing a -U option to users. It's a
shortcut that could easily be misused in an attempt to recover a node and
do more damage than anything else, but I can't see another solution
either.

> cluster3 is the context, but a similar thing would apply to a next
> generation unified fencing system, e.g.
> https://www.redhat.com/archives/cluster-devel/2008-October/msg00005.html
>
> init.d/cman would run:
>   cman_tool join
>   fence_node -U <ourname>
>   qdiskd
>   groupd
>   fenced
>   dlm_controld
>   gfs_controld
>   fence_tool join
>
> The new step fence_node -U <name> would call libfence:fence_node_undo(name).
> [fence_node <name> currently calls libfence:fence_node(name) to fence a
> node.]
>
> libfence:fence_node_undo(node_name) logic:
>   for each device_name under given node_name,
>     if an unfencedevice exists with name=device_name, then
>       run the unfencedevice agent with first arg of "undo"
>       and other args the normal combination of node and device args
>   (any agent used with unfencing must recognize/support "undo")

All our agents already support on/off enable/disable operations. It's
probably best to align them to use the same config options rather than
adding a new one across the board.

> [logic derived from cluster.conf structure and similar to fence_node
> logic]
>
> Example 1:
>
> <clusternode name="foo" nodeid="3">
>   <fence>
>     <method name="1">
>       <device name="san" node="foo"/>
>     </method>
>   </fence>
> </clusternode>
>
> <fencedevices>
>   <fencedevice name="san" agent="fence_scsi"/>
> </fencedevices>
>
> <unfencedevices>
>   <unfencedevice name="san" agent="fence_scsi"/>
> </unfencedevices>

I think we can avoid the whole <unfence* structure entirely, either by
overriding the default action="" for that fence method or by treating
unfencing as a special-case method.
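Just to make sure we are reading the same thing into the undo loop, here
is a throwaway shell sketch of it. The devices_for_node and
unfencedevice_exists helpers are fake stand-ins for the cluster.conf
lookups, not real API, and the hard-coded "san fence_scsi" pair mirrors
Example 1:

```shell
# Hypothetical sketch of libfence:fence_node_undo(node_name).
# Helper names and the canned config data are illustrative only.

# Stand-in for walking the node's <device name=.../> entries in
# cluster.conf: prints "device agent" pairs for the given node.
devices_for_node() {
    echo "san fence_scsi"
}

# Stand-in for checking that an <unfencedevice name="$1"/> exists.
unfencedevice_exists() {
    [ "$1" = "san" ]
}

fence_node_undo() {
    node_name=$1
    devices_for_node "$node_name" | while read dev agent; do
        if unfencedevice_exists "$dev"; then
            # Any agent used for unfencing must accept "undo" as its
            # first arg; the rest are the usual node and device args.
            echo "$agent undo node=$node_name device=$dev"
        fi
    done
}

fence_node_undo foo
# prints: fence_scsi undo node=foo device=san
```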
The idea is to contain the whole fence config for the node within the
<clusternode> object rather than spreading it even more. For example:

<method name="1">
  <device name="san" node="foo"/>
</method>
<method name="unfence">
  ...
</method>

OR

<method name="1">
  <device name="san" node="foo"/>
</method>
<method name="2" operation="unfence">
  ...
</method>

(clearly names and format are up for discussion)

> [Note: we've talked about fence_scsi getting a device list from
> /etc/cluster/fence_scsi.conf instead of from clvm. It would require
> more user configuration, but would create fewer problems and should
> be more robust.]

I think we should really consider starting a separate thread for this one.
It seems to be an increasingly recurring issue.

Fabio
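P.S. To make the second variant above concrete, a complete fragment for
Example 1 could look like the sketch below. The operation= and action=
attribute names are hypothetical and still up for discussion, as noted:

```xml
<clusternode name="foo" nodeid="3">
  <fence>
    <method name="1">
      <device name="san" node="foo"/>
    </method>
    <!-- hypothetical: unfencing as an ordinary method, with the
         default action overridden so the agent re-enables access
         instead of cutting it off -->
    <method name="2" operation="unfence">
      <device name="san" node="foo" action="on"/>
    </method>
  </fence>
</clusternode>
```

This keeps everything the node needs under <clusternode> and reuses the
existing <fencedevices> entry for "san", with no <unfencedevices> block.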
