This is about pull request https://github.com/ClusterLabs/resource-agents/pull/222 "Merge redhat lvm.sh feature set into heartbeat LVM agent"
Apologies to the Cc for list duplicates. The Cc list was made by looking at the comments in the pull request, and some previous off-list thread. Even though this is about resource agent feature development, and thus actually a topic for the -dev list, I wanted to give this the maybe wider audience of the users list, to encourage feedback from people who actually *use* this feature with rgmanager, or intend to use it once it is in the pacemaker RA.

Here is my perception of this pull request, as such very subjective, and I may have gotten some intentions or facts wrong, so please correct me, or add whatever I may have missed.

Apart from a larger restructuring of the code, this introduces the feature of "exclusive activation" of LVM volume groups. From the commit message:

  This patch leaves the original LVM heartbeat functionality intact
  while adding these additional features from the redhat agent.

  1. Exclusive activation using volume group tags. This feature allows
     a volume group to live on shared storage within the cluster
     without requiring the use of cLVM for metadata locking.

  2. Individual logical volume activation for local and cluster volume
     groups by using the new 'lvname' option.

  3. Better setup validation when the 'exclusive' option is enabled.
     This patch validates that when exclusive activation is enabled,
     either a cluster volume group is in use with cLVM, or the tags
     variant is configured correctly. These new checks also make it
     impossible to enable exclusive activation for cloned resources.

That sounds great. Why even discuss it, of course we want that.

But I feel it does not do what it advertises. Rather, I think it gives a false sense of "exclusivity" that is actually not met. (Point 2, individual LV activation, is ok with me, I think; my difficulties are with the "exclusive by tagging" thingy.)

So what does it do. To activate a VG "exclusively", it uses "LVM tags" (see the LVM documentation about these). Any VG or LV can be tagged with a number of tags.
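As a side note for readers who have not used LVM tags before, this is roughly what they look like on the command line. A minimal sketch only; the VG name "vg0" is made up, and these commands obviously need an existing VG to do anything useful.

```shell
# Illustrative LVM tag handling, wrapped in a function so nothing runs
# on sourcing. Assumes a VG named "vg0" exists; the tag value used here
# is the local node name, as the agent does.
show_vg_tagging() {
    vgchange --addtag "$(uname -n)" vg0       # tag the VG with my node name
    vgs --noheadings -o vg_name,vg_tags vg0   # show current tags
    vgchange --deltag "$(uname -n)" vg0       # strip the tag again
}
```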
Here, only one tag is used (and any other tags will be stripped!). I try to contrast current behaviour and "exclusive" behaviour:

start:
  non-exclusive:
    just (try to) activate the VG
  exclusive by tag:
    check if the VG is currently tagged with my node name
    if not, is it tagged at all?
      if tagged, and that happens to be a node name that is in the
      current corosync membership: FAIL activation
      else, it is tagged, but that is not a node name, or not currently
      in the membership: strip any and all tags, then proceed
    if not FAILed because already tagged by another member,
    re-tag with *my* node name, and activate it.

It also double checks the "ownership" in

monitor:
  non-exclusive:
    I think due to the high timeout potential under load when using
    any LVM commands, this nowadays just checks for the presence of
    the /dev/$VGNAME directory, which is lightweight, and usually good
    enough (as the services *using* the LVs are monitored anyways).
  exclusive by tag:
    it does the above, then, if active, double checks that the current
    node name is also the current tag value, and if not, (tries to)
    deactivate (which will usually fail, as it can only succeed if the
    VG is unused), and returns failure to Pacemaker, which will then
    do its recovery cycle. By default, Pacemaker would stop all
    depending resources, stop this one, and restart the whole stack.

Which will, in a real split brain situation, just make sure that nodes will keep stealing it from each other; it does not prevent corruption in any way. In a non-split-brain case, this situation "can not happen" anyways. Unless two nodes raced to activate it, when it was untagged. Oops, so it does not prevent that either.

For completeness, on

stop:
  non-exclusive:
    just deactivate the VG
  exclusive by tag:
    double check that I am the tag "owner", then strip that tag (so no
    tag remains, the VG becomes untagged) and deactivate.

So the resource agent tries to double check membership information, as it seems to think it is smarter than Pacemaker.
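Put as (pseudo-)shell, the exclusive-by-tag start path above reads roughly like this. This is my paraphrase, not the agent's actual code; vg_tag() and is_cluster_member() are hypothetical helpers standing in for the real tag parsing and corosync membership lookup.

```shell
# Paraphrase of the exclusive-by-tag "start" path. vg_tag() prints the
# VG's current tag (empty if untagged); is_cluster_member() checks the
# current corosync membership. Both are placeholders, not real agent code.
exclusive_start() {
    local vg="$1" me owner
    me=$(uname -n)
    owner=$(vg_tag "$vg")

    if [ -n "$owner" ] && [ "$owner" != "$me" ]; then
        if is_cluster_member "$owner"; then
            echo "$vg is tagged by live member $owner, refusing" >&2
            return 1                        # FAIL activation
        fi
        # Tagged, but the owner is dead or not a node name at all:
        # strip the tag and steal the VG.
        vgchange --deltag "$owner" "$vg"
    fi
    [ "$owner" = "$me" ] || vgchange --addtag "$me" "$vg"
    vgchange -a y "$vg"                     # activate
}
```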
So what does that gain us above just trusting Pacemaker? What does that gain us above

start:
  strip all current tags
  tag with my node name
  activate

(if we insist on using tags, for whatever other reason we may have to use them)?

And, for monitor, you could add a $role=Stopped monitor action, to double check that it is not started where it is supposed to be stopped. The normal monitoring will only check that it is started where it is supposed to be started, from Pacemaker's point of view.

The thing is, Pacemaker primitives will only be started on one node. If that node leaves the membership, it will be stonithed, to make sure it is really gone, before starting the primitive somewhere else.

So why would the resource agent need to double check that Pacemaker did the right thing? Why would the resource agent think it is in a better position to determine whether or not it is started somewhere else, if it relies on the exact same infrastructure that Pacemaker relies on?

What about "split brain": exclusivity then can only be ensured by reliable stonith. If that is in place, Pacemaker has already made sure that this is started exclusively. If that is not in place, you get data corruption whether you configured that primitive to be "exclusive" or not: the currently active node "owns" the VG, but is not in the membership of the node that is about to activate it, which will simply relabel the thing and activate it anyways.

=> setting the "exclusive=1" attribute makes you "feel" safer, but you are not. That is a Bad Thing.

In the github comment function, Lon wrote:

> I believe the tagging/stripping and the way it's implemented is designed
> to prevent a few things:
> 1) obvious administrative errors within a running cluster:
>    - clone resource that really must NEVER be cloned
>    - executing agent directly while it's active
>    (and other things)

So we don't trust the Admin. What if the admin forgets to add "exclusive=1" to the primitive?
If the admin does not forget that, it is sufficient to just fail all actions (but stop) if $exclusive is set and we are operated as a clone.

> 2) bugs in the resource manager - betting your entire volume group
>    on "no bugs"?
>    "Don't do that" only goes so far, and is little comfort to an
>    administrator who has corrupt LVM metadata.

Uhm, ok. So the failure scenario is that a Pacemaker bug could lead to the start of one primitive on more than one node, and the resource agent is supposed to detect that and fail. So you don't trust Pacemaker, but you trust corosync and stonith, and you trust that your resource agent, by checking tags against membership before overriding them, gets it right.

Also, once a supposedly exclusive VG is activated concurrently, chances are that potential LVM *meta* data corruption is the lesser concern: you already have *data* corruption due to concurrent modifications, one node doing journal replays of stuff that is live on the other.

> There's probably some other bits and pieces I've forgotten;
> Jon Brassow would know.

Hey Jon ;-)

Anyone, any input? Is there any real use case for this implementation other than "I don't trust Pacemaker to be able to count to 1, but I still rely on the rest of the infrastructure"? Maybe that is a valid scenario. I just feel this is a layer violation, and it results in a false sense of safety which it actually does not provide. It seems to try to "simulate" SCSI-3 persistent reservations with unsuitable means.

What I'm suggesting is to clearly define what the goal of this "exclusive" feature is to be. Then check again whether we really want that, or whether it is actually already covered by Pacemaker some way or other.

Maybe this has been originally implemented on request of some customer, who was happy once he was able to say "exclusive=1", without thinking about technical details? Or maybe I'm just missing the point completely.
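For what it is worth, the "fail all actions (but stop)" check suggested above could be as small as this sketch. The exclusive parameter name is assumed; OCF_RESKEY_CRM_meta_clone is the variable Pacemaker sets for clone instances, and OCF_ERR_CONFIGURED is the standard OCF exit code for configuration errors.

```shell
# Refuse every action except "stop" when exclusive=1 and we are being
# run as a clone instance. Pacemaker sets OCF_RESKEY_CRM_meta_clone
# for clone instances; OCF_ERR_CONFIGURED is 6 per the OCF spec.
OCF_ERR_CONFIGURED=6
check_not_cloned() {
    local action="$1"
    if [ "${OCF_RESKEY_exclusive:-0}" = "1" ] \
       && [ -n "${OCF_RESKEY_CRM_meta_clone:-}" ] \
       && [ "$action" != "stop" ]; then
        echo "exclusive=1 must not be used with cloned resources" >&2
        return "$OCF_ERR_CONFIGURED"
    fi
    return 0
}
```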
Thanks,
Lars

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems