>>> David Vossel <[email protected]> wrote on 14.05.2013 at 15:54 in
>>> message <[email protected]>:
> ----- Original Message -----
>> From: "Lars Ellenberg" <[email protected]>
>> To: [email protected]
>> Cc: "David Vossel" <[email protected]>, "Fabio M. Di Nitto" <[email protected]>,
>>     "Andrew Beekhof" <[email protected]>, "Lars Marowsky-Bree" <[email protected]>,
>>     "Lon Hohberger" <[email protected]>, "Jonathan Brassow" <[email protected]>,
>>     "Dejan Muhamedagic" <[email protected]>
>> Sent: Tuesday, May 14, 2013 6:22:08 AM
>> Subject: LVM Resource agent, "exclusive" activation
>>
>>
>> This is about pull request
>> https://github.com/ClusterLabs/resource-agents/pull/222
>> "Merge redhat lvm.sh feature set into heartbeat LVM agent"
>>
>> Apologies to the Cc for list duplicates. The Cc list was made by looking
>> at the comments in the pull request, and some previous off-list thread.
>>
>> Even though this is about resource agent feature development,
>> and thus actually a topic for the -dev list,
>> I wanted to give this the maybe wider audience of the users list,
>> to encourage feedback from people who actually *use* this feature
>> with rgmanager, or intend to use it once it is in the pacemaker RA.
>>
>>
>> Here is my perception of this pull request, as such very subjective, and
>> I may have gotten some intentions or facts wrong, so please correct me,
>> or add whatever I may have missed.
>>
>>
>> Apart from a larger restructuring of the code, this introduces the
>> feature of "exclusive activation" of LVM volume groups.
>>
>> From the commit message:
>>
>>     This patch leaves the original LVM heartbeat functionality
>>     intact while adding these additional features from the redhat agent.
>>
>>     1. Exclusive activation using volume group tags. This feature
>>     allows a volume group to live on shared storage within the cluster
>>     without requiring the use of cLVM for metadata locking.
>>
>>     2. Individual logical volume activation for local and cluster
>>     volume groups by using the new 'lvname' option.
>>
>>     3. Better setup validation when the 'exclusive' option is enabled.
>>     This patch validates that when exclusive activation is enabled, either
>>     a cluster volume group is in use with cLVM, or the tags variant is
>>     configured correctly. These new checks also make it impossible to
>>     enable exclusive activation for cloned resources.
>>
>>
>> That sounds great. Why even discuss it? Of course we want that.
>>
>> But I feel it does not do what it advertises.
>> Rather, I think it gives a false sense of "exclusivity"
>> that is actually not met.
>>
>> (Point 2, individual LV activation, is OK with me, I think;
>> my difficulties are with the "exclusive by tagging" thingy.)
>>
>> So what does it do?
>>
>> To activate a VG "exclusively", it uses "LVM tags" (see the LVM
>> documentation about these).
>>
>> Any VG or LV can be tagged with a number of tags.
>> Here, only one tag is used (and any other tags will be stripped!).
>>
>> I will try to contrast the current behaviour with the "exclusive" behaviour:
>>
>> start:
>>   non-exclusive:
>>     just (try to) activate the VG
>>   exclusive by tag:
>>     check if the VG is currently tagged with my node name
>>     if not, is it tagged at all?
>>       if tagged, and that happens to be a node name that
>>       is in the current corosync membership:
>>         FAIL activation
>>       else, it is tagged, but that is not a node name,
>>       or not currently in the membership:
>>         strip any and all tags, then proceed
>>     if not FAILed because already tagged by another member,
>>     re-tag with *my* node name
>>     activate it
>>
>> Also it does double check the "ownership" in
>> monitor:
>>   non-exclusive:
>>     I think due to the high timeout potential under load
>>     when using any LVM commands, this just checks for the presence
>>     of the /dev/$VGNAME directory nowadays, which is lightweight,
>>     and usually good enough (as the services *using* the LVs are
>>     monitored anyway).
>>   exclusive by tag:
>>     it does the above, then, if active, double checks
>>     that the current node name is also the current tag value,
>>     and if not, (tries to) deactivate (which will usually fail,
>>     as it can only succeed if the VG is unused), and returns failure
>>     to Pacemaker, which will then do its recovery cycle.
>>
>> By default, Pacemaker would stop all depending resources,
>> stop this one, and restart the whole stack.
>>
>> Which will, in a real split-brain situation, just
>> make sure that nodes will keep stealing it from each other;
>> it does not prevent corruption in any way.
>>
>> In a non-split-brain case, this situation "can not happen"
>> anyway. Unless two nodes raced to activate it
>> when it was untagged.
>> Oops, so it does not prevent that either.
>>
>> For completeness, on
>> stop:
>>   non-exclusive:
>>     just deactivate the VG
>>   exclusive by tag:
>>     double check that I am the tag "owner",
>>     then strip that tag (so no tag remains, the VG becomes untagged)
>>     and deactivate.
>>
>> So the resource agent tries to double check membership information,
>> as it seems to think it is smarter than Pacemaker.
>>
>> So what does that gain us above just trusting Pacemaker?
>>
>> What does that gain us above
>> start:
>>   strip all current tags
>>   tag with my node name
>>   activate
>>
>> (if we insist on using tags,
>> for whatever other reason we may have to use them)?
>>
>> And, for monitor, you could add a $role=Stopped monitor action, to
>> double check that it is not started where it is supposed to be stopped.
>> The normal monitoring will only check that it is started where it is
>> supposed to be started, from Pacemaker's point of view.
>>
>>
>> The thing is, Pacemaker primitives will only be started on one node.
>> If that node leaves the membership, it will be stonithed, to make sure
>> it is really gone, before starting the primitive somewhere else.
>>
>> So why would the resource agent need to double check that Pacemaker did
>> the right thing?
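The exclusive-by-tag start decision described above boils down to a lookup of the current tag against the membership. A minimal sketch in shell, assuming the tag, own node name, and membership list have already been fetched; the function and variable names here are hypothetical, not the agent's actual code, which drives vgs/vgchange directly:

```shell
# Sketch only: the tag-based "exclusive" start decision.
# Names are hypothetical; the real agent queries LVM and corosync itself.
may_activate() {
    tag="$1"        # current tag on the VG ("" if untagged)
    self="$2"       # our own node name
    members="$3"    # space-separated current corosync membership

    [ -z "$tag" ] && return 0          # untagged: free to claim
    [ "$tag" = "$self" ] && return 0   # already tagged by us

    case " $members " in
        *" $tag "*) return 1 ;;        # tagged by a live member: FAIL start
        *)          return 0 ;;        # stale/foreign tag: strip and claim
    esac
}

# The simpler alternative suggested above amounts to, unconditionally:
#   vgchange --deltag "$old_tag" "$vg"      # strip current tags
#   vgchange --addtag "$(uname -n)" "$vg"   # tag with my node name
#   vgchange -a y "$vg"                     # activate
```

Note where the race sits in this sketch: two nodes that both see an untagged VG both get a green light, which is exactly the "two nodes raced to activate it when it was untagged" case.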
>> Why would the resource agent think it is in a better
>> position to determine whether or not it is started somewhere else,
>> if it relies on the exact same infrastructure that Pacemaker relies on?
>>
>> What about "split brain"?
>> Exclusivity then can only be ensured by reliable stonith.
>>
>> If that is in place, Pacemaker has already made sure
>> that this is started exclusively.
>>
>> If that is not in place, you get data corruption
>> whether you configured that primitive to be "exclusive" or not:
>> the currently active node "owns" the VG, but is not in the membership
>> of the node that is about to activate it, which will simply relabel the
>> thing and activate it anyway.
>>
>> => Setting the "exclusive=1" attribute makes you "feel" safer,
>> but you are not.
>> That is a Bad Thing.
>>
>> In the GitHub comment thread, Lon wrote:
>> > I believe the tagging/stripping and the way it's implemented is designed
>> > to prevent a few things:
>> > 1) obvious administrative errors within a running cluster:
>> >    - clone resource that really must NEVER be cloned
>> >    - executing agent directly while it's active
>> >    (and other things)
>>
>> So we don't trust the admin.
>> What if the admin forgets to add "exclusive=1" to the primitive?
>>
>> If he does not forget that, it is sufficient to just fail all actions
>> (but stop) if we are $exclusive and operated as a clone.
>>
>> > 2) bugs in the resource manager - betting your entire volume group
>> >    on "no bugs"?
>> >    "Don't do that" only goes so far, and is little comfort to an
>> >    administrator who has corrupt LVM metadata.
>>
>> Uhm, OK.
>> So the failure scenario is that a Pacemaker bug could lead to the
>> start of one primitive on more than one node,
>> and the resource agent is supposed to detect that and fail.
>>
>> So you don't trust Pacemaker, but you trust corosync and stonith,
>> and you trust that your resource agent, by checking tags against
>> membership before overriding them, gets it right.
>>
>> Also, once a supposedly exclusive VG is activated concurrently,
>> chances are that potential LVM *meta* data corruption is less of a
>> concern: you already have *data* corruption due to concurrent
>> modifications, one node doing journal replays of stuff that is live on
>> the other.
>>
>> > There's probably some other bits and pieces I've forgotten;
>> > Jon Brassow would know.
>>
>> Hey Jon ;-)
>>
>> Anyone, any input?
>>
>> Is there any real use case for this implementation other than
>> "I don't trust Pacemaker to be able to count to 1,
>> but I still rely on the rest of the infrastructure"?

> Here's what it comes down to. You aren't guaranteed exclusive activation
> just because Pacemaker is in control. There are scenarios with SAN disks
> where the node starts up and can potentially attempt to activate a volume
> before Pacemaker has initialized. Pacemaker would then shut down the volume
> immediately, but at that point it's too late. The volume got activated on
> multiple nodes when we explicitly wanted exclusive activation. The only way
> to guarantee exclusive activation without cLVMd is to use this node tagging
> feature to filter who is allowed to access the volume outside of Pacemaker.
>
> We trust Pacemaker, but Pacemaker isn't always in control when it comes to
> exclusive activation. This feature accounts for that. Jon would be able to
> answer more specific questions about this.
>
> -- Vossel
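For reference, the "filter who is allowed to access the volume outside of Pacemaker" part works through LVM's activation filter in lvm.conf: a VG tagged with another node's name never matches this host's volume_list, so an early-boot auto-activation attempt is refused before any cluster software runs. An illustrative fragment (the VG name and hostname are examples, not from this thread):

```
# /etc/lvm/lvm.conf (illustrative; "rootvg" and "node1" are example names)
activation {
    # Only VGs/LVs matching an entry here may be activated on this node.
    # "@node1" matches volumes tagged with this host's name;
    # the root VG stays listed so the node itself can still boot.
    volume_list = [ "rootvg", "@node1" ]
}
```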
Hi!

From what I know of HP-UX LVM, it is like this:
* You cannot activate a cluster VG unless the LVM daemon is running.
* The VG (the PVs) has a flag showing whether it is active or not, so
  even in split brain, two nodes cannot activate the same VG.
* The cluster explicitly has to know the cluster VGs.

Finally: It works well ;-)

Regards,
Ulrich

>
>>
>> Maybe that is a valid scenario.
>> I just feel this is a layer violation,
>> and it results in a false sense of safety,
>> which it actually does not provide.
>>
>> It seems to try to "simulate" SCSI-3 persistent reservations
>> with unsuitable means.
>>
>> What I'm suggesting is to clearly define
>> what the goal of this "exclusive" feature is to be.
>>
>> Then check again if we really want that,
>> or if it is actually already covered
>> by Pacemaker some way or other.
>>
>> Maybe this has been originally implemented on request of some customer,
>> who was happy once he was able to say "exclusive=1", without thinking
>> about technical details?
>>
>> Or maybe I'm just missing the point completely.
>>
>> Thanks,
>> Lars
>>
>>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
