>>> David Vossel <[email protected]> wrote on 14.05.2013 at 15:54 in
>>> message <[email protected]>:
> ----- Original Message -----
>> From: "Lars Ellenberg" <[email protected]>
>> To: [email protected]
>> Cc: "David Vossel" <[email protected]>, "Fabio M. Di Nitto" <[email protected]>,
>>     "Andrew Beekhof" <[email protected]>, "Lars Marowsky-Bree" <[email protected]>,
>>     "Lon Hohberger" <[email protected]>, "Jonathan Brassow" <[email protected]>,
>>     "Dejan Muhamedagic" <[email protected]>
>> Sent: Tuesday, May 14, 2013 6:22:08 AM
>> Subject: LVM Resource agent, "exclusive" activation
>>
>>
>> This is about pull request
>> https://github.com/ClusterLabs/resource-agents/pull/222
>> "Merge redhat lvm.sh feature set into heartbeat LVM agent"
>>
>> Apologies to the Cc for list duplicates. The Cc list was made by looking
>> at the comments in the pull request, and some previous off-list thread.
>>
>> Even though this is about resource agent feature development,
>> and thus actually a topic for the -dev list,
>> I wanted to give this the maybe wider audience of the users list,
>> to encourage feedback from people who actually *use* this feature
>> with rgmanager, or intend to use it once it is in the pacemaker RA.
>>
>>
>> Here is my perception of this pull request, as such very subjective, and
>> I may have gotten some intentions or facts wrong, so please correct me,
>> or add whatever I may have missed.
>>
>>
>> Apart from a larger restructuring of the code, this introduces the
>> feature of "exclusive activation" of LVM volume groups.
>>
>> From the commit message:
>>
>>     This patch leaves the original LVM heartbeat functionality
>>     intact while adding these additional features from the redhat agent.
>>
>>     1. Exclusive activation using volume group tags. This feature
>>     allows a volume group to live on shared storage within the cluster
>>     without requiring the use of cLVM for metadata locking.
>>
>>     2. Individual logical volume activation for local and cluster
>>     volume groups by using the new 'lvname' option.
>>
>>     3. Better setup validation when the 'exclusive' option is enabled.
>>     This patch validates that when exclusive activation is enabled, either
>>     a cluster volume group is in use with cLVM, or the tags variant is
>>     configured correctly. These new checks also make it impossible to
>>     enable exclusive activation for cloned resources.
>>
>>
>> That sounds great. Why even discuss it? Of course we want that.
>>
>> But I feel it does not do what it advertises.
>> Rather, I think it gives a false sense of "exclusivity"
>> that is actually not met.
>>
>> (Point 2, individual LV activation, is OK with me, I think;
>> my difficulties are with the "exclusive by tagging" thingy.)
>>
>> So what does it do?
>>
>> To activate a VG "exclusively", it uses "LVM tags" (see the LVM
>> documentation about these).
>>
>> Any VG or LV can be tagged with a number of tags.
>> Here, only one tag is used (and any other tags will be stripped!).
>>
>> I will try to contrast the current behaviour with the "exclusive" behaviour:
>>
>> start:
>>   non-exclusive:
>>     just (try to) activate the VG
>>   exclusive by tag:
>>     check if the VG is currently tagged with my node name
>>     if not, is it tagged at all?
>>       if tagged, and that happens to be a node name that
>>       is in the current corosync membership:
>>         FAIL activation
>>       else, it is tagged, but that is not a node name,
>>       or not currently in the membership:
>>         strip any and all tags, then proceed
>>     if not FAILed because already tagged by another member,
>>     re-tag with *my* node name
>>     activate it
>>
>> Also it does double check the "ownership" in
>> monitor:
>>   non-exclusive:
>>     I think due to the high timeout potential under load
>>     when using any LVM commands, this just checks for the presence
>>     of the /dev/$VGNAME directory nowadays, which is lightweight,
>>     and usually good enough (as the services *using* the LVs are
>>     monitored anyway).
>>   exclusive by tag:
>>     it does the above, then, if active, double checks
>>     that the current node name is also the current tag value,
>>     and if not, (tries to) deactivate (which will usually fail,
>>     as it can only succeed if the VG is unused), and returns failure
>>     to Pacemaker, which will then do its recovery cycle.
>>
>> By default, Pacemaker would stop all depending resources,
>> stop this one, and restart the whole stack.
>>
>> Which will, in a real split-brain situation, just
>> make sure that nodes will keep stealing it from each other;
>> it does not prevent corruption in any way.
>>
>> In a non-split-brain case, this situation "can not happen"
>> anyway. Unless two nodes raced to activate it
>> when it was untagged.
>> Oops, so it does not prevent that either.
>>
>> For completeness, on
>> stop:
>>   non-exclusive:
>>     just deactivate the VG
>>   exclusive by tag:
>>     double check that I am the tag "owner",
>>     then strip that tag (so no tag remains, the VG becomes untagged)
>>     and deactivate.
>>
>> So the resource agent tries to double check membership information,
>> as it seems to think it is smarter than Pacemaker.
>>
>> So what does that gain us above just trusting Pacemaker?
>>
>> What does that gain us above
>> start:
>>   strip all current tags
>>   tag with my node name
>>   activate
>>
>> (if we insist on using tags,
>> for whatever other reason we may have to use them)?
>>
>> And, for monitor, you could add a $role=Stopped monitor action, to
>> double check that it is not started where it is supposed to be stopped.
>> The normal monitoring will only check that it is started where it is
>> supposed to be started, from Pacemaker's point of view.
>>
>>
>> The thing is, Pacemaker primitives will only be started on one node.
>> If that node leaves the membership, it will be stonithed, to make sure
>> it is really gone, before starting the primitive somewhere else.
>>
>> So why would the resource agent need to double check that Pacemaker did
>> the right thing?
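The exclusive-by-tag start decision described above boils down to a lookup of the current tag against the membership. A minimal sketch in shell, assuming the tag, own node name, and membership list have already been fetched; the function and variable names here are hypothetical, not the agent's actual code, which drives vgs/vgchange directly:

```shell
# Sketch only: the tag-based "exclusive" start decision.
# Names are hypothetical; the real agent queries LVM and corosync itself.
may_activate() {
    tag="$1"        # current tag on the VG ("" if untagged)
    self="$2"       # our own node name
    members="$3"    # space-separated current corosync membership

    [ -z "$tag" ] && return 0          # untagged: free to claim
    [ "$tag" = "$self" ] && return 0   # already tagged by us

    case " $members " in
        *" $tag "*) return 1 ;;        # tagged by a live member: FAIL start
        *)          return 0 ;;        # stale/foreign tag: strip and claim
    esac
}

# The simpler alternative suggested above amounts to, unconditionally:
#   vgchange --deltag "$old_tag" "$vg"      # strip current tags
#   vgchange --addtag "$(uname -n)" "$vg"   # tag with my node name
#   vgchange -a y "$vg"                     # activate
```

Note where the race sits in this sketch: two nodes that both see an untagged VG both get a green light, which is exactly the "two nodes raced to activate it when it was untagged" case.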
>> Why would the resource agent think it is in a better
>> position to determine whether or not it is started somewhere else,
>> if it relies on the exact same infrastructure that Pacemaker relies on?
>>
>> What about "split brain"?
>> Exclusivity then can only be ensured by reliable stonith.
>>
>> If that is in place, Pacemaker has already made sure
>> that this is started exclusively.
>>
>> If that is not in place, you get data corruption
>> whether you configured that primitive to be "exclusive" or not:
>> the currently active node "owns" the VG, but is not in the membership
>> of the node that is about to activate it, which will simply relabel the
>> thing and activate it anyway.
>>
>> => Setting the "exclusive=1" attribute makes you "feel" safer,
>> but you are not.
>> That is a Bad Thing.
>>
>> In the GitHub comment thread, Lon wrote:
>> > I believe the tagging/stripping and the way it's implemented is designed
>> > to prevent a few things:
>> > 1) obvious administrative errors within a running cluster:
>> >    - clone resource that really must NEVER be cloned
>> >    - executing agent directly while it's active
>> >    (and other things)
>>
>> So we don't trust the admin.
>> What if the admin forgets to add "exclusive=1" to the primitive?
>>
>> If he does not forget that, it is sufficient to just fail all actions
>> (but stop) if we are $exclusive and operated as a clone.
>>
>> > 2) bugs in the resource manager - betting your entire volume group
>> >    on "no bugs"?
>> >    "Don't do that" only goes so far, and is little comfort to an
>> >    administrator who has corrupt LVM metadata.
>>
>> Uhm, OK.
>> So the failure scenario is that a Pacemaker bug could lead to the
>> start of one primitive on more than one node,
>> and the resource agent is supposed to detect that and fail.
>>
>> So you don't trust Pacemaker, but you trust corosync and stonith,
>> and you trust that your resource agent, by checking tags against
>> membership before overriding them, gets it right.
>>
>> Also, once a supposedly exclusive VG is activated concurrently,
>> chances are that potential LVM *meta* data corruption is less of a
>> concern: you already have *data* corruption due to concurrent
>> modifications, one node doing journal replays of stuff that is live on
>> the other.
>>
>> > There's probably some other bits and pieces I've forgotten;
>> > Jon Brassow would know.
>>
>> Hey Jon ;-)
>>
>> Anyone, any input?
>>
>> Is there any real use case for this implementation other than
>> "I don't trust Pacemaker to be able to count to 1,
>> but I still rely on the rest of the infrastructure"?

> Here's what it comes down to. You aren't guaranteed exclusive activation
> just because Pacemaker is in control. There are scenarios with SAN disks
> where the node starts up and can potentially attempt to activate a volume
> before Pacemaker has initialized. Pacemaker would then shut down the volume
> immediately, but at that point it's too late. The volume got activated on
> multiple nodes when we explicitly wanted exclusive activation. The only way
> to guarantee exclusive activation without cLVMd is to use this node tagging
> feature to filter who is allowed to access the volume outside of Pacemaker.
>
> We trust Pacemaker, but Pacemaker isn't always in control when it comes to
> exclusive activation. This feature accounts for that. Jon would be able to
> answer more specific questions about this.
>
> -- Vossel
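For reference, the "filter who is allowed to access the volume outside of Pacemaker" part works through LVM's activation filter in lvm.conf: a VG tagged with another node's name never matches this host's volume_list, so an early-boot auto-activation attempt is refused before any cluster software runs. An illustrative fragment (the VG name and hostname are examples, not from this thread):

```
# /etc/lvm/lvm.conf (illustrative; "rootvg" and "node1" are example names)
activation {
    # Only VGs/LVs matching an entry here may be activated on this node.
    # "@node1" matches volumes tagged with this host's name;
    # the root VG stays listed so the node itself can still boot.
    volume_list = [ "rootvg", "@node1" ]
}
```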
Hi!

From what I know of HP-UX LVM, it is like this:
* You cannot activate a cluster VG unless the LVM daemon is running.
* The VG (the PVs) has a flag showing whether it is active or not, so
  even in split brain, two nodes cannot activate the same VG.
* The cluster explicitly has to know the cluster VGs.

Finally: It works well ;-)

Regards,
Ulrich

>
>>
>> Maybe that is a valid scenario.
>> I just feel this is a layer violation,
>> and it results in a false sense of safety,
>> which it actually does not provide.
>>
>> It seems to try to "simulate" SCSI-3 persistent reservations
>> with unsuitable means.
>>
>> What I'm suggesting is to clearly define
>> what the goal of this "exclusive" feature is to be.
>>
>> Then check again if we really want that,
>> or if it is actually already covered
>> by Pacemaker some way or other.
>>
>> Maybe this has been originally implemented on request of some customer,
>> who was happy once he was able to say "exclusive=1", without thinking
>> about technical details?
>>
>> Or maybe I'm just missing the point completely.
>>
>> Thanks,
>> Lars
>>
>>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
