This is about pull request
https://github.com/ClusterLabs/resource-agents/pull/222
"Merge redhat lvm.sh feature set into heartbeat LVM agent"

Apologies to those on Cc for list duplicates.  The Cc list was compiled
from the comments in the pull request and some previous off-list threads.

Even though this is about resource agent feature development,
and thus actually a topic for the -dev list,
I wanted to give this the possibly wider audience of the users list,
to encourage feedback from people who actually *use* this feature
with rgmanager, or intend to use it once it is in the pacemaker RA.



Here is my perception of this pull request; as such it is very
subjective, and I may have gotten some intentions or facts wrong, so
please correct me, or add whatever I may have missed.


Apart from a larger restructuring of the code, this introduces the
feature of "exclusive activation" of LVM volume groups.

From the commit message:

        This patch leaves the original LVM heartbeat functionality
        intact while adding these additional features from the redhat agent.

        1. Exclusive activation using volume group tags. This feature
        allows a volume group to live on shared storage within the cluster
        without requiring the use of cLVM for metadata locking.

        2. Individual logical volume activation for local and cluster
        volume groups by using the new 'lvname' option.

        3. Better setup validation when the 'exclusive' option is enabled.
        This patch validates that when exclusive activation is enabled, either
        a cluster volume group is in use with cLVM, or the tags variant is
        configured correctly. These new checks also make it impossible to
        enable the exclusive activation for cloned resources.


That sounds great. Why even discuss it? Of course we want that.

But I feel it does not do what it advertises.
Rather, I think it gives a false sense of "exclusivity"
that is not actually delivered.

(Point 2, individual LV activation, is OK with me, I think;
 my difficulties are with the "exclusive by tagging" thingy.)

So what does it do?

To activate a VG "exclusively", it uses "LVM tags" (see the LVM
documentation about these).

Any VG or LV can be tagged with a number of tags.
Here, only one tag is used (and any other tags will be stripped!).
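
For reference, the tag plumbing is done with ordinary LVM commands,
roughly like this (a sketch based on the LVM man pages, not the agent's
literal code):

    # tag a VG with this node's name
    vgchange --addtag "$(uname -n)" $VGNAME

    # show the tags currently set on the VG
    vgs --noheadings -o tags $VGNAME

    # remove a tag again
    vgchange --deltag "$(uname -n)" $VGNAME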

Let me contrast the current behaviour with the "exclusive" behaviour:

start:
    non-exclusive:
        just (try to) activate the VG
    exclusive by tag (sketched in shell right after this list):
        check if the VG is currently tagged with my node name
        if not, is it tagged at all?
            if tagged, and that happens to be a node name that
               is in the current corosync membership:
                  FAIL activation
            else, it is tagged, but that is not a node name,
               or not currently in the membership:
                  strip any and all tags, then proceed
        if not FAILed because it is already tagged by another member,
        re-tag with *my* node name
        activate it.
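
In shell terms, that start logic amounts to roughly the following
(my paraphrase, not the agent's literal code; "is_current_member" stands
in for however the agent matches a tag against the corosync membership,
and crm_node -n for however it determines the local node name):

    MY_NODE=$(crm_node -n)
    OLD_TAG=$(vgs --noheadings -o tags $VGNAME | tr -d '[:space:]')

    if [ -n "$OLD_TAG" ] && [ "$OLD_TAG" != "$MY_NODE" ]; then
        if is_current_member "$OLD_TAG"; then
            exit $OCF_ERR_GENERIC              # FAIL: "owned" by a live member
        fi
        vgchange --deltag "$OLD_TAG" $VGNAME   # stale tag: strip it
    fi
    vgchange --addtag "$MY_NODE" $VGNAME       # claim "ownership"
    vgchange -ay $VGNAME                       # activate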

It also double checks the "ownership" in
monitor:
    non-exclusive:
        I think due to the high timeout potential under load
        when using any LVM commands, this just checks for the presence
        of the /dev/$VGNAME directory nowadays, which is lightweight,
        and usually good enough (as the services *using* the LVs are
        monitored anyways).
    exclusive by tag (sketched in shell below, after this list):
        it does the above, then, if active, double checks
        that the current node name is also the current tag value,
        and if not (tries to) deactivate (which will usually fail,
        as it can only succeed if it is unused), and returns failure
        to Pacemaker, which will then do its recovery cycle.

        By default, Pacemaker would stop all depending resources,
        stop this one, and restart the whole stack.

          Which will, in a real split-brain situation, just
          make sure that nodes keep stealing it from each other;
          it does not prevent corruption in any way.

          In a non-split-brain case, this situation "can not happen"
          anyways.  Unless two nodes raced to activate it
          while it was untagged.
          Oops, so it does not prevent that either.
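
Paraphrased as shell, the exclusive monitor is roughly:

    # lightweight activity check, as before
    [ -d /dev/$VGNAME ] || exit $OCF_NOT_RUNNING

    # the extra tag check
    TAG=$(vgs --noheadings -o tags $VGNAME | tr -d '[:space:]')
    if [ "$TAG" != "$(crm_node -n)" ]; then
        vgchange -an $VGNAME     # will usually fail while LVs are in use
        exit $OCF_ERR_GENERIC    # let Pacemaker do its recovery cycle
    fi
    exit $OCF_SUCCESS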

For completeness, on
stop:
   non-exclusive:
       just deactivate the VG
   exclusive by tag (sketch below):
       double check I am the tag "owner"
       then strip that tag (so no tag remains, the VG becomes untagged)
       and deactivate.
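
Or, again paraphrased:

    TAG=$(vgs --noheadings -o tags $VGNAME | tr -d '[:space:]')
    if [ "$TAG" = "$(crm_node -n)" ]; then
        vgchange --deltag "$TAG" $VGNAME   # the VG becomes untagged
    fi
    vgchange -an $VGNAME                   # deactivate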
          
So the resource agent tries to double check membership information,
as it seems to think it is smarter than Pacemaker.

So what does that gain us above just trusting pacemaker?

What does that gain us above
start:
   strip all current tags
   tag with my node name
   activate

(If we insist on using tags,
for whatever other reason we may have to use them;
a shell sketch follows.)
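
In shell, the whole of that simpler start would be something like
(a sketch; tags come out of vgs comma-separated):

    for t in $(vgs --noheadings -o tags $VGNAME | tr ',' ' '); do
        vgchange --deltag "$t" $VGNAME   # strip whatever is there
    done
    vgchange --addtag "$(crm_node -n)" $VGNAME
    vgchange -ay $VGNAME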

and, for monitor, you could add a $role=Stopped monitor action, to
double check that it is not started where it is supposed to be stopped.
The normal monitoring will only check that it is started where it is
supposed to be started, from Pacemaker's point of view.
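
In crm shell syntax, that could look like the following (a sketch;
resource name, VG name and intervals are made up):

    primitive p_vg ocf:heartbeat:LVM \
        params volgrpname="myvg" \
        op monitor interval="30s" \
        op monitor interval="45s" role="Stopped"

The role="Stopped" monitor is run on the nodes where the resource is
supposed to be inactive, which covers the "started where it must not
be" case without any tag games.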


The thing is, Pacemaker primitives will only be started on one node.
If that node leaves the membership, it will be stonithed, to make sure
it is really gone, before starting the primitive somewhere else.

So why would the resource agent need to double check that pacemaker did
the right thing? Why would the resource agent think it is in a better
position to determine whether or not it is started somewhere else,
if it relies on the exact same infrastructure that pacemaker relies on?

What about "split brain":
exclusivity then can only be ensured by reliable stonith.

If that is in place, pacemaker has already made sure
that this is started exclusively.

If that is not in place, you get data corruption
whether you configured that primitive to be "exclusive" or not:
the currently active node "owns" the VG, but is not in the membership
of the node that is about to activate it, which will simply relabel the
thing and activate it anyways.

   => setting "exclusive=1" attribute makes you "feel" safer,
   but you are not.
   That is a Bad Thing.

In the github comment function,
Lon wrote:
  > I believe the tagging/stripping and the way it's implemented is designed
  > to prevent a few things:
  > 1) obvious administrative errors within a running cluster: - clone
  >    resource that really must NEVER be cloned - executing agent directly
  >    while it's active (and other things)

  So we don't trust the Admin.
  What if the admin forgets to add "exclusive=1" to the primitive?

  If he does not forget it, it is sufficient to just fail all actions
  (but stop) if $exclusive is set and we are operated as a clone.
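
  A few lines in the agent's validation step would cover that (a
  sketch; it relies on the OCF_RESKEY_CRM_meta_* variables Pacemaker
  exports to clone instances):

      if ocf_is_true "$OCF_RESKEY_exclusive" &&
         [ -n "$OCF_RESKEY_CRM_meta_clone_max" ]; then
          ocf_log err "LVM: exclusive activation must not be cloned"
          exit $OCF_ERR_CONFIGURED
      fi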

  > 2) bugs in the resource manager - betting your entire volume group
  > on "no bugs"?
  > "Don't do that" only goes so far, and is little comfort to an
  > administrator who has corrupt LVM metadata.
  
  Uhm, ok.
  So the failure scenario is that a Pacemaker bug could lead to
  start of one primitive on more than one node,
  and the resource agent is supposed to detect that and fail.

  So you don't trust Pacemaker, but you trust corosync and stonith,
  and you trust that your resource agent by checking tags against
  membership before overriding them gets it right.

  Also, once a supposedly exclusive VG is activated concurrently,
  chances are that potential LVM *meta* data corruption is less of a
  concern: you already have *data* corruption due to concurrent
  modifications, one node doing journal replays of stuff that is live on
  the other.

  > There's probably some other bits and pieces I've forgotten;
  > Jon Brassow would know.

Hey Jon ;-)

Anyone, any input?

Is there any real use case for this implementation other than
"I don't trust Pacemaker to be able to count to 1,
but I still rely on the rest of the infrastructure"?

Maybe that is a valid scenario.
I just feel this is a layering violation,
and that it suggests a safety
which it does not actually provide.

It seems to try to "simulate" SCSI-3 persistent reservations
with unsuitable means.

What I'm suggesting is to clearly define
what the goal of this "exclusive" feature is to be.

Then check again if we really want that,
or if it is actually already covered
by Pacemaker some way or other.

Maybe this was originally implemented at the request of some customer
who was happy once he was able to say "exclusive=1", without thinking
about the technical details?

Or maybe I'm just missing the point completely.

Thanks,
        Lars
