> On 30 Jul 2016, at 8:32 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>
> I finally had time to investigate this, and it definitely is broken.
>
> The only existing heartbeat RA to use the *_notify_active_* variables is
> Filesystem, and it only does so for OCFS2 on SLES10, which didn't even
> ship pacemaker,

I'm pretty sure it did

> so I'm guessing it's been broken from the beginning of
> pacemaker.
>
> The fix looks straightforward, so I should be able to take care of it soon.
>
> Filed bug http://bugs.clusterlabs.org/show_bug.cgi?id=5295
>
>> On 05/08/2016 04:57 AM, Jehan-Guillaume de Rorthais wrote:
>> On Fri, 6 May 2016 15:41:11 -0500,
>> Ken Gaillot <kgail...@redhat.com> wrote:
>>
>>>> On 05/03/2016 05:30 PM, Jehan-Guillaume de Rorthais wrote:
>>>> On Tue, 3 May 2016 21:10:12 +0200,
>>>> Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
>>>>
>>>>> On Mon, 2 May 2016 17:59:55 -0500,
>>>>> Ken Gaillot <kgail...@redhat.com> wrote:
>>>>>
>>>>>>> On 04/28/2016 04:47 AM, Jehan-Guillaume de Rorthais wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> While testing and experimenting with our RA for PostgreSQL, I found
>>>>>>> that the meta_notify_active_* variables always seem to be empty.
>>>>>>> Here is an example of these variables as they are seen from our RA
>>>>>>> during a migration/switchover:
>>>>>>>
>>>>>>> {
>>>>>>>   'type' => 'pre',
>>>>>>>   'operation' => 'demote',
>>>>>>>   'active' => [],
>>>>>>>   'inactive' => [],
>>>>>>>   'start' => [],
>>>>>>>   'stop' => [],
>>>>>>>   'demote' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:1',
>>>>>>>       'uname' => 'hanode1'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'master' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:1',
>>>>>>>       'uname' => 'hanode1'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'promote' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:0',
>>>>>>>       'uname' => 'hanode3'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'slave' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:0',
>>>>>>>       'uname' => 'hanode3'
>>>>>>>     },
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:2',
>>>>>>>       'uname' => 'hanode2'
>>>>>>>     }
>>>>>>>   ],
>>>>>>> }
>>>>>>>
>>>>>>> In case this comes from our side, here is the code building this:
>>>>>>>
>>>>>>> https://github.com/dalibo/PAF/blob/6e86284bc647ef1e81f01f047f1862e40ba62906/lib/OCF_Functions.pm#L444
>>>>>>>
>>>>>>> But looking at the variable itself in debug logs, I always find it
>>>>>>> empty, in various situations (switchover, recover, failover).
>>>>>>>
>>>>>>> If I understand the documentation correctly, I would expect 'active' to
>>>>>>> list all three resources, shouldn't it? Currently, to work around this,
>>>>>>> we consider: active == master + slave
>>>>>>
>>>>>> You're right, it should. The pacemaker code that generates the "active"
>>>>>> variables is the same used for "demote" etc., so it seems unlikely the
>>>>>> issue is on pacemaker's side. Especially since your code treats active
>>>>>> etc. differently from demote etc., it seems like it must be in there
>>>>>> somewhere, but I don't see where.
>>>>>
>>>>> The code treats active, inactive, start and stop all together, for any
>>>>> cloned resource. If the resource is a multistate, it adds promote, demote,
>>>>> slave and master.
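
For anyone following along, here is a minimal sketch of the "active == master + slave" workaround described above. It is written in Python rather than the RA's Perl and is not taken from PAF. It assumes the notify variables follow the OCF_RESKEY_CRM_meta_notify_<set>_resource / _uname naming seen in this thread, and that each holds a space-separated list. The helper names notify_list and active_instances are hypothetical.

```python
import os


def notify_list(env, name):
    """Pair the space-separated resource and uname lists of one notify set.

    `name` is a set name such as "active", "master" or "slave".
    A missing variable, or one holding only a space, yields an empty list.
    """
    prefix = "OCF_RESKEY_CRM_meta_notify_" + name
    rscs = env.get(prefix + "_resource", "").split()
    unames = env.get(prefix + "_uname", "").split()
    return [{"rsc": r, "uname": u} for r, u in zip(rscs, unames)]


def active_instances(env):
    """Return the active set, falling back to master + slave when it is empty."""
    active = notify_list(env, "active")
    if active:
        return active
    # Workaround from the thread: treat active as master + slave.
    return notify_list(env, "master") + notify_list(env, "slave")


if __name__ == "__main__":
    for inst in active_instances(os.environ):
        print(inst["rsc"], inst["uname"])
```

The fallback only applies when the active set comes back empty, so the same code would keep working unchanged if the variables were populated.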
>>>>>
>>>>> Note that from this piece of code, the 7 other notify vars are set
>>>>> correctly: start, stop, inactive, promote, demote, slave, master. Only
>>>>> active is always missing.
>>>>>
>>>>> I'll investigate and try to find where the bug is hiding.
>>>>
>>>> So I added a piece of code to dump **all** the environment variables to
>>>> a temp file as early as possible **to avoid any interaction with our perl
>>>> module** in the code of the RA, i.e.:
>>>>
>>>> BEGIN {
>>>>     use Time::HiRes qw(time);
>>>>     my $now = time;
>>>>     open my $fh, ">", "/tmp/test-$now.env.txt";
>>>>     printf($fh "%-20s = ''%s''\n", $_, $ENV{$_}) foreach sort keys %ENV;
>>>> }
>>>>
>>>> Then I started my cluster and set maintenance-mode=false while no resources
>>>> were running. So the debug files contain the probe action, the start on all
>>>> nodes, one promote on the master and the first monitors. The "*active"
>>>> variables are always empty anywhere in the cluster. Attached is the
>>>> result of the following command on the master node:
>>>>
>>>> for i in test-*; do echo "===== $i ====="; grep OCF_ "$i"; done > debug-env.txt
>>>>
>>>> I'm using Pacemaker 1.1.13-10.el7_2.2-44eb2dd under CentOS 7.2.1511.
>>>>
>>>> For completeness, I added the Pacemaker configuration I use for my 3 node
>>>> dev/test cluster.
>>>>
>>>> Let me know if you think of more investigations and tests I could run on
>>>> this issue. I'm out of ideas for tonight (and I really would prefer having
>>>> this bug on my side).
>>>
>>> From your environment dumps, what I think is happening is that you are
>>> getting multiple notifications (start, pre-promote, post-promote) in a
>>> single cluster transition. So the variables reflect the initial state of
>>> that transition -- none of the instances are active, all three are being
>>> started (so the nodes are in the "*_start_*" variables), and one is
>>> being promoted.
>>
>>
>> Yes, this is what is happening here.
>> It's embarrassing I didn't think about
>> that :)
>>
>>> The starts will be done before the promote. If one of the starts fails,
>>> the transition will be aborted, and a new one will be calculated. So, if
>>> you get to the promote, you can assume anything in "*_start_*" is now
>>> active.
>>
>> I did another simple test:
>>
>> * 3 ms clones are running on hanode1 hanode2 hanode3
>> * master role is on hanode1
>> * I move the master role to hanode2 using:
>>   "pcs resource move pgsql-ha hanode2 --master"
>>
>> The transition gives us:
>>
>> * demote on hanode1
>> * promote on hanode2
>>
>> I suppose all three clones on hanode1, hanode2 and hanode3 should appear
>> in the active env variable in this context, shouldn't they?
>>
>> Please find attached the environment dumps of this transition from
>> hanode1. You'll see both "OCF_RESKEY_CRM_meta_notify_active_resource" and
>> "OCF_RESKEY_CRM_meta_notify_active_uname" only contain one char: a space.
>>
>> I started looking at the Pacemaker code, at least to get a better
>> understanding of where environment variables are set and when they are
>> available. I have been out of luck so far, but I am short on time. Any
>> pointers would be appreciated :)
>>
>>>> On a side note, I noticed with these debug files that the notify
>>>> variables were also available outside of notify actions (start and notify
>>>> here). Are they always available during "transition actions" (start, stop,
>>>> promote, demote)? Looking at the mysql RA, they are using
>>>> OCF_RESKEY_CRM_meta_notify_master_uname during the start action. So I
>>>> suppose it's safe?
>>>
>>> Good question, I've never tried that before. I'm reluctant to say it's
>>> guaranteed; it's possible seeing them in the start action is a side
>>> effect of the current implementation and could theoretically change in
>>> the future. But if mysql is relying on it, I suppose it's
>>> well-established already, making changing it unlikely ...
>>
>> Thank you very much for this clarification. Presently we keep in a private
>> attribute what we //think// (we cannot rely on active_uname :/) are the
>> active unames for the ms resource. As it seems that the notify vars
>> appearing outside of a notify action is just a side effect of the current
>> implementation, I prefer to stay away from them when we are not in a
>> notify action and keep our current implementation.
>>
>> Thank you,
>
>
> _______________________________________________
> Developers mailing list
> Developers@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/developers