Re: [ClusterLabs Developers] OCF_RESKEY_CRM_meta_notify_active_* always empty

2016-07-30 Thread Andrew Beekhof


> On 30 Jul 2016, at 8:32 AM, Ken Gaillot wrote:
> 
> I finally had time to investigate this, and it definitely is broken.
> 
> The only existing heartbeat RA to use the *_notify_active_* variables is
> Filesystem, and it only does so for OCFS2 on SLES10, which didn't even
> ship pacemaker,

I'm pretty sure it did

> so I'm guessing it's been broken from the beginning of
> pacemaker.
> 
> The fix looks straightforward, so I should be able to take care of it soon.
> 
> Filed bug http://bugs.clusterlabs.org/show_bug.cgi?id=5295
> 
>> On 05/08/2016 04:57 AM, Jehan-Guillaume de Rorthais wrote:
>> On Fri, 6 May 2016 15:41:11 -0500,
>> Ken Gaillot wrote:
>> 
>>>> On 05/03/2016 05:30 PM, Jehan-Guillaume de Rorthais wrote:
>>>> On Tue, 3 May 2016 21:10:12 +0200,
>>>> Jehan-Guillaume de Rorthais wrote:
>>>> 
>>>>> On Mon, 2 May 2016 17:59:55 -0500,
>>>>> Ken Gaillot wrote:
>>>>> 
>>>>>>> On 04/28/2016 04:47 AM, Jehan-Guillaume de Rorthais wrote:
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> While testing and experimenting with our RA for PostgreSQL, I found that
>>>>>>> the meta_notify_active_* variables always seem to be empty. Here is an
>>>>>>> example of these variables as they are seen from our RA during a
>>>>>>> migration/switchover:
>>>>>>> 
>>>>>>> {
>>>>>>>   'type' => 'pre',
>>>>>>>   'operation' => 'demote',
>>>>>>>   'active' => [],
>>>>>>>   'inactive' => [],
>>>>>>>   'start' => [],
>>>>>>>   'stop' => [],
>>>>>>>   'demote' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:1',
>>>>>>>       'uname' => 'hanode1'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'master' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:1',
>>>>>>>       'uname' => 'hanode1'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'promote' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:0',
>>>>>>>       'uname' => 'hanode3'
>>>>>>>     }
>>>>>>>   ],
>>>>>>>   'slave' => [
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:0',
>>>>>>>       'uname' => 'hanode3'
>>>>>>>     },
>>>>>>>     {
>>>>>>>       'rsc' => 'pgsqld:2',
>>>>>>>       'uname' => 'hanode2'
>>>>>>>     }
>>>>>>>   ],
>>>>>>> }
>>>>>>> 
>>>>>>> In case this comes from our side, here is the code that builds this:
>>>>>>> 
>>>>>>> https://github.com/dalibo/PAF/blob/6e86284bc647ef1e81f01f047f1862e40ba62906/lib/OCF_Functions.pm#L444
>>>>>>> 
>>>>>>> But looking at the variable itself in debug logs, I always find it
>>>>>>> empty, in various situations (switchover, recover, failover).
>>>>>>> 
>>>>>>> If I understand the documentation correctly, I would expect 'active' to
>>>>>>> list all three resources, shouldn't it? Currently, to work around this,
>>>>>>> we consider: active == master + slave
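
The workaround quoted above (active == master + slave) can be sketched in shell. This is only an illustration, not the PAF code: `notify_active_fallback` is a hypothetical helper name, and the sketch assumes the space-separated `OCF_RESKEY_CRM_meta_notify_*_resource` lists that Pacemaker exports to notify actions.

```shell
# Hypothetical helper: print the "active" resource list, falling back to
# master + slave when the active variable is empty (the situation reported
# in this thread).
notify_active_fallback() {
    if [ -n "${OCF_RESKEY_CRM_meta_notify_active_resource:-}" ]; then
        # Trust the variable when it is populated
        echo "$OCF_RESKEY_CRM_meta_notify_active_resource"
        return 0
    fi
    # Unquoted expansion is deliberate: word splitting merges the two lists
    # and drops any stray leading/trailing whitespace.
    set -- ${OCF_RESKEY_CRM_meta_notify_master_resource:-} \
           ${OCF_RESKEY_CRM_meta_notify_slave_resource:-}
    echo "$*"
}
```

The same pattern would apply to the matching `*_uname` variables if node names are needed as well.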
>>>>>> 
>>>>>> You're right, it should. The pacemaker code that generates the "active"
>>>>>> variables is the same used for "demote" etc., so it seems unlikely the
>>>>>> issue is on pacemaker's side. Especially since your code treats active
>>>>>> etc. differently from demote etc., it seems like it must be in there
>>>>>> somewhere, but I don't see where.
>>>>> 
>>>>> The code treats active, inactive, start and stop together, for any
>>>>> cloned resource. If the resource is multistate, it adds promote, demote,
>>>>> slave and master.
>>>>> 
>>>>> Note that from this piece of code, the 7 other notify vars are set
>>>>> correctly: start, stop, inactive, promote, demote, slave, master. Only
>>>>> active is always missing.
>>>>> 
>>>>> I'll investigate and try to find where the bug is hiding.
>>>> 
>>>> So I added a piece of code to dump **all** the environment variables to
>>>> a temp file as early as possible **to avoid any interaction with our perl
>>>> module** in the code of the RA, i.e.:
>>>> 
>>>>   BEGIN {
>>>>     use Time::HiRes qw(time);
>>>>     # High-resolution timestamp gives each invocation its own dump file
>>>>     my $now = time;
>>>>     open my $fh, ">", "/tmp/test-$now.env.txt"
>>>>       or die "Cannot open env dump file: $!";
>>>>     # One line per environment variable, sorted by name
>>>>     printf($fh "%-20s = ''%s''\n", $_, $ENV{$_}) foreach sort keys %ENV;
>>>>     close $fh;
>>>>   }
>>>> 
>>>> Then I started my cluster and set maintenance-mode=false while no resources
>>>> were running. So the debug files contain the probe action, start on all
>>>> nodes, one promote on the master and the first monitors. The "*active"
>>>> variables are always empty anywhere in the cluster. Attached is the
>>>> result of the following command on the master node:
>>>> 
>>>>   for i in test-*; do echo "= $i ="; grep OCF_ "$i"; done > debug-env.txt
>>>> 
>>>> I'm using Pacemaker 1.1.13-10.el7_2.2-44eb2dd under CentOS 7.2.1511.
>>>> 
>>>> For completeness, I added the Pacemaker configuration I use for my 3 

Re: [ClusterLabs Developers] OCF_RESKEY_CRM_meta_notify_active_* always empty

2016-07-30 Thread Andrew Beekhof
Urgh. I must be confusing it with SLES11.
In any case, the first version of pacemaker was identical to the last heartbeat
crm.

I don't recall the ocfs2 agent changing design while I was there, so SLES11 may
be broken too.

> On 30 Jul 2016, at 8:51 AM, Ken Gaillot wrote:
> 
>> On 07/29/2016 05:41 PM, Andrew Beekhof wrote:
>> 
>> 
>>> On 30 Jul 2016, at 8:32 AM, Ken Gaillot wrote:
>>> 
>>> I finally had time to investigate this, and it definitely is broken.
>>> 
>>> The only existing heartbeat RA to use the *_notify_active_* variables is
>>> Filesystem, and it only does so for OCFS2 on SLES10, which didn't even
>>> ship pacemaker,
>> 
>> I'm pretty sure it did
> 
> All I could find was:
> 
> "SLES 10 did not yet ship pacemaker, but heartbeat with the builtin crm"
> 
> http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022232.html
> 
> I'm sure people were compiling it, and ClusterLabs probably even
> provided a repo, but it looks like SLES didn't ship it.
> 
> The issue is that the code that builds the active list checks for role
> RSC_ROLE_STARTED rather than RSC_ROLE_SLAVE + RSC_ROLE_MASTER, so I
> don't think it ever would have worked.
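
The failure mode described above can be illustrated with a small shell sketch. It is not Pacemaker's code: the `instances` data and the `filter_by_role` helper are hypothetical, and the role names `Started`, `Slave` and `Master` merely stand in for `RSC_ROLE_STARTED`, `RSC_ROLE_SLAVE` and `RSC_ROLE_MASTER`. The point is only that instances of a multistate resource are in the slave or master role, so a filter that matches the started role alone selects nothing.

```shell
# Illustration only: instance roles as "name=role" pairs (hypothetical data
# mirroring the pgsqld example earlier in the thread).
instances="pgsqld:0=Slave pgsqld:1=Master pgsqld:2=Slave"

# Print the names of instances whose role is one of the roles given as
# arguments, space-separated.
filter_by_role() {
    wanted=" $* "
    out=""
    for entry in $instances; do
        name=${entry%=*}   # text before the last "="
        role=${entry#*=}   # text after the first "="
        case "$wanted" in
            *" $role "*) out="${out:+$out }$name" ;;
        esac
    done
    echo "$out"
}

# Matching only the started role selects no multistate instance
broken_active=$(filter_by_role Started)
# Matching the slave and master roles selects every running instance
fixed_active=$(filter_by_role Slave Master)
```

With the sample data, `broken_active` ends up empty while `fixed_active` lists all three instances, which matches the empty 'active' list reported at the start of the thread.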
> 
>> 
>>> so I'm guessing it's been broken from the beginning of
>>> pacemaker.
>>> 
>>> The fix looks straightforward, so I should be able to take care of it soon.
>>> 
>>> Filed bug http://bugs.clusterlabs.org/show_bug.cgi?id=5295