On 2013-03-14 15:52, Fredrik Hudner wrote:
> I set no-quorum-policy to ignore and removed the constraint you mentioned.
> It then managed to failover once to the slave node, but I still have those.
> 
> Failed actions:
> 
>      p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
>   status=complete): not running
> 
>      p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
>   status=complete): not running

This only tells you that a monitor operation on these resources once found
them not running ... the logs should tell you what happened and when.
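Once you have found and fixed the cause in the logs, you can clear those
recorded failures so they disappear from the status output. A sketch, assuming
the crm shell that ships with your pacemaker 1.1.7 (the log path is
distribution-dependent):

```shell
# Clear the recorded monitor failures for the exportfs clone
# (run on either node; adjust the resource name if yours differs):
crm resource cleanup p_exportfs_root

# Then check whether the monitor fails again afterwards
# (log location is an assumption -- use wherever corosync logs on your boxes):
grep -i 'p_exportfs_root.*monitor' /var/log/cluster/corosync.log | tail -20
```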

> 
> I then stopped the new master node to see if it failed over to the other
> node, with no success.. It remains a slave.

Hard to say without seeing the current cluster state, e.g. the output of
"crm_mon -1frA", "cat /proc/drbd" and some logs ... not enough information ...
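For reference, the usual snapshot to collect when debugging a failed failover
looks something like this (a sketch; the log file location varies by
distribution):

```shell
# One-shot cluster status including fail counts, inactive resources
# and node attributes:
crm_mon -1frA

# DRBD connection and disk state for all configured devices:
cat /proc/drbd

# Recent cluster messages around the time of the failed failover
# (path is an assumption for RHEL/CentOS 6 setups):
tail -200 /var/log/cluster/corosync.log
```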

> I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs
> was back in the crm configure. Seems like cib makes a replace

This constraint is added by the DRBD primary if it loses the connection to
its peer, and is perfectly fine if you stopped one node.
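For completeness: that constraint comes from DRBD's crm-fence-peer.sh handler.
For it to be added and cleaned up again automatically, the DRBD resource needs
the fencing hooks configured, roughly like this in global_common.conf -- a
sketch, assuming the stock handler paths shipped with the drbd 8.3/8.4
packages:

```
disk {
    # refuse to be promoted while the peer's data may be ahead
    fencing resource-only;
}
handlers {
    # adds the drbd-fence-by-handler-* constraint on connection loss
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # removes it again once the resync target is up to date
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```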


> Mar 14 15:06:18 [1786] tdtestclu02       crmd:     info:
> abort_transition_graph:        te_update_diff:126 - Triggered transition
> abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) : Non-status
> change
> Mar 14 15:06:18 [1786] tdtestclu02       crmd:   notice:
> do_state_transition:   State transition S_IDLE -> S_POLICY_ENGINE [
> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 14 15:06:18 [1781] tdtestclu02        cib:     info:
> cib_replace_notify:    Replaced: 0.780.39 -> 0.781.1 from tdtestclu01
> 
> So not sure how to remove that constraint on a permanent basis.. it's gone
> as long as I don't stop pacemaker.

Once the DRBD resync has finished it will be removed from the cluster
configuration again automatically ... you typically never need to remove
such drbd-fence constraints manually, only in some rare failure scenarios.
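If you ever do hit one of those rare cases where the constraint sticks around
after a full resync, it can be dropped by its id -- a hedged sketch, using the
constraint name from your output:

```shell
# First verify both sides are UpToDate/UpToDate:
cat /proc/drbd

# Only then remove the leftover fencing constraint by its id:
crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs
```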

Regards,
Andreas


> 
> But it used to work both with no-quorum-policy=freeze and that
> constraint
> 
> Kind regards
> /Fredrik
> 
> 
> 
> On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz <[email protected]> wrote:
> 
>> On 2013-03-14 13:30, Fredrik Hudner wrote:
>>> Hi all,
>>>
>>> I have a problem after I removed a node with the force command from my
>> crm
>>> config.
>>>
>>> Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
>>> pacemaker 1.1.7-6.el6)
>>>
>>>
>>>
>>> Then I wanted to add a third node acting as quorum node, but was not able
>>> to get it to work (probably because I don’t understand how to set it up).
>>>
>>> So I removed the 3rd node, but had to use the force command as crm
>>> complained when I tried to remove it.
>>>
>>>
>>>
>>> Now when I start up Pacemaker the resources don't look like they come
>>> up correctly
>>>
>>>
>>>
>>> Online: [ testclu01 testclu02 ]
>>>
>>>
>>>
>>> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>>>
>>>      Masters: [ testclu01 ]
>>>
>>>      Slaves: [ testclu02 ]
>>>
>>> Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
>>>
>>>      Started: [ tdtestclu01 tdtestclu02 ]
>>>
>>> Resource Group: g_nfs
>>>
>>>      p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01
>>>
>>>      p_fs_shared        (ocf::heartbeat:Filesystem):    Started testclu01
>>>
>>>      p_fs_shared2       (ocf::heartbeat:Filesystem):    Started testclu01
>>>
>>>      p_ip_nfs   (ocf::heartbeat:IPaddr2):       Started testclu01
>>>
>>> Clone Set: cl_exportfs_root [p_exportfs_root]
>>>
>>>      Started: [ testclu01 testclu02 ]
>>>
>>>
>>>
>>> Failed actions:
>>>
>>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
>>> status=complete): not running
>>>
>>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
>>> status=complete): not running
>>>
>>>
>>>
>>> The filesystems mount correctly on the master at this stage and can be
>>> written to.
>>>
>>> When I stop the services on the master node for it to fail over, it
>>> doesn't work.. it loses cluster-ip connectivity
>>
>> fix your "no-quorum-policy", you want to "ignore" the quorum in a
>> two-node cluster to allow failover ... and if your drbd device is
>> already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs
>> constraint.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>>
>>>
>>> Corosync.log from master after I stopped pacemaker on master node :  see
>>> attached file
>>>
>>>
>>>
>>> Additional files (attached): crm configure show
>>>                              Corosync.conf
>>>                              Global_common.conf
>>>
>>>
>>>
>>>
>>>
>>> I’m not sure how to proceed to get it up in a fair state now
>>>
>>> So if anyone could help me it would be much appreciated
>>>
>>>
>>>
>>> Kind regards
>>>
>>> /Fredrik
>>>
>>>
>>>
>>> _______________________________________________
>>> Linux-HA mailing list
>>> [email protected]
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>> See also: http://linux-ha.org/ReportingProblems
>>>
>>
>>
>>
> 


