On 2013-03-20 13:30, Fredrik Hudner wrote:
> I presume you are correct about that. (see drbdadm-dump.txt)
>
> fence-peer /usr/lib/drbd/crm-fence-peer.sh;
> after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;

... to remove the constraint once the secondary is in sync again after a
resync run.

Regards,
Andreas

> What would I need to do to overwrite it?
> Or if you have a nicer way to do it... It's never easy to take over
> someone else's configuration.
>
> Kind regards
> /Fredrik
>
> On Tue, Mar 19, 2013 at 11:32 PM, Andreas Kurz <[email protected]> wrote:
>
>> On 2013-03-19 16:02, Fredrik Hudner wrote:
>>> Just wanted to change what document it's been built from... It should
>>> be "LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6".
>>
>> There is again that fencing constraint in your configuration ... what
>> does "drbdadm dump all" look like? Any chance you only specified a
>> fence-peer handler in your resource configuration but didn't override
>> the after-resync-target handler you specified in your
>> global_common.conf? That would explain the dangling constraint that
>> will prevent a failover.
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> ---------- Forwarded message ----------
>>> From: Fredrik Hudner <[email protected]>
>>> Date: Mon, Mar 18, 2013 at 11:06 AM
>>> Subject: Re: [Linux-HA] Problem promoting slave to master
>>> To: General Linux-HA mailing list <[email protected]>
>>>
>>> On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz <[email protected]> wrote:
>>>
>>>> On 2013-03-14 15:52, Fredrik Hudner wrote:
>>>>> I set no-quorum-policy to ignore and removed the constraint you
>>>>> mentioned.
>>>>> It then managed to fail over once to the slave node, but I still
>>>>> have those:
>>>>>
>>>>> Failed actions:
>>>>>
>>>>> p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
>>>>> status=complete): not running
>>>>>
>>>>> p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
>>>>> status=complete): not running
>>>>
>>>> This only tells you that monitoring of these resources once found
>>>> them not running ... the logs should tell you what happened and when.
>>>
>>> I have attached the logs from master and slave. I can see that it
>>> stops, but not really why (my knowledge is too limited to read the
>>> logs).
>>>
>>>>> I then stopped the new master node to see if it failed over to the
>>>>> other node, with no success. It remains slave.
>>>>
>>>> Hard to say without seeing the current cluster state like a
>>>> "crm_mon -1frA", "cat /proc/drbd" and some logs ... not enough
>>>> information ...
>>>
>>> I have attached the output from crm_mon, the new crm configuration
>>> and /proc/drbd.
>>>
>>>>> I also noticed that the constraint
>>>>> drbd-fence-by-handler-nfs-ms_drbd_nfs was back in the crm
>>>>> configuration. Seems like the cib makes a replace.
>>>>
>>>> This constraint is added by the DRBD primary if it loses connection
>>>> to its peer and is perfectly fine if you stopped one node.
>>>
>>> It seems like the cluster has a problem attaching to the cluster node
>>> IP, but I'm not sure why.
>>>
>>> I would like to add that I took over this configuration from a guy
>>> who has left, but I know that it was configured using the technical
>>> documentation from LINBIT, "Highly available NFS storage with DRBD
>>> and Pacemaker".
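For reference, the per-resource handler override Andreas describes would
look roughly like this in the DRBD resource file. The handler paths are
the ones quoted at the top of this thread; the resource name "nfs" is
inferred from the constraint id drbd-fence-by-handler-nfs-ms_drbd_nfs and
is an assumption, not taken from the actual dump:

    resource nfs {
      handlers {
        fence-peer /usr/lib/drbd/crm-fence-peer.sh;
        # overrides the unsnapshot handler inherited from
        # global_common.conf, so the fencing constraint is
        # removed once the resync has finished:
        after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
      }
      # disk/device/net sections as in the existing configuration
    }

If both the unsnapshot and the unfence behaviour are wanted, a small
wrapper script calling one handler after the other would be needed, since
DRBD runs only a single after-resync-target handler per resource.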
>>>
>>>>> Mar 14 15:06:18 [1786] tdtestclu02 crmd: info:
>>>>> abort_transition_graph: te_update_diff:126 - Triggered transition
>>>>> abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) :
>>>>> Non-status change
>>>>> Mar 14 15:06:18 [1786] tdtestclu02 crmd: notice:
>>>>> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [
>>>>> input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>>>> Mar 14 15:06:18 [1781] tdtestclu02 cib: info:
>>>>> cib_replace_notify: Replaced: 0.780.39 -> 0.781.1 from tdtestclu01
>>>>>
>>>>> So I'm not sure how to remove that constraint on a permanent
>>>>> basis... it's gone as long as I don't stop pacemaker.
>>>>
>>>> Once the DRBD resync is finished it will be removed from the cluster
>>>> configuration again automatically ... you typically never need to
>>>> remove such drbd-fence constraints manually, only in some rare
>>>> failure scenarios.
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>>> But it used to work both with no-quorum-policy=freeze and that
>>>>> constraint.
>>>>>
>>>>> Kind regards
>>>>> /Fredrik
>>>>>
>>>>> On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz <[email protected]> wrote:
>>>>>
>>>>>> On 2013-03-14 13:30, Fredrik Hudner wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a problem after I removed a node with the force command
>>>>>>> from my crm config.
>>>>>>>
>>>>>>> Originally I had 2 nodes running an HA cluster (corosync
>>>>>>> 1.4.1-7.el6, pacemaker 1.1.7-6.el6).
>>>>>>>
>>>>>>> Then I wanted to add a third node acting as a quorum node, but
>>>>>>> was not able to get it to work (probably because I don't
>>>>>>> understand how to set it up).
>>>>>>>
>>>>>>> So I removed the 3rd node, but had to use the force command as
>>>>>>> crm complained when I tried to remove it.
>>>>>>>
>>>>>>> Now when I start up Pacemaker the resources don't look like they
>>>>>>> come up correctly:
>>>>>>>
>>>>>>> Online: [ testclu01 testclu02 ]
>>>>>>>
>>>>>>> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>>>>>>>     Masters: [ testclu01 ]
>>>>>>>     Slaves: [ testclu02 ]
>>>>>>> Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
>>>>>>>     Started: [ tdtestclu01 tdtestclu02 ]
>>>>>>> Resource Group: g_nfs
>>>>>>>     p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01
>>>>>>>     p_fs_shared (ocf::heartbeat:Filesystem): Started testclu01
>>>>>>>     p_fs_shared2 (ocf::heartbeat:Filesystem): Started testclu01
>>>>>>>     p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01
>>>>>>> Clone Set: cl_exportfs_root [p_exportfs_root]
>>>>>>>     Started: [ testclu01 testclu02 ]
>>>>>>>
>>>>>>> Failed actions:
>>>>>>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12,
>>>>>>>     rc=7, status=complete): not running
>>>>>>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12,
>>>>>>>     rc=7, status=complete): not running
>>>>>>>
>>>>>>> The filesystems mount correctly on the master at this stage and
>>>>>>> can be written to.
>>>>>>>
>>>>>>> When I stop the services on the master node for it to fail over,
>>>>>>> it doesn't work... it loses cluster IP connectivity.
>>>>>>
>>>>>> Fix your "no-quorum-policy"; you want to "ignore" the quorum in a
>>>>>> two-node cluster to allow failover ... and if your drbd device is
>>>>>> already in sync, remove that
>>>>>> drbd-fence-by-handler-nfs-ms_drbd_nfs constraint.
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>> --
>>>>>> Need help with Pacemaker?
>>>>>> http://www.hastexo.com/now
>>>>>>
>>>>>>> Corosync.log from master after I stopped pacemaker on the master
>>>>>>> node: see attached file.
>>>>>>>
>>>>>>> Additional files (attached): crm configure show, corosync.conf,
>>>>>>> global_common.conf
>>>>>>>
>>>>>>> I'm not sure how to proceed to get it into a fair state now, so
>>>>>>> if anyone could help me it would be much appreciated.
>>>>>>>
>>>>>>> Kind regards
>>>>>>> /Fredrik
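On the crm shell, the two changes suggested for the immediate failover
problem (ignore quorum in the two-node cluster, and drop the dangling
fencing constraint once DRBD is in sync) would look something like the
following sketch; the constraint id is the one shown in this thread:

    crm configure property no-quorum-policy=ignore
    # only after "cat /proc/drbd" shows Connected and UpToDate/UpToDate:
    crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs

As Andreas notes, deleting the constraint by hand is normally unnecessary,
since a correctly configured after-resync-target handler removes it on its
own after the resync.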
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
