I presume you are correct about that (see drbdadm-dump.txt):

fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;

What would I need to do to overwrite it? Or is there a nicer way to do it?
It's never easy to take over someone else's configuration.
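
My guess, after reading the drbd.conf man page, would be to either adjust the
handler in global_common.conf or override it per resource in nfs.res, roughly
like this (untested, just a guess on my side):

    resource nfs {
        handlers {
            fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh; /usr/lib/drbd/crm-unfence-peer.sh";
        }
        ...
    }

Would that be the right way, or does the handler from global_common.conf still
win in that case?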

Kind regards
/Fredrik

On Tue, Mar 19, 2013 at 11:32 PM, Andreas Kurz <[email protected]> wrote:

> On 2013-03-19 16:02, Fredrik Hudner wrote:
> > Just wanted to change which document it's been built from. It should be
> > "LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6".
>
> There is again that fencing constraint in your configuration ... what
> does "drbdadm dump all" look like? Any chance you only specified a
> fence-peer handler in your resource configuration but didn't overwrite
> the after-resync-target handler you specified in your
> global_common.conf? That would explain the dangling constraint that
> will prevent a failover.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > ---------- Forwarded message ----------
> > From: Fredrik Hudner <[email protected]>
> > Date: Mon, Mar 18, 2013 at 11:06 AM
> > Subject: Re: [Linux-HA] Problem promoting slave to master
> > To: General Linux-HA mailing list <[email protected]>
> >
> >
> >
> >
> > On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz <[email protected]> wrote:
> >
> >> On 2013-03-14 15:52, Fredrik Hudner wrote:
> >>> I set no-quorum-policy to ignore and removed the constraint you
> >>> mentioned.
> >>> It then managed to fail over once to the slave node, but I still have
> >>> these failed actions:
> >>>
> >>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
> >>>      status=complete): not running
> >>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
> >>>      status=complete): not running
> >>
> >> This only tells you that monitoring of these resources once found them
> >> not running ... the logs should tell you what happened and when.
> >>
> >
> > I have attached the logs from master and slave. I can see that it stops,
> > but not really why (my knowledge is too limited to read the logs).
> >
> >>
> >>>
> >>> I then stopped the new master node to see if it would fail over to the
> >>> other node, with no success. It remains slave.
> >>
> >> Hard to say without seeing the current cluster state, e.g. the output of
> >> "crm_mon -1frA" and "cat /proc/drbd", plus some logs ... not enough information ...
> >>
> > I have attached the output of crm_mon, the new crm configuration and
> > /proc/drbd.
> >
> >
> >>> I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs
> >>> was back in the crm configuration. Seems like the cib does a replace.
> >>
> >> This constraint is added by the DRBD primary if it loses connection to
> >> its peer and is perfectly fine if you stopped one node.
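> >>
> >> For reference: the crm-fence-peer.sh handler adds a location constraint
> >> that forbids the Master role on every node other than the one that was
> >> primary at that moment; from memory it looks roughly like this (the id
> >> and node name will differ in your cluster):
> >>
> >>   location drbd-fence-by-handler-nfs-ms_drbd_nfs ms_drbd_nfs \
> >>       rule $role="Master" -inf: #uname ne testclu01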
> >>
> > Seems like the cluster has a problem attaching to the cluster node IP,
> > but I'm not sure why.
> >
> > I would like to add that I took over this configuration from a guy who
> > has left, but I know that it was configured following the LINBIT technical
> > guide "Highly available NFS storage with DRBD and Pacemaker".
> >
> >>
> >>> Mar 14 15:06:18 [1786] tdtestclu02  crmd:  info: abort_transition_graph:
> >>>   te_update_diff:126 - Triggered transition abort (complete=1, tag=diff,
> >>>   id=(null), magic=NA, cib=0.781.1) : Non-status change
> >>> Mar 14 15:06:18 [1786] tdtestclu02  crmd:  notice: do_state_transition:
> >>>   State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> >>>   cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> >>> Mar 14 15:06:18 [1781] tdtestclu02  cib:  info: cib_replace_notify:
> >>>   Replaced: 0.780.39 -> 0.781.1 from tdtestclu01
> >>>
> >>> So I'm not sure how to remove that constraint permanently. It's gone
> >>> as long as I don't stop pacemaker.
> >>
> >> Once the DRBD resync is finished it will be removed from the cluster
> >> configuration again automatically ... you typically never need to remove
> >> such drbd-fence constraints manually, only in some rare failure scenarios.
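> >>
> >> (That automatic removal is done via the after-resync-target handler; when
> >> crm-fence-peer.sh is used as fence-peer it is, as far as I remember,
> >> normally paired with
> >>
> >>   after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
> >>
> >> so the constraint gets dropped again once the resync completes.)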
> >>
> >> Regards,
> >> Andreas
> >>
> >>
> >>>
> >>> But it used to work both with no-quorum-policy=freeze and with that
> >>> constraint.
> >>>
> >>> Kind regards
> >>> /Fredrik
> >>>
> >>>
> >>>
> >>> On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz <[email protected]> wrote:
> >>>
> >>>> On 2013-03-14 13:30, Fredrik Hudner wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I have a problem after I removed a node from my crm config with the
> >>>>> force command.
> >>>>>
> >>>>> Originally I had 2 nodes running an HA cluster (corosync 1.4.1-7.el6,
> >>>>> pacemaker 1.1.7-6.el6).
> >>>>>
> >>>>>
> >>>>>
> >>>>> Then I wanted to add a third node acting as a quorum node, but was not
> >>>>> able to get it to work (probably because I don't understand how to set
> >>>>> it up).
> >>>>>
> >>>>> So I removed the 3rd node, but had to use the force command as crm
> >>>>> complained when I tried to remove it.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Now when I start up Pacemaker the resources don't look like they come
> >>>>> up correctly:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Online: [ testclu01 testclu02 ]
> >>>>>
> >>>>> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
> >>>>>      Masters: [ testclu01 ]
> >>>>>      Slaves: [ testclu02 ]
> >>>>> Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
> >>>>>      Started: [ tdtestclu01 tdtestclu02 ]
> >>>>> Resource Group: g_nfs
> >>>>>      p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01
> >>>>>      p_fs_shared        (ocf::heartbeat:Filesystem):    Started testclu01
> >>>>>      p_fs_shared2       (ocf::heartbeat:Filesystem):    Started testclu01
> >>>>>      p_ip_nfs   (ocf::heartbeat:IPaddr2):       Started testclu01
> >>>>> Clone Set: cl_exportfs_root [p_exportfs_root]
> >>>>>      Started: [ testclu01 testclu02 ]
> >>>>>
> >>>>> Failed actions:
> >>>>>     p_exportfs_root:0_monitor_30000 (node=testclu01, call=12, rc=7,
> >>>>>      status=complete): not running
> >>>>>     p_exportfs_root:1_monitor_30000 (node=testclu02, call=12, rc=7,
> >>>>>      status=complete): not running
> >>>>>
> >>>>>
> >>>>> The filesystems mount correctly on the master at this stage and can be
> >>>>> written to.
> >>>>>
> >>>>> When I stop the services on the master node so it can fail over, it
> >>>>> doesn't work. It loses cluster-IP connectivity.
> >>>>
> >>>> Fix your "no-quorum-policy": you want to "ignore" the quorum in a
> >>>> two-node cluster to allow failover ... and if your drbd device is
> >>>> already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs
> >>>> constraint.
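> >>>>
> >>>> Assuming you use the crm shell, something along these lines should do
> >>>> both (take the exact constraint id from your "crm configure show" output):
> >>>>
> >>>>   crm configure property no-quorum-policy=ignore
> >>>>   crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs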
> >>>>
> >>>> Regards,
> >>>> Andreas
> >>>>
> >>>> --
> >>>> Need help with Pacemaker?
> >>>> http://www.hastexo.com/now
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Corosync.log from master after I stopped pacemaker on the master node:
> >>>>> see attached file.
> >>>>>
> >>>>> Additional files (attached):
> >>>>>   crm configure show
> >>>>>   Corosync.conf
> >>>>>   Global_common.conf
> >>>>>
> >>>>> I'm not sure how to proceed to get it back into a decent state now,
> >>>>> so if anyone could help me it would be much appreciated.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Kind regards
> >>>>>
> >>>>> /Fredrik
> >>>>>
> >>>>>
> >>>>>



-- 
Fredrik Hudner
Grosse Pfahlstr 12
30161 Hannover
Germany

Tel: 0511-642 09 548
Mob: 0173-254 39 29
# /etc/drbd.conf
common {
    net {
        protocol           C;
        verify-alg       sha1;
    }
    disk {
        resync-rate      150M;
    }
    startup {
        wfc-timeout       55;
        degr-wfc-timeout  25;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot 
-f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; 
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot 
-f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; 
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt 
-f";
        fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
        after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }
}

# resource nfs on tdtestclu02: not ignored, not stacked
# defined at /etc/drbd.d/nfs.res:1
resource nfs {
    on tdtestclu01 {
        volume 0 {
            device       /dev/drbd0 minor 0;
            disk         /dev/sdb1;
            meta-disk    internal;
        }
        volume 1 {
            device       /dev/drbd1 minor 1;
            disk         /dev/sdc1;
            meta-disk    internal;
        }
        address          ipv4 10.240.64.21:7790;
    }
    on tdtestclu02 {
        volume 0 {
            device       /dev/drbd0 minor 0;
            disk         /dev/sdb1;
            meta-disk    internal;
        }
        volume 1 {
            device       /dev/drbd1 minor 1;
            disk         /dev/sdc1;
            meta-disk    internal;
        }
        address          ipv4 10.240.64.22:7790;
    }
}

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
