Re: [Linux-HA] Custom resource agent script assistance

Chris Bowlby Fri, 02 Dec 2011 07:20:15 -0800

Hi Andreas, 

I've made the changes you've suggested, and while the grouping is
working nicely, I'm still getting a "not installed" error for DHCP
itself. However, on closer inspection it still looks like it is
attempting to start DHCP on the secondary node. Here is the updated
configuration based on your changes:


node dhcp-vm01 \
        attributes standby="off"
node dhcp-vm02 \
        attributes standby="off"
primitive DHCPFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
        meta target-role="Started"
primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
        params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
        op monitor interval="10s"
primitive dhcpd_service ocf:heartbeat:dhcpd \
        params dhcpd_config="/etc/dhcpd.conf" dhcpd_interface="eth0" \
        op monitor interval="1min" \
        meta target-role="Started"
primitive dhcpdrbd ocf:linbit:drbd \
        params drbd_resource="dhcpdata" \
        op monitor interval="60s"
group g_dhcp DHCPFS dhcp-cluster dhcpd_service
ms DHCPData dhcpdrbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation fs_on_drbd inf: g_dhcp DHCPData:Master
order dhcpfs_after_dhcpdata inf: DHCPData:promote g_dhcp:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

The error in crm_mon remains as follows:

Failed actions:
    dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5, status=complete): not installed
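The `rc` in these failures is the resource agent's exit code, and its meaning is fixed by the OCF resource agent API. A minimal lookup helper makes the mismatch explicit; `ocf_rc_name` is a made-up name for illustration, not a function from ocf-shellfuncs:

```shell
#!/bin/sh
# Hypothetical helper (not part of ocf-shellfuncs): map an OCF resource
# agent exit code to its symbolic name from the OCF RA API.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        8) echo "OCF_RUNNING_MASTER" ;;
        9) echo "OCF_FAILED_MASTER" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The probe returned 5 where Pacemaker expected 7 (resource not active here).
echo "got: $(ocf_rc_name 5), expected: $(ocf_rc_name 7)"
```

Pacemaker treats 5 as a hard error and stops considering that node for the resource, which matches the "Preventing dhcpd_service from re-starting" log lines below.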

And the logs still report:

Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op: dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed) instead of the expected value: 7 (not running)
Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard error - dhcpd_service_monitor_0 failed with rc=5: Preventing dhcpd_service from re-starting on dhcp-vm01
Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op: dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed) instead of the expected value: 7 (not running)
Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard error - dhcpd_service_monitor_0 failed with rc=5: Preventing dhcpd_service from re-starting on dhcp-vm02
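Both failures are on `dhcpd_service_monitor_0`, a monitor with interval 0, which is the probe Pacemaker runs on each node to find out whether the resource is already active there; on a node where it is not, the expected answer is 7 (not running), as the log lines say. In the `dhcpd_validate_all` function quoted further down, an unreadable `${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}` returns OCF_ERR_INSTALLED, and in this setup that file is /var/lib/dhcp/etc/dhcpd.conf, which lives on the DRBD filesystem mounted by DHCPFS. Assuming the agent also runs that validation before a monitor (the attached script is not reproduced here, so this is a guess), any node without the filesystem mounted would answer the probe with 5. A common pattern is to report "not running" for a missing config during a probe and "not installed" only otherwise. Below is a minimal sketch of that idea; `check_chrooted_config` and `is_probe` (a rough stand-in for `ocf_is_probe` from ocf-shellfuncs) are hypothetical names, and only the exit-code values are standard:

```shell
#!/bin/sh
# Standard OCF exit codes (a real agent gets these from ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_NOT_RUNNING=7

# Hypothetical stand-in for ocf_is_probe: a probe is a "monitor" action
# whose interval is 0. __OCF_ACTION mirrors the variable ocf-shellfuncs sets.
is_probe() {
    [ "${__OCF_ACTION:-}" = "monitor" ] && [ "${OCF_RESKEY_CRM_meta_interval:-}" = "0" ]
}

# Hypothetical helper: check the config file inside the chroot, which sits
# on the DRBD-backed filesystem and is absent wherever it is not mounted.
check_chrooted_config() {
    local conf
    conf="${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"
    if [ ! -r "$conf" ]; then
        if is_probe; then
            # Filesystem not mounted on this node, so dhcpd cannot be running here.
            echo "INFO: $conf not readable during probe" >&2
            return $OCF_NOT_RUNNING
        fi
        echo "ERROR: configuration file $conf is not readable" >&2
        return $OCF_ERR_INSTALLED
    fi
    return $OCF_SUCCESS
}
```

Once the agent is changed, the recorded rc=5 failures stay until they are cleaned up, for example with `crm resource cleanup dhcpd_service`. It may also help to repeat the command-line tests on dhcp-vm02 while it is the DRBD secondary, since the manual runs in the original message were made on dhcp-vm01, where the configuration file was readable.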

As a side note, when I configured the group, most of the colocations went
away, and the colocation with DHCPData:Master was updated to use the
group, as was the order statement.

We are getting close, and this gives me confidence that it was not a
major issue in the script itself, but more in my crm configuration.
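Two shell details in the quoted `dhcpd_validate_all` are worth a second look, because the code does not do what it reads as. `if [ ocf_is_probe ]` never calls `ocf_is_probe`: `[` given a single argument only checks that the string is non-empty, so the condition is always true and the `else` branch with "not readable during probe" can never run. Similarly, in `test -n "$path/$conf" -a ! -r ...` the `-n` operand always contains at least the `/`, so only the `! -r` part decides the outcome. The demo below shows the first point, using a hypothetical `never_true` function in place of `ocf_is_probe`:

```shell
#!/bin/sh
# Hypothetical stand-in for a predicate such as ocf_is_probe that
# returns false (non-zero exit status).
never_true() {
    return 1
}

# Bracket form: "[" receives the literal string "never_true", which is
# non-empty, so the test succeeds and the function is never called.
if [ never_true ]; then
    bracket_result="taken"
else
    bracket_result="not taken"
fi

# Bare call: "if" runs the function and branches on its exit status.
if never_true; then
    call_result="taken"
else
    call_result="not taken"
fi

echo "bracket form: $bracket_result"
echo "bare call: $call_result"
```

In the agent the condition would be written `if ocf_is_probe; then`, but with the branches as currently arranged that change alone would send every non-probe call (start included) to the `else` branch and its `return $OCF_ERR_INSTALLED`, so the branches would need reordering as well.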

On Fri, 2011-12-02 at 01:01 +0100, Andreas Kurz wrote:
> Hello Chris,
> 
> On 12/01/2011 06:25 PM, Chris Bowlby wrote:
> > Hi Everyone, 
> > 
> > I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster
> > using the following packages:
> > 
> > SLES 11 SP1, with Pacemaker 1.1.6, corosync 1.4.2, and drbd 8.3.12.
> > 
> > I know about DHCP's internal fail-over abilities, but after testing, it
> > simply failed to remain viable as a more robust HA type cluster. As such
> > I began working on this solution. For reference my current configuration
> > looks like this:
> > 
> > node dhcp-vm01 \
> >         attributes standby="off"
> > node dhcp-vm02 \
> >         attributes standby="on"
> > primitive DHCPFS ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
> >         meta target-role="Started"
> > primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
> >         params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
> >         op monitor interval="10s"
> > primitive dhcpd_service ocf:heartbeat:dhcpd \
> >         params dhcpd_config="/etc/dhcpd.conf" \
> >     dhcpd_interface="eth0" \
> >         op monitor interval="1min" \
> >         meta target-role="Started"
> > primitive dhcpdrbd ocf:linbit:drbd \
> >         params drbd_resource="dhcpdata" \
> >         op monitor interval="60s"
> > ms DHCPData dhcpdrbd \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > colocation dhcpd_service-with_cluster_ip inf: dhcpd_service dhcp-cluster
> > colocation fs_on_drbd inf: DHCPFS DHCPData:Master
> > order DHCP-after-dhcpfs inf: DHCPFS:promote dhcpd_service:start
> > order dhcpfs_after_dhcpdata inf: DHCPData:promote DHCPFS:start
> 
> DHCPFS:promote ?? .. that action will never occur, so dhcpd_service
> will start whenever it likes ... typically not when it should ;-)
> 
> ... remove that :promote ... and you are missing a colocation between
> dhcpd_service and its file system.
> 
> I'd suggest using a group and colocate/order that with DRBD:
> 
> group g_dhcp DHCPFS dhcpd_service dhcp-cluster
> 
> .. or IP before dhcp if it needs to bind to it
> 
> Regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> > property $id="cib-bootstrap-options" \
> >         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
> >         cluster-infrastructure="openais" \
> >         expected-quorum-votes="2" \
> >         stonith-enabled="false" \
> >         no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >         resource-stickiness="100"
> > 
> > The floating IP works without issue, as does the DRBD integration such
> > that if I put a node into standby, the IP, DRBD master/slave and FS
> > mounts all transfer correctly. Only the DHCP component itself is
> > failing, in that it won't start properly from within pacemaker. 
> > 
> > I suspect it is due to having to write a new script, as I could not find
> > an existing DHCPD resource agent anywhere. I built my own based off the
> > development guide for resource agents on the wiki. I've managed to get
> > it to complete all the tests I need it to pass in the ocf-tester script:
> > 
> > ocf-tester -n dhcpd -o monitor_client_interface=eth0 /usr/lib/ocf/resource.d/heartbeat/dhcpd
> > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/dhcpd...
> > * Your agent does not support the notify action (optional)
> > * Your agent does not support the demote action (optional)
> > * Your agent does not support the promote action (optional)
> > * Your agent does not support master/slave (optional)
> > /usr/lib/ocf/resource.d/heartbeat/dhcpd passed all tests
> > 
> > Additionally if I run each of the various options
> > (start/stop/monitor/validate-all/status/meta-data) at the command line,
> > they all work without issue, and stop/start the DHCPD process as
> > expected.
> > 
> > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > root     12516  0.0  0.1   4344   756 pts/3    S+   17:16   0:00 grep dhcp
> > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # /usr/lib/ocf/resource.d/heartbeat/dhcpd start
> > DEBUG: Validating the dhcpd binary exists.
> > DEBUG: Validating that we are running in chrooted mode
> > DEBUG: Chrooted mode is active, testing the chrooted path exists
> > DEBUG: Checking to see if the /var/lib/dhcp//etc/dhcpd.conf exists and is readable
> > DEBUG: Validating the dhcpd user exists
> > DEBUG: Validation complete, everything looks good.
> > DEBUG: Testing the state of the daemon itself
> > DEBUG: OCF_NOT_RUNNING: 7
> > INFO: The dhcpd process is not running
> > Internet Systems Consortium DHCP Server V3.1-ESV
> > Copyright 2004-2010 Internet Systems Consortium.
> > All rights reserved.
> > For info, please visit https://www.isc.org/software/dhcp/
> > WARNING: Host declarations are global.  They are not limited to the scope you declared them in.
> > Not searching LDAP since ldap-server, ldap-port and ldap-base-dn were not specified in the config file
> > Wrote 0 deleted host decls to leases file.
> > Wrote 0 new dynamic host decls to leases file.
> > Wrote 0 leases to leases file.
> > Listening on LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > Sending on   LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > Sending on   Socket/fallback/fallback-net
> > 0
> > INFO: dhcpd [chrooted] has started.
> > DEBUG: Resource Agent Exit Status 0
> > DEBUG: default start returned 0
> > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > dhcpd    12653  0.0  0.2  26636  1164 ?        Ss   17:16   0:00 dhcpd -cf /etc/dhcpd.conf -chroot /var/lib/dhcp -lf /db/dhcpd.leases -user dhcpd -group nogroup -pf /var/run/dhcpd.pid
> > root     12658  0.0  0.1   4344   752 pts/3    S+   17:16   0:00 grep dhcp
> > 
> > However, when I try to do the same from within pacemaker it fails to
> > properly start up and I get the following error (crm_mon):
> > 
> > Failed actions:
> >     dhcpd_service_monitor_0 (node=dhcp-vm01, call=3, rc=5, status=complete): not installed
> >     dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5, status=complete): not installed
> > 
> > After a bit of digging through the syslog log entries, I've tracked down
> > the following lines:
> > 
> > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op: dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed) instead of the expected value: 7 (not running)
> > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard error - dhcpd_service_monitor_0 failed with rc=5: Preventing dhcpd_service from re-starting on dhcp-vm01
> > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op: dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed) instead of the expected value: 7 (not running)
> > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard error - dhcpd_service_monitor_0 failed with rc=5: Preventing dhcpd_service from re-starting on dhcp-vm02
> > 
> > Of which I then took a closer look at the monitor/status and
> > validate-all functions in my script:
> > 
> > # Validate most critical parameters
> > dhcpd_validate_all() {
> >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd} binary exists."
> >     check_binary ${OCF_RESKEY_dhcpd}
> > 
> >     if [ ocf_is_probe ] ; then
> >         ocf_log debug "Validating that we are running in chrooted mode"
> >         if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> >             ocf_log debug "Chrooted mode is active, testing the chrooted path exists"
> >             if ! test -e "${OCF_RESKEY_dhcpd_chrooted_path}"; then
> >                 ocf_log err "Path ${OCF_RESKEY_dhcpd_chrooted_path} does not exist."
> >                 return $OCF_ERR_INSTALLED
> >             fi
> > 
> >             ocf_log debug "Checking to see if the ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} exists and is readable"
> >             if test -n "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}" -a ! -r "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"; then
> >                 ocf_log err "Configuration file ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} doesn't exist"
> >                 return $OCF_ERR_INSTALLED
> >             fi
> >         fi
> >     else
> >         ocf_log info "${OCF_RESKEY_dhcpd_chrooted_path} not readable during probe."
> >         return $OCF_ERR_INSTALLED
> >     fi
> > 
> >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd_user} user exists"
> >     getent passwd ${OCF_RESKEY_dhcpd_user} >/dev/null 2>&1
> >     if ! test $? -eq 0; then
> >         ocf_log err "User ${OCF_RESKEY_dhcpd_user} doesn't exist";
> >         return $OCF_ERR_INSTALLED
> >     fi
> > 
> >     ocf_log debug "Validation complete, everything looks good."
> > 
> >     return $OCF_SUCCESS
> > }
> > 
> > # dhcpd_status. Simple check of the status of dhcpd process by pidfile.
> > dhcpd_status () {
> >     if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> >     else
> >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> >     fi
> > }
> > 
> > # dhcpd_monitor. Send a request to dhcpd and check response.
> > dhcpd_monitor() {
> >     local output
> > 
> >     ocf_log debug "Testing the state of the daemon itself"
> >     ocf_log debug "OCF_NOT_RUNNING: $OCF_NOT_RUNNING"
> >     if ! dhcpd_status
> >     then
> >         ocf_log info "The dhcpd process is not running"
> >         return $OCF_NOT_RUNNING
> >     fi
> > 
> >     return $OCF_SUCCESS
> > }
> > 
> > I see nothing wrong that would tell me it is returning a "not installed"
> > state during the validate or the monitoring phases.
> > 
> > This script is a bit large, and I am attaching it for reference to see
> > if anyone can take a peek and point out anything I am overlooking. The
> > script itself is using the same "concepts" that were defined in the
> > named RA script, and blended with the official RA developers guide. It
> > also borrows some code from the main DHCPD init script that ships with
> > SLES 11. 
> > 
> > The script is not yet finalized in that some extra monitoring elements
> > are "partially" there, but not yet fully worked, and chrooted mode is
> > currently the only mode supported (why would you run a non-chrooted DHCP
> > server?!!?). In addition acknowledgment of original authors is not yet
> > in there, and will be added once I get closer to a more complete script.
> > 
> > Any help would be appreciated, and if additional details are needed, let
> > me know and I will fill in any holes I can.
> > Thanks
> > Chris
> > 
> > 
> > 
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> 
> 

