As an aside, the dhcpd resource is starting on the primary, but it fails to start on the secondary during a failover, so I suspect there is some work I now need to do in the script itself.
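One way to reproduce that failover by hand, assuming the stock crm shell on SLES 11, is to put the active node in standby and watch the resources move:

    # force a failover by standby-ing the node currently running the group
    crm node standby dhcp-vm01

    # one-shot status: the IP, filesystem, DRBD Master and dhcpd should move
    crm_mon -1

    # bring the node back online afterwards
    crm node online dhcp-vm01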
Florian has made a request of me that I'm going to try to fulfill today, and that might make it easier to do.

On Fri, 2011-12-02 at 11:19 -0400, Chris Bowlby wrote:
> Hi Andreas,
>
> I've made the changes you've suggested, and while the grouping is
> working nicely, I'm still getting a "not installed" error for DHCP
> itself. In addition, on closer inspection it still looks like it is
> attempting to start DHCP on the secondary node. Here is the updated
> configuration based on your changes:
>
> node dhcp-vm01 \
>         attributes standby="off"
> node dhcp-vm02 \
>         attributes standby="off"
> primitive DHCPFS ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
>         meta target-role="Started"
> primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
>         params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
>         op monitor interval="10s"
> primitive dhcpd_service ocf:heartbeat:dhcpd \
>         params dhcpd_config="/etc/dhcpd.conf" dhcpd_interface="eth0" \
>         op monitor interval="1min" \
>         meta target-role="Started"
> primitive dhcpdrbd ocf:linbit:drbd \
>         params drbd_resource="dhcpdata" \
>         op monitor interval="60s"
> group g_dhcp DHCPFS dhcp-cluster dhcpd_service
> ms DHCPData dhcpdrbd \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation fs_on_drbd inf: g_dhcp DHCPData:Master
> order dhcpfs_after_dhcpdata inf: DHCPData:promote g_dhcp:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> The error in crm_mon remains as follows:
>
> Failed actions:
>     dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5,
>         status=complete): not installed
>
> And the logs still report:
>
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
>   dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed) instead
>   of the expected value: 7 (not running)
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
>   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
>   dhcpd_service from re-starting on dhcp-vm01
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
>   dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed) instead
>   of the expected value: 7 (not running)
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
>   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
>   dhcpd_service from re-starting on dhcp-vm02
>
> As a side note, when I configured the grouping, most of the colocations
> went away, and the DRBD:Master colocation was updated to use the group,
> as was the order statement.
>
> We are getting close, and this gives me confidence that it was not a
> major issue in the script itself, but more in my crm configuration.
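A note on that remaining probe failure, as a hedged reading on my part rather than a confirmed diagnosis: monitor_0 is the one-time probe Pacemaker runs on every node before it places resources, so it fires on dhcp-vm02 no matter what the colocation with DHCPData:Master says. Whatever the fix in the agent turns out to be, the cached probe results also have to be cleared before the policy engine will re-probe:

    # clear the failed "not installed" probe results so both nodes are re-probed
    crm resource cleanup dhcpd_service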
> On Fri, 2011-12-02 at 01:01 +0100, Andreas Kurz wrote:
> > Hello Chris,
> >
> > On 12/01/2011 06:25 PM, Chris Bowlby wrote:
> > > Hi Everyone,
> > >
> > > I'm in the process of configuring a 2-node, DRBD-backed DHCP cluster
> > > using the following packages:
> > >
> > > SLES 11 SP1, with Pacemaker 1.1.6, corosync 1.4.2, and drbd 8.3.12.
> > >
> > > I know about DHCP's internal failover abilities, but after testing, it
> > > simply failed to remain viable as a more robust HA-type cluster. As
> > > such I began working on this solution. For reference, my current
> > > configuration looks like this:
> > >
> > > node dhcp-vm01 \
> > >         attributes standby="off"
> > > node dhcp-vm02 \
> > >         attributes standby="on"
> > > primitive DHCPFS ocf:heartbeat:Filesystem \
> > >         params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
> > >         meta target-role="Started"
> > > primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
> > >         params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
> > >         op monitor interval="10s"
> > > primitive dhcpd_service ocf:heartbeat:dhcpd \
> > >         params dhcpd_config="/etc/dhcpd.conf" \
> > >         dhcpd_interface="eth0" \
> > >         op monitor interval="1min" \
> > >         meta target-role="Started"
> > > primitive dhcpdrbd ocf:linbit:drbd \
> > >         params drbd_resource="dhcpdata" \
> > >         op monitor interval="60s"
> > > ms DHCPData dhcpdrbd \
> > >         meta master-max="1" master-node-max="1" clone-max="2" \
> > >         clone-node-max="1" notify="true"
> > > colocation dhcpd_service-with_cluster_ip inf: dhcpd_service dhcp-cluster
> > > colocation fs_on_drbd inf: DHCPFS DHCPData:Master
> > > order DHCP-after-dhcpfs inf: DHCPFS:promote dhcpd_service:start
> > > order dhcpfs_after_dhcpdata inf: DHCPData:promote DHCPFS:start
> >
> > DHCPFS:promote?? ... that action will never occur, so dhcpd_service
> > will start whenever it likes ... typically not when it should ;-)
> >
> > ... remove that :promote ... and you are missing a colocation between
> > dhcpd_service and its file system.
> >
> > I'd suggest using a group and colocating/ordering that with DRBD:
> >
> > group g_dhcp DHCPFS dhcpd_service dhcp-cluster
> >
> > ... or the IP before dhcpd if it needs to bind to it.
> >
> > Regards,
> > Andreas
> >
> > --
> > Need help with Pacemaker?
> > http://www.hastexo.com/now
> >
> > > property $id="cib-bootstrap-options" \
> > >         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
> > >         cluster-infrastructure="openais" \
> > >         expected-quorum-votes="2" \
> > >         stonith-enabled="false" \
> > >         no-quorum-policy="ignore"
> > > rsc_defaults $id="rsc-options" \
> > >         resource-stickiness="100"
> > >
> > > The floating IP works without issue, as does the DRBD integration: if
> > > I put a node into standby, the IP, DRBD master/slave roles and FS
> > > mounts all transfer correctly. Only the DHCP component itself is
> > > failing, in that it won't start properly from within Pacemaker.
> > >
> > > I suspect it is because I had to write a new script, as I could not
> > > find an existing DHCPD RA anywhere. I built my own based on the
> > > development guide for resource agents on the wiki, and I've managed
> > > to get it to pass all the tests I need it to in the ocf-tester script:
> > >
> > > ocf-tester -n dhcpd -o monitor_client_interface=eth0 \
> > >     /usr/lib/ocf/resource.d/heartbeat/dhcpd
> > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/dhcpd...
> > > * Your agent does not support the notify action (optional)
> > > * Your agent does not support the demote action (optional)
> > > * Your agent does not support the promote action (optional)
> > > * Your agent does not support master/slave (optional)
> > > /usr/lib/ocf/resource.d/heartbeat/dhcpd passed all tests
> > >
> > > Additionally, if I run each of the actions
> > > (start/stop/monitor/validate-all/status/meta-data) at the command
> > > line, they all work without issue and stop/start the DHCPD process
> > > as expected.
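A cross-check worth doing here, sketched under the assumption that the agent reads its parameters the usual OCF way: the cluster passes parameters as OCF_RESKEY_* environment variables rather than command-line options, so the exact probe the cluster runs can be replayed by hand on the secondary node:

    # replay the cluster's probe on dhcp-vm02 with the params from the CIB
    # (assumes the agent's defaults cover any parameters not set here)
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_dhcpd_config=/etc/dhcpd.conf \
    OCF_RESKEY_dhcpd_interface=eth0 \
    /usr/lib/ocf/resource.d/heartbeat/dhcpd monitor
    echo $?    # a probe on a node where the service is stopped should print 7, not 5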
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > > root  12516  0.0  0.1  4344  756 pts/3  S+  17:16  0:00 grep dhcp
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # /usr/lib/ocf/resource.d/heartbeat/dhcpd start
> > > DEBUG: Validating the dhcpd binary exists.
> > > DEBUG: Validating that we are running in chrooted mode
> > > DEBUG: Chrooted mode is active, testing the chrooted path exists
> > > DEBUG: Checking to see if the /var/lib/dhcp//etc/dhcpd.conf exists and
> > >   is readable
> > > DEBUG: Validating the dhcpd user exists
> > > DEBUG: Validation complete, everything looks good.
> > > DEBUG: Testing the state of the daemon itself
> > > DEBUG: OCF_NOT_RUNNING: 7
> > > INFO: The dhcpd process is not running
> > > Internet Systems Consortium DHCP Server V3.1-ESV
> > > Copyright 2004-2010 Internet Systems Consortium.
> > > All rights reserved.
> > > For info, please visit https://www.isc.org/software/dhcp/
> > > WARNING: Host declarations are global. They are not limited to the
> > >   scope you declared them in.
> > > Not searching LDAP since ldap-server, ldap-port and ldap-base-dn were
> > >   not specified in the config file
> > > Wrote 0 deleted host decls to leases file.
> > > Wrote 0 new dynamic host decls to leases file.
> > > Wrote 0 leases to leases file.
> > > Listening on LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > > Sending on   LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > > Sending on   Socket/fallback/fallback-net
> > > 0
> > > INFO: dhcpd [chrooted] has started.
> > > DEBUG: Resource Agent Exit Status 0
> > > DEBUG: default start returned 0
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > > dhcpd 12653  0.0  0.2  26636 1164 ?  Ss  17:16  0:00 dhcpd
> > >   -cf /etc/dhcpd.conf -chroot /var/lib/dhcp -lf /db/dhcpd.leases
> > >   -user dhcpd -group nogroup -pf /var/run/dhcpd.pid
> > > root  12658  0.0  0.1  4344  752 pts/3  S+  17:16  0:00 grep dhcp
> > >
> > > However, when I try to do the same from within Pacemaker, it fails to
> > > start up properly and I get the following error (crm_mon):
> > >
> > > Failed actions:
> > >     dhcpd_service_monitor_0 (node=dhcp-vm01, call=3, rc=5,
> > >         status=complete): not installed
> > >     dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5,
> > >         status=complete): not installed
> > >
> > > After a bit of digging through the syslog entries, I've tracked down
> > > the following lines:
> > >
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
> > >   dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed)
> > >   instead of the expected value: 7 (not running)
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
> > >   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
> > >   dhcpd_service from re-starting on dhcp-vm01
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
> > >   dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed)
> > >   instead of the expected value: 7 (not running)
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
> > >   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
> > >   dhcpd_service from re-starting on dhcp-vm02
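For reference, these are the standard OCF exit codes (defined in ocf-shellfuncs, nothing specific to this agent). rc=5 is OCF_ERR_INSTALLED, which the policy engine treats as a hard, per-node error, while a probe on a node where the service is simply stopped is expected to return 7:

    OCF_SUCCESS=0           # action succeeded / resource is running
    OCF_ERR_GENERIC=1       # generic soft error
    OCF_ERR_ARGS=2          # invalid arguments
    OCF_ERR_UNIMPLEMENTED=3 # action not implemented
    OCF_ERR_PERM=4          # permission problem (hard error)
    OCF_ERR_INSTALLED=5     # not installed (hard error) <- what the probes return
    OCF_ERR_CONFIGURED=6    # misconfigured (fatal error)
    OCF_NOT_RUNNING=7       # cleanly stopped <- what a probe is expected to return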
> > > Based on that, I took a closer look at the monitor/status and
> > > validate-all functions in my script:
> > >
> > > # Validate the most critical parameters
> > > dhcpd_validate_all() {
> > >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd} binary exists."
> > >     check_binary ${OCF_RESKEY_dhcpd}
> > >
> > >     if [ ocf_is_probe ] ; then
> > >         ocf_log debug "Validating that we are running in chrooted mode"
> > >         if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> > >             ocf_log debug "Chrooted mode is active, testing the chrooted path exists"
> > >             if ! test -e "${OCF_RESKEY_dhcpd_chrooted_path}"; then
> > >                 ocf_log err "Path ${OCF_RESKEY_dhcpd_chrooted_path} does not exist."
> > >                 return $OCF_ERR_INSTALLED
> > >             fi
> > >
> > >             ocf_log debug "Checking to see if the ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} exists and is readable"
> > >             if test -n "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}" \
> > >                 -a ! -r "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"; then
> > >                 ocf_log err "Configuration file ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} doesn't exist"
> > >                 return $OCF_ERR_INSTALLED
> > >             fi
> > >         fi
> > >     else
> > >         ocf_log info "${OCF_RESKEY_dhcpd_chrooted_path} not readable during probe."
> > >         return $OCF_ERR_INSTALLED
> > >     fi
> > >
> > >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd_user} user exists"
> > >     getent passwd ${OCF_RESKEY_dhcpd_user} >/dev/null 2>&1
> > >     if ! test $? -eq 0; then
> > >         ocf_log err "User ${OCF_RESKEY_dhcpd_user} doesn't exist"
> > >         return $OCF_ERR_INSTALLED
> > >     fi
> > >
> > >     ocf_log debug "Validation complete, everything looks good."
> > >
> > >     return $OCF_SUCCESS
> > > }
> > >
> > > # dhcpd_status. Simple check of the dhcpd process by pidfile.
> > > dhcpd_status () {
> > >     if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> > >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> > >     else
> > >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> > >     fi
> > > }
> > >
> > > # dhcpd_monitor. Send a request to dhcpd and check the response.
> > > dhcpd_monitor() {
> > >     local output
> > >
> > >     ocf_log debug "Testing the state of the daemon itself"
> > >     ocf_log debug "OCF_NOT_RUNNING: $OCF_NOT_RUNNING"
> > >     if ! dhcpd_status
> > >     then
> > >         ocf_log info "The dhcpd process is not running"
> > >         return $OCF_NOT_RUNNING
> > >     fi
> > >
> > >     return $OCF_SUCCESS
> > > }
> > >
> > > I see nothing there that would tell me why it returns a "not
> > > installed" state during the validate or monitor phases.
> > >
> > > The script is a bit large, so I am attaching it for reference in case
> > > anyone can take a peek and point out anything I am overlooking. It
> > > uses the same "concepts" as the named RA script, blended with the
> > > official RA developer's guide, and it also borrows some code from
> > > the main DHCPD init script that ships with SLES 11.
> > >
> > > The script is not yet finalized: some extra monitoring elements are
> > > only partially implemented, and chrooted mode is currently the only
> > > mode supported (why would you run a non-chrooted DHCP server?!!?).
> > > Acknowledgment of the original authors is not in there yet either;
> > > it will be added once I get closer to a complete script.
> > >
> > > Any help would be appreciated, and if additional details are needed,
> > > let me know and I will fill in any holes I can.
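Looking at that validate function again, two things now stand out to me; these are hedged guesses on my part, not a confirmed diagnosis. First, the test '[ ocf_is_probe ]' checks the literal string "ocf_is_probe", which is always true, rather than calling the ocf-shellfuncs helper directly (as in 'if ocf_is_probe; then'), so the else branch can never run and probes take the same path as real monitor calls. Second, the chroot path sits on the DRBD-backed filesystem, so during the initial probes (which run before DHCPFS is mounted anywhere) the config-file readability check fails on both nodes and returns $OCF_ERR_INSTALLED: exactly the rc=5 in the logs. A probe-tolerant variant might look like this sketch, reusing the parameter names from the quoted script:

    # sketch only: report "not running" instead of a hard error when the
    # DRBD-backed chroot is not mounted and we are merely being probed
    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    dhcpd_validate_all() {
        check_binary ${OCF_RESKEY_dhcpd}

        local conf="${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"
        if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted} && [ ! -r "$conf" ]; then
            if ocf_is_probe; then
                # the filesystem holding the chroot is not mounted on this
                # node (yet); that means "not running here", not "not installed"
                ocf_log info "$conf not readable during probe, reporting not running"
                return $OCF_NOT_RUNNING
            fi
            ocf_log err "Configuration file $conf doesn't exist or is unreadable"
            return $OCF_ERR_INSTALLED
        fi

        if ! getent passwd ${OCF_RESKEY_dhcpd_user} >/dev/null 2>&1; then
            ocf_log err "User ${OCF_RESKEY_dhcpd_user} doesn't exist"
            return $OCF_ERR_INSTALLED
        fi

        return $OCF_SUCCESS
    }

With something like that in place, the probes should come back as 7 (not running), the hard error disappears, and the group is free to start only where DHCPData is Master.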
> > > Thanks,
> > > Chris

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
