As an aside, the dhcpd resource is starting on the primary, but it fails to start on the secondary during a failover, so I suspect there is some work I now need to do in the script itself.
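One way to reproduce that failover by hand, assuming the stock crm shell on SLES 11, is to put the active node in standby and watch the resources move:

    # force a failover by standby-ing the node currently running the group
    crm node standby dhcp-vm01

    # one-shot status: the IP, filesystem, DRBD Master and dhcpd should move
    crm_mon -1

    # bring the node back online afterwards
    crm node online dhcp-vm01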
Florian has made a request of me that I'm going to try to fulfill today, and that might make it easier to do.

On Fri, 2011-12-02 at 11:19 -0400, Chris Bowlby wrote:
> Hi Andreas,
>
> I've made the changes you've suggested, and while the grouping is
> working nicely, I'm still getting a "not installed" error for DHCP
> itself. In addition, on closer inspection it still looks like it is
> attempting to start DHCP on the secondary node. Here is the updated
> configuration based on your changes:
>
> node dhcp-vm01 \
>         attributes standby="off"
> node dhcp-vm02 \
>         attributes standby="off"
> primitive DHCPFS ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
>         meta target-role="Started"
> primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
>         params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
>         op monitor interval="10s"
> primitive dhcpd_service ocf:heartbeat:dhcpd \
>         params dhcpd_config="/etc/dhcpd.conf" dhcpd_interface="eth0" \
>         op monitor interval="1min" \
>         meta target-role="Started"
> primitive dhcpdrbd ocf:linbit:drbd \
>         params drbd_resource="dhcpdata" \
>         op monitor interval="60s"
> group g_dhcp DHCPFS dhcp-cluster dhcpd_service
> ms DHCPData dhcpdrbd \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation fs_on_drbd inf: g_dhcp DHCPData:Master
> order dhcpfs_after_dhcpdata inf: DHCPData:promote g_dhcp:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> The error in crm_mon remains as follows:
>
> Failed actions:
>     dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5,
>         status=complete): not installed
>
> And the logs still report:
>
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
>   dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed) instead
>   of the expected value: 7 (not running)
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
>   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
>   dhcpd_service from re-starting on dhcp-vm01
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
>   dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed) instead
>   of the expected value: 7 (not running)
> Dec  2 15:11:57 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
>   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
>   dhcpd_service from re-starting on dhcp-vm02
>
> As a side note, when I configured the grouping, most of the colocations
> went away, and the DRBD:Master colocation was updated to use the group,
> as was the order statement.
>
> We are getting close, and this gives me confidence that it was not a
> major issue in the script itself, but more in my crm configuration.
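A note on that remaining probe failure, as a hedged reading on my part rather than a confirmed diagnosis: monitor_0 is the one-time probe Pacemaker runs on every node before it places resources, so it fires on dhcp-vm02 no matter what the colocation with DHCPData:Master says. Whatever the fix in the agent turns out to be, the cached probe results also have to be cleared before the policy engine will re-probe:

    # clear the failed "not installed" probe results so both nodes are re-probed
    crm resource cleanup dhcpd_service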
> On Fri, 2011-12-02 at 01:01 +0100, Andreas Kurz wrote:
> > Hello Chris,
> >
> > On 12/01/2011 06:25 PM, Chris Bowlby wrote:
> > > Hi Everyone,
> > >
> > > I'm in the process of configuring a 2-node, DRBD-backed DHCP cluster
> > > using the following packages:
> > >
> > > SLES 11 SP1, with Pacemaker 1.1.6, corosync 1.4.2, and drbd 8.3.12.
> > >
> > > I know about DHCP's internal failover abilities, but after testing, it
> > > simply failed to remain viable as a more robust HA-type cluster. As
> > > such I began working on this solution. For reference, my current
> > > configuration looks like this:
> > >
> > > node dhcp-vm01 \
> > >         attributes standby="off"
> > > node dhcp-vm02 \
> > >         attributes standby="on"
> > > primitive DHCPFS ocf:heartbeat:Filesystem \
> > >         params device="/dev/drbd1" directory="/var/lib/dhcp" fstype="ext4" \
> > >         meta target-role="Started"
> > > primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
> > >         params ip="xxx.xxx.xxx.xxx" cidr_netmask="32" \
> > >         op monitor interval="10s"
> > > primitive dhcpd_service ocf:heartbeat:dhcpd \
> > >         params dhcpd_config="/etc/dhcpd.conf" \
> > >         dhcpd_interface="eth0" \
> > >         op monitor interval="1min" \
> > >         meta target-role="Started"
> > > primitive dhcpdrbd ocf:linbit:drbd \
> > >         params drbd_resource="dhcpdata" \
> > >         op monitor interval="60s"
> > > ms DHCPData dhcpdrbd \
> > >         meta master-max="1" master-node-max="1" clone-max="2" \
> > >         clone-node-max="1" notify="true"
> > > colocation dhcpd_service-with_cluster_ip inf: dhcpd_service dhcp-cluster
> > > colocation fs_on_drbd inf: DHCPFS DHCPData:Master
> > > order DHCP-after-dhcpfs inf: DHCPFS:promote dhcpd_service:start
> > > order dhcpfs_after_dhcpdata inf: DHCPData:promote DHCPFS:start
> >
> > DHCPFS:promote?? ... that action will never occur, so dhcpd_service
> > will start whenever it likes ... typically not when it should ;-)
> >
> > ... remove that :promote ... and you are missing a colocation between
> > dhcpd_service and its file system.
> >
> > I'd suggest using a group and colocating/ordering that with DRBD:
> >
> > group g_dhcp DHCPFS dhcpd_service dhcp-cluster
> >
> > ... or the IP before dhcpd if it needs to bind to it.
> >
> > Regards,
> > Andreas
> >
> > --
> > Need help with Pacemaker?
> > http://www.hastexo.com/now
> >
> > > property $id="cib-bootstrap-options" \
> > >         dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
> > >         cluster-infrastructure="openais" \
> > >         expected-quorum-votes="2" \
> > >         stonith-enabled="false" \
> > >         no-quorum-policy="ignore"
> > > rsc_defaults $id="rsc-options" \
> > >         resource-stickiness="100"
> > >
> > > The floating IP works without issue, as does the DRBD integration: if
> > > I put a node into standby, the IP, DRBD master/slave roles and FS
> > > mounts all transfer correctly. Only the DHCP component itself is
> > > failing, in that it won't start properly from within Pacemaker.
> > >
> > > I suspect it is because I had to write a new script, as I could not
> > > find an existing DHCPD RA anywhere. I built my own based on the
> > > development guide for resource agents on the wiki, and I've managed
> > > to get it to pass all the tests I need it to in the ocf-tester script:
> > >
> > > ocf-tester -n dhcpd -o monitor_client_interface=eth0 \
> > >     /usr/lib/ocf/resource.d/heartbeat/dhcpd
> > > Beginning tests for /usr/lib/ocf/resource.d/heartbeat/dhcpd...
> > > * Your agent does not support the notify action (optional)
> > > * Your agent does not support the demote action (optional)
> > > * Your agent does not support the promote action (optional)
> > > * Your agent does not support master/slave (optional)
> > > /usr/lib/ocf/resource.d/heartbeat/dhcpd passed all tests
> > >
> > > Additionally, if I run each of the actions
> > > (start/stop/monitor/validate-all/status/meta-data) at the command
> > > line, they all work without issue and stop/start the DHCPD process
> > > as expected.
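A cross-check worth doing here, sketched under the assumption that the agent reads its parameters the usual OCF way: the cluster passes parameters as OCF_RESKEY_* environment variables rather than command-line options, so the exact probe the cluster runs can be replayed by hand on the secondary node:

    # replay the cluster's probe on dhcp-vm02 with the params from the CIB
    # (assumes the agent's defaults cover any parameters not set here)
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_dhcpd_config=/etc/dhcpd.conf \
    OCF_RESKEY_dhcpd_interface=eth0 \
    /usr/lib/ocf/resource.d/heartbeat/dhcpd monitor
    echo $?    # a probe on a node where the service is stopped should print 7, not 5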
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > > root  12516  0.0  0.1  4344  756 pts/3  S+  17:16  0:00 grep dhcp
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # /usr/lib/ocf/resource.d/heartbeat/dhcpd start
> > > DEBUG: Validating the dhcpd binary exists.
> > > DEBUG: Validating that we are running in chrooted mode
> > > DEBUG: Chrooted mode is active, testing the chrooted path exists
> > > DEBUG: Checking to see if the /var/lib/dhcp//etc/dhcpd.conf exists and
> > >   is readable
> > > DEBUG: Validating the dhcpd user exists
> > > DEBUG: Validation complete, everything looks good.
> > > DEBUG: Testing the state of the daemon itself
> > > DEBUG: OCF_NOT_RUNNING: 7
> > > INFO: The dhcpd process is not running
> > > Internet Systems Consortium DHCP Server V3.1-ESV
> > > Copyright 2004-2010 Internet Systems Consortium.
> > > All rights reserved.
> > > For info, please visit https://www.isc.org/software/dhcp/
> > > WARNING: Host declarations are global. They are not limited to the
> > >   scope you declared them in.
> > > Not searching LDAP since ldap-server, ldap-port and ldap-base-dn were
> > >   not specified in the config file
> > > Wrote 0 deleted host decls to leases file.
> > > Wrote 0 new dynamic host decls to leases file.
> > > Wrote 0 leases to leases file.
> > > Listening on LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > > Sending on   LPF/eth0/00:0c:29:d7:64:99/SERVERS
> > > Sending on   Socket/fallback/fallback-net
> > > 0
> > > INFO: dhcpd [chrooted] has started.
> > > DEBUG: Resource Agent Exit Status 0
> > > DEBUG: default start returned 0
> > > dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
> > > dhcpd 12653  0.0  0.2  26636 1164 ?  Ss  17:16  0:00 dhcpd
> > >   -cf /etc/dhcpd.conf -chroot /var/lib/dhcp -lf /db/dhcpd.leases
> > >   -user dhcpd -group nogroup -pf /var/run/dhcpd.pid
> > > root  12658  0.0  0.1  4344  752 pts/3  S+  17:16  0:00 grep dhcp
> > >
> > > However, when I try to do the same from within Pacemaker, it fails to
> > > start up properly and I get the following error (crm_mon):
> > >
> > > Failed actions:
> > >     dhcpd_service_monitor_0 (node=dhcp-vm01, call=3, rc=5,
> > >         status=complete): not installed
> > >     dhcpd_service_monitor_0 (node=dhcp-vm02, call=3, rc=5,
> > >         status=complete): not installed
> > >
> > > After a bit of digging through the syslog entries, I've tracked down
> > > the following lines:
> > >
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
> > >   dhcpd_service_monitor_0 on dhcp-vm01 returned 5 (not installed)
> > >   instead of the expected value: 7 (not running)
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
> > >   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
> > >   dhcpd_service from re-starting on dhcp-vm01
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: debug: unpack_rsc_op:
> > >   dhcpd_service_monitor_0 on dhcp-vm02 returned 5 (not installed)
> > >   instead of the expected value: 7 (not running)
> > > Dec  1 16:21:22 dhcp-vm01 pengine: [31978]: notice: unpack_rsc_op: Hard
> > >   error - dhcpd_service_monitor_0 failed with rc=5: Preventing
> > >   dhcpd_service from re-starting on dhcp-vm02
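For reference, these are the standard OCF exit codes (defined in ocf-shellfuncs, nothing specific to this agent). rc=5 is OCF_ERR_INSTALLED, which the policy engine treats as a hard, per-node error, while a probe on a node where the service is simply stopped is expected to return 7:

    OCF_SUCCESS=0           # action succeeded / resource is running
    OCF_ERR_GENERIC=1       # generic soft error
    OCF_ERR_ARGS=2          # invalid arguments
    OCF_ERR_UNIMPLEMENTED=3 # action not implemented
    OCF_ERR_PERM=4          # permission problem (hard error)
    OCF_ERR_INSTALLED=5     # not installed (hard error) <- what the probes return
    OCF_ERR_CONFIGURED=6    # misconfigured (fatal error)
    OCF_NOT_RUNNING=7       # cleanly stopped <- what a probe is expected to return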
> > > Based on that, I took a closer look at the monitor/status and
> > > validate-all functions in my script:
> > >
> > > # Validate the most critical parameters
> > > dhcpd_validate_all() {
> > >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd} binary exists."
> > >     check_binary ${OCF_RESKEY_dhcpd}
> > >
> > >     if [ ocf_is_probe ] ; then
> > >         ocf_log debug "Validating that we are running in chrooted mode"
> > >         if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> > >             ocf_log debug "Chrooted mode is active, testing the chrooted path exists"
> > >             if ! test -e "${OCF_RESKEY_dhcpd_chrooted_path}"; then
> > >                 ocf_log err "Path ${OCF_RESKEY_dhcpd_chrooted_path} does not exist."
> > >                 return $OCF_ERR_INSTALLED
> > >             fi
> > >
> > >             ocf_log debug "Checking to see if the ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} exists and is readable"
> > >             if test -n "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}" \
> > >                 -a ! -r "${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"; then
> > >                 ocf_log err "Configuration file ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config} doesn't exist"
> > >                 return $OCF_ERR_INSTALLED
> > >             fi
> > >         fi
> > >     else
> > >         ocf_log info "${OCF_RESKEY_dhcpd_chrooted_path} not readable during probe."
> > >         return $OCF_ERR_INSTALLED
> > >     fi
> > >
> > >     ocf_log debug "Validating the ${OCF_RESKEY_dhcpd_user} user exists"
> > >     getent passwd ${OCF_RESKEY_dhcpd_user} >/dev/null 2>&1
> > >     if ! test $? -eq 0; then
> > >         ocf_log err "User ${OCF_RESKEY_dhcpd_user} doesn't exist"
> > >         return $OCF_ERR_INSTALLED
> > >     fi
> > >
> > >     ocf_log debug "Validation complete, everything looks good."
> > >
> > >     return $OCF_SUCCESS
> > > }
> > >
> > > # dhcpd_status. Simple check of the dhcpd process by pidfile.
> > > dhcpd_status () {
> > >     if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted}; then
> > >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> > >     else
> > >         ocf_pidfile_status ${OCF_RESKEY_dhcpd_pidfile} >/dev/null 2>&1
> > >     fi
> > > }
> > >
> > > # dhcpd_monitor. Send a request to dhcpd and check the response.
> > > dhcpd_monitor() {
> > >     local output
> > >
> > >     ocf_log debug "Testing the state of the daemon itself"
> > >     ocf_log debug "OCF_NOT_RUNNING: $OCF_NOT_RUNNING"
> > >     if ! dhcpd_status
> > >     then
> > >         ocf_log info "The dhcpd process is not running"
> > >         return $OCF_NOT_RUNNING
> > >     fi
> > >
> > >     return $OCF_SUCCESS
> > > }
> > >
> > > I see nothing there that would tell me why it returns a "not
> > > installed" state during the validate or monitor phases.
> > >
> > > The script is a bit large, so I am attaching it for reference in case
> > > anyone can take a peek and point out anything I am overlooking. It
> > > uses the same "concepts" as the named RA script, blended with the
> > > official RA developer's guide, and it also borrows some code from
> > > the main DHCPD init script that ships with SLES 11.
> > >
> > > The script is not yet finalized: some extra monitoring elements are
> > > only partially implemented, and chrooted mode is currently the only
> > > mode supported (why would you run a non-chrooted DHCP server?!!?).
> > > Acknowledgment of the original authors is not in there yet either;
> > > it will be added once I get closer to a complete script.
> > >
> > > Any help would be appreciated, and if additional details are needed,
> > > let me know and I will fill in any holes I can.
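Looking at that validate function again, two things now stand out to me; these are hedged guesses on my part, not a confirmed diagnosis. First, the test '[ ocf_is_probe ]' checks the literal string "ocf_is_probe", which is always true, rather than calling the ocf-shellfuncs helper directly (as in 'if ocf_is_probe; then'), so the else branch can never run and probes take the same path as real monitor calls. Second, the chroot path sits on the DRBD-backed filesystem, so during the initial probes (which run before DHCPFS is mounted anywhere) the config-file readability check fails on both nodes and returns $OCF_ERR_INSTALLED: exactly the rc=5 in the logs. A probe-tolerant variant might look like this sketch, reusing the parameter names from the quoted script:

    # sketch only: report "not running" instead of a hard error when the
    # DRBD-backed chroot is not mounted and we are merely being probed
    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    dhcpd_validate_all() {
        check_binary ${OCF_RESKEY_dhcpd}

        local conf="${OCF_RESKEY_dhcpd_chrooted_path}/${OCF_RESKEY_dhcpd_config}"
        if ocf_is_true ${OCF_RESKEY_dhcpd_chrooted} && [ ! -r "$conf" ]; then
            if ocf_is_probe; then
                # the filesystem holding the chroot is not mounted on this
                # node (yet); that means "not running here", not "not installed"
                ocf_log info "$conf not readable during probe, reporting not running"
                return $OCF_NOT_RUNNING
            fi
            ocf_log err "Configuration file $conf doesn't exist or is unreadable"
            return $OCF_ERR_INSTALLED
        fi

        if ! getent passwd ${OCF_RESKEY_dhcpd_user} >/dev/null 2>&1; then
            ocf_log err "User ${OCF_RESKEY_dhcpd_user} doesn't exist"
            return $OCF_ERR_INSTALLED
        fi

        return $OCF_SUCCESS
    }

With something like that in place, the probes should come back as 7 (not running), the hard error disappears, and the group is free to start only where DHCPData is Master.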
> > > Thanks,
> > > Chris

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
