Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew, I can try and fix that if you re-run with -x and paste the output. (apache-03) [~] crm_report -l /var/adm/syslog/2013/08/05 -f 2013-08-04 18:30:00 -t 2013-08-04 19:15 -x + shift + true + [ ! -z ] + break + [ x != x ] + [ x1375633800 != x ] + masterlog= + [ -z ] + log WARNING: The tarball produced by this program may contain + printf %-10s WARNING: The tarball produced by this program may contain\n apache-03: apache-03: WARNING: The tarball produced by this program may contain + log sensitive information such as passwords. + printf %-10s sensitive information such as passwords.\n apache-03: apache-03: sensitive information such as passwords. + log + printf %-10s \n apache-03: apache-03: + log We will attempt to remove such information if you use the + printf %-10s We will attempt to remove such information if you use the\n apache-03: apache-03: We will attempt to remove such information if you use the + log -p option. For example: -p pass.* -p user.* + printf %-10s -p option. For example: -p pass.* -p user.*\n apache-03: apache-03: -p option. For example: -p pass.* -p user.* + log + printf %-10s \n apache-03: apache-03: + log However, doing this may reduce the ability for the recipients + printf %-10s However, doing this may reduce the ability for the recipients\n apache-03: apache-03: However, doing this may reduce the ability for the recipients + log to diagnose issues and generally provide assistance. + printf %-10s to diagnose issues and generally provide assistance.\n apache-03: apache-03: to diagnose issues and generally provide assistance. + log + printf %-10s \n apache-03: apache-03: + log IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE + printf %-10s IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE\n apache-03: apache-03: IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE + log + printf %-10s \n apache-03: apache-03: + [ -z ] + getnodes any + [ -z any ] + cluster=any + [ -z ] + HA_STATE_DIR=/var/lib/heartbeat + find_cluster_cf any + warning Unknown cluster type: any + log WARN: Unknown cluster type: any + printf %-10s WARN: Unknown cluster type: any\n apache-03: apache-03: WARN: Unknown cluster type: any + cluster_cf= + ps -ef + egrep -qs [c]ib + debug Querying CIB for nodes + [ 0 -gt 0 ] + cibadmin -Ql -o nodes + awk /type=normal/ { for( i=1; i=NF; i++ ) if( $i~/^uname=/ ) { sub(uname=.,,$i); sub(\.*,,$i); print $i; next; } } + tr \n + nodes=apache-03 apache-04 + log Calculated node list: apache-03 apache-04 + printf %-10s Calculated node list: apache-03 apache-04 \n apache-03: apache-03: Calculated node list: apache-03 apache-04 + [ -z apache-03 apache-04 ] + echo apache-03 apache-04 + grep -qs apache-03 + debug We are a cluster node + [ 0 -gt 0 ] + [ -z 1375636500 ] + date +%a-%d-%b-%Y + label=pcmk-Wed-07-Aug-2013 + time2str 1375633800 + perl -e use POSIX; print strftime('%x %X',localtime(1375633800)); + time2str 1375636500 + perl -e use POSIX; print strftime('%x %X',localtime(1375636500)); + log Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00) + printf %-10s Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)\n apache-03: apache-03: Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00) + collect_data pcmk-Wed-07-Aug-2013 1375633800 1375636500 + label=pcmk-Wed-07-Aug-2013 + expr 1375633800 - 10 + start=1375633790 + expr 1375636500 + 10 + end=1375636510 + masterlog= + [ x != x ] + l_base=/home/tg/pcmk-Wed-07-Aug-2013 + 
r_base=pcmk-Wed-07-Aug-2013 + [ -e /home/tg/pcmk-Wed-07-Aug-2013 ] + mkdir -p /home/tg/pcmk-Wed-07-Aug-2013 + [ x != x ] + cat + [ apache-03 = apache-03 ] + cat + cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector + bash /home/tg/pcmk-Wed-07-Aug-2013/collector apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path + cat + [ apache-03 = apache-04 ] + cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector + ssh+ -l root -T apache-04 -- mkdir -p pcmk-Wed-07-Aug-2013; cat pcmk-Wed-07-Aug-2013/collector; bash pcmk-Wed-07-Aug-2013/collectorcd /home/tg/pcmk-Wed-07-Aug-2013 + tar xf - apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path tar: This does not look like a tar archive tar: Exiting with failure status due to previous errors + analyze /home/tg/pcmk-Wed-07-Aug-2013 + flist=hostcache members.txt cib.xml crm_mon.txt logd.cf sysinfo.txt + printf Diff hostcache... + ls /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache + echo no
Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes
Hello Andrew,

As I said: The cluster only stops doing this if writing to disk fails at some point - but there would have been an error in your logs if that were the case.

I grepped the logs and found that there was a write error on July 15, and probably none of the changes after that went to disk:

(apache-03) [/var/adm/syslog/2013] grep 'Disk write failed' ??/??/*
07/15/daemon:Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
07/15/daemon:Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug 4 19:03:55 apache-03 cib: [3226]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug 4 19:03:56 apache-04-intern cib: [3197]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0

And it looks like the reason for that was not a bad disk, but a failure in another component:

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - cib admin_epoch=0 epoch=19 num_updates=3
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - group id=nfs
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - primitive id=gcl_fs
Jul 15 17:55:04 apache-03 crmd: [29398]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.20.1) : Non-status change
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - meta_attributes id=gcl_fs-meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - nvpair id=gcl_fs-meta_attributes-target-role name=target-role value=Started __crm_diff_marker__=removed:top /
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /primitive
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - meta_attributes id=nfs-meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - nvpair value=Stopped id=nfs-meta_attributes-target-role /
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /group
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /cib
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + cib epoch=20 num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 crm_feature_set=3.0.6 update-origin=apache-03 update-client=cibadmin cib-last-written=Mon Jul 15 16:02:23 2013 have-quorum=1 dc-uuid=61e8f424-b538-4352-b3fe-955ca853e5fb
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + group id=nfs
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + meta_attributes id=nfs-meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + nvpair id=nfs-meta_attributes-target-role name=target-role value=Started /
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /group
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /cib
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation complete: op cib_replace for section resources (origin=local/cibadmin/2, version=0.20.1): ok (rc=0)
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: validate_cib_digest: Digest comparision failed: expected 976068d203615e656547fdf60190ad16 (/var/lib/heartbeat/crm/cib.b9SItG), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.p3FraX failed! Configuration contents ignored!
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start nfs-common
Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew,

It really helps to read the output of the commands you're running: Did you not see these messages the first time?
apache-03: WARN: Unknown cluster type: any
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
Try adding -H and --logfile {somevalue} next time.

I'll do that and report back (see the sketch below).

An updated pacemaker is the important part. Whether you switch to corosync too is up to you.

I'll do that.

Pacemaker+heartbeat is by far the least tested combination.

What is the best tested combination? Pacemaker and corosync? Any specific version, or should I go with the latest release of both?

Best to poke the debian maintainers

I'll do that as well.

Do you mean See that the monitors _work, then_ take the system out of maintenance mode...? If so, then yes.

Yes, that is what I want to do. :-)

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
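A sketch of the re-run I intend to do, with the log file given explicitly. The exact log path and the destination directory are assumptions based on my syslog layout, and -H is simply the extra switch suggested above:

crm_report -f "2013-08-04 18:30:00" -t "2013-08-04 19:15:00" \
    -H --logfile /var/adm/syslog/2013/08/04/daemon /tmp/pcmk-crash-2013-08-04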
Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes
Hello Andrew,

Any change to the configuration section is automatically written to disk. The cluster only stops doing this if writing to disk fails at some point - but there would have been an error in your logs if that were the case.

Then I do not get it. Yesterday, when the nodes suicided, I lost 24 hours of configuration, so I looked in /var/lib/heartbeat/crm and there was no XML file with my changes. I had changed the configuration many times, but three resource groups were gone:

apache-03-fencing (stonith:external/ipmi): Started apache-04
apache-04-fencing (stonith:external/ipmi): Started apache-03
 Resource Group: routing
     router_ipv4 (ocf::heartbeat:IPaddr2): Started apache-03
     router_ipv6 (ocf::heartbeat:IPv6addr): Started apache-03
     openvpn_ipv4 (ocf::heartbeat:IPaddr2): Started apache-03
     router_ipv6_transfer (ocf::heartbeat:IPv6addr): Started apache-03
     openvpn_glanzmann (ocf::heartbeat:openvpn): Started apache-03
     openvpn_ipxechange (ocf::heartbeat:openvpn): Started apache-03
     openvpn_eclogic (ocf::heartbeat:openvpn): Started apache-03
     openvpn_einwahl (ocf::heartbeat:openvpn): Started apache-03
 Resource Group: nfs
     gcl_fs (ocf::heartbeat:Filesystem): Started apache-04
     nfs-common (ocf::heartbeat:nfs-common): Started apache-04
     nfs-kernel-server (ocf::heartbeat:nfs-kernel-server): Started apache-04
     nfs_ipv4 (ocf::heartbeat:IPaddr2): Started apache-04
 Master/Slave Set: ma-ms-drbd0 [drbd0]
     Masters: [ apache-04 ]
     Slaves: [ apache-03 ]
 Resource Group: apache
     eccar_ipv4 (ocf::heartbeat:IPaddr2): Started apache-04
     apache_loadbalancer (lsb:apache2): Started apache-04
 Master/Slave Set: ma-ms-drbd1 [drbd1]
     Masters: [ apache-04 ]
     Slaves: [ apache-03 ]
 Resource Group: mail
     postfix_fs (ocf::heartbeat:Filesystem): Started apache-04
     postfix_ipv4 (ocf::heartbeat:IPaddr2): Started apache-04
     spamass (lsb:spamass-milter): Started apache-04
     clamav (lsb:clamav-daemon): Started apache-04
     postgrey (lsb:postgrey): Started apache-04
     dovecot (lsb:dovecot): Started apache-04
     postfix (ocf::heartbeat:postfix): Started apache-04

This is my cluster; the mail group was gone, drbd1 was gone, apache was gone and some resources of the routing group were missing. All the changes had been committed in the last 24 hours; after the suicide a grep in /var/lib/heartbeat/crm showed they were not saved. Now I rebooted both nodes and manually exported the configuration to be on the very safe side (see the sketch below). I'll collect the log files and provide them; crm_report doesn't work for me, probably because my syslog location is non-default.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
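For reference, such a manual export can be done along these lines (the target file names are only examples; either command captures the configuration):

cibadmin -Q > /root/cib-backup.xml        # dump the complete live CIB as XML
crm configure show > /root/cluster.crm    # configuration only, in crm shell syntax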
Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes
Hello Andrew,

did they ensure everything was flushed to disk first?

(apache-03) [/var] cat /proc/sys/vm/dirty_expire_centisecs
3000

So dirty data should be flushed within 30 seconds. But I lost at least 24 hours, maybe even more. So it seems that pacemaker / heartbeat does not persist changes when I change the config, which is strange, but I'll try to reproduce that in a lab, too.

that's not where recent versions of pacemaker keep the cib by default. check /var/lib/pacemaker/cib too

The directory does not exist. I'll provide you with the logs this evening.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew,

You will need to run crm_report and email us the resulting tarball. This will include the version of the software you're running and log files (both system and cluster) - without which we can't do anything.

Find the files here: https://thomas.glanzmann.de/tmp/linux_ha_crash.2013-08-05.tar.gz

I packaged it manually because the crm_report output was empty. If I forgot something, please let me know. I included the daemon syslog output from both nodes from the central syslog server, the crm file, the ha.cf (which is the same on both nodes) and the /var/lib/heartbeat directory, which seems to keep all files, from the first node.

The reason for the crash in unmanaged mode seems to be the same as before:

Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL

Probably I should update. But why the config got lost, I have no idea.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Ulrich,

Did it happen when you put the cluster into maintenance-mode, or did it happen after someone fiddled with the resources manually? Or did it happen when you turned maintenance-mode off again?

I did not remember, but I checked the log files, and yes, I did a config change (I removed apache_loadbalancer from group apache). That is probably the reason I could not reproduce it in my lab environment: I never tried to fiddle with the config afterwards. Probably the way to reproduce it is: put the cluster into maintenance-mode, then change something in the config, and it crashes. But I have to verify that in my lab and report back. I'll do that right now and report back.

...
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + configuration
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + crm_config
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + cluster_property_set id=cib-bootstrap-options
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + nvpair id=cib-bootstrap-options-maintenance-mode name=maintenance-mode value=true __crm_diff_marker__=added:top /
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /cluster_property_set
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /crm_config
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /configuration
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /cib
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - cib admin_epoch=0 epoch=94 num_updates=100
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - configuration
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - resources
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - group id=apache
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - primitive class=ocf id=apache_loadbalancer provider=heartbeat type=apachetg __crm_diff_marker__=removed:top
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - operations
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - op id=apache_loadbalancer-monitor-60s interval=60s name=monitor /
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /operations
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /primitive
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /group
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /resources
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /configuration
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /cib
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + cib epoch=95 num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 crm_feature_set=3.0.6 update-origin=apache-03 update-client=cibadmin cib-last-written=Sun Aug 4 18:49:18 2013 have-quorum=1 dc-uuid=61e8f424-b538-4352-b3fe-955ca853e5fb
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + configuration
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + resources
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + primitive class=ocf id=apache_loadbalancer provider=heartbeat type=apachetg __crm_diff_marker__=added:top
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + operations
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + op id=apache_loadbalancer-monitor-60s interval=60s name=monitor /
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /operations
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /primitive
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /resources
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /configuration
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /cib
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core

Complete syslog is in my other e-mail I just sent to Alan, if you want to check it.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew,

I just got another crash when putting a node into unmanaged mode, and this time it hit me hard:

- Both nodes suicided or stonithed each other
- One out of four md devices was detected on both nodes after the reset.
- Half of the config was gone.

Could you help me get to the bottom of this? This was on Debian Wheezy.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes
Hello,

both nodes of my HA cluster just panicked, and afterwards the config was gone. Is there a command to force heartbeat / pacemaker to write the config to disk, or do I need to restart heartbeat for persistent changes? The config had been on the node for at least 24 hours, but I did not restart heartbeat on the node. Or should I always (which I now start doing again) do manual backups such as:

sudo crm configure show > cluster.crm

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
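One way to automate such a backup, as a sketch only; the path, schedule and the assumption that the crm shell lives in /usr/sbin/crm are examples:

# /etc/cron.d/crm-backup - hourly dump of the cluster configuration
0 * * * *  root  /usr/sbin/crm configure show > /var/backups/cluster-$(date +\%Y\%m\%d-\%H).crm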
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
Hello Andrew,

If you include a crm_report for the scenario you're describing, I can take a look. The config alone does not contain enough information.

I tried to reproduce that on a Debian Wheezy (7.0) in my lab environment and was unable to do so. I'll soon set up multiple other platforms and will collect a crm_report if I trigger it again, and post it. This is the second problem I was unable to reproduce in my lab environment. Very frustrating.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] custom script status)
Hello Mitsuo,

from the output you sent, you should update, because your heartbeat version looks very, very ancient to me. A resource script for heartbeat always needs at least these 5 operations:

#!/bin/bash
. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
export PID=/var/run/postgrey.pid
export INITSCRIPT=/etc/init.d/postgrey
case $1 in
start)
        ${INITSCRIPT} start >/dev/null && exit 0 || exit 1;
        ;;
stop)
        ${INITSCRIPT} stop >/dev/null && exit 0 || exit 1;
        ;;
status)
        if [ -f ${PID} ]; then kill -0 `cat ${PID}` >/dev/null && exit 0; fi
        exit 1;
        ;;
monitor)
        if [ -f ${PID} ]; then kill -0 `cat ${PID}` >/dev/null && { exit 0; } fi
        exit 7;
        ;;
meta-data)
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="postgrey">
<version>1.0</version>
<longdesc lang="en">
OCF Resource Agent for postgrey.
</longdesc>
<shortdesc lang="en">OCF Resource Agent for postgrey.</shortdesc>
<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac

So start should return 0 when the resource was successfully started or is already running, otherwise 1. Stop should return 0 when the resource was successfully stopped or is already stopped, otherwise 1. Status should return 0 if the resource is running, otherwise 1. Monitor should check whether the resource is working properly and return 0 on success and 7 on failure. Meta-data just returns the actions, optional parameters, default timeouts, intervals and monitoring delays.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew,

Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
This is the cause of the coredump. What version of pacemaker is this?

1.1.7-1

Installing pacemaker's debug symbols would also make the stack trace more useful.

I'll do that and will get back to you. I tried to reproduce the issue in my lab by installing two Debian Wheezy VMs and reconstructing the network and ha config, but was unable to do so. What puzzles me is that the issue showed up multiple times (at least 3 times) on the production system.

Rolf, could you please do an apt-get install pacemaker-dev and see if the backtrace reveals a little bit more?

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Does drbd need re-start after configuration change ?
Hello Fredrik, * Fredrik Hudner fredrik.hud...@gmail.com [2013-06-07 14:03]: Been trying to figure out if drbd which is monitored by HA, needs a restart if you do a configuration change in global_common.conf? http://www.drbd.org/users-guide/s-reconfigure.html So you need to issue a 'drbdadm adjust resource'. Cheers, Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
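A minimal sketch of the sequence, assuming a DRBD resource named r0 (r0 is only an example name; use your real resource name or all):

# after editing global_common.conf (or the resource file)
drbdadm dump r0        # optional: check that the new configuration parses
drbdadm adjust r0      # apply the changed settings to the running resource
# or simply: drbdadm adjust all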
Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello Andrew,

Installing pacemaker's debug symbols would also make the stack trace more useful.

We tried to install heartbeat-dev to see more, but there are no debugging symbols available. I also tried to reproduce the issue with a 64-bit Debian Wheezy (I had used 32-bit before) and was not able to reproduce it. However, in the near future I'll set up 6 more Linux HA clusters using Debian Wheezy, and I'll report back if the issue happens to me again. On the system where I can reproduce the problem I'll not do any more experiments, because it is about to go into production and, except for the maintenance part, everything works perfectly fine.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] custom script status)
Hello Mitsuo,

3.0.4-1.el6

From the version I see that you're running RHEL 6. RHEL uses corosync or cman, but not heartbeat, as the messaging bus between the nodes. You can follow this guide and the links in it: http://clusterlabs.org/quickstart-redhat.html

What is annoying from my point of view is that, if I understood Andrew's blog correctly, Red Hat has removed the crm shell, so you have to use pcs. Personally I prefer heartbeat and pacemaker, but with Red Hat that is a challenge: you could use the EPEL repositories, but they're incompatible with the pacemaker shipped by Red Hat, so you end up compiling it yourself. Also, two years back I set up a cluster for Siemens and noticed the limitations of corosync: at that time it could only handle two heartbeat links, but hopefully they have fixed that by now. I never tried cman with anything other than ricci and luci (the old RHEL cluster stack).

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
Hello,

on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When I put multiple filesystems that depend on multiple DRBD promotions into one group, only the first DRBD is promoted and the group never comes up. However, when the order constraints reference not the individual filesystems but the group (or probably any single entity), all DRBDs are promoted correctly. To summarize:

This only promotes the first drbd and the resource group never starts:

group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip
order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start
order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start
# ~~

This works:

group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip
order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start
order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start
# ~~

I would like to know if that is supposed to happen, and if so, why. I assume it is a bug, but I'm not sure. Complete working config here:

primitive astorage_ip ocf:heartbeat:IPaddr2 \
        params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \
        op monitor interval=60s
primitive astorage1-fencing stonith:external/ipmi \
        params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \
        op monitor interval=60s
primitive astorage2-fencing stonith:external/ipmi \
        params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \
        op monitor interval=60s
primitive astorage_16_ip ocf:heartbeat:IPaddr2 \
        params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \
        op monitor interval=60s
primitive drbd10 ocf:linbit:drbd \
        params drbd_resource=r10 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd10_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \
        op monitor interval=60s
primitive drbd3 ocf:linbit:drbd \
        params drbd_resource=r3 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd4 ocf:linbit:drbd \
        params drbd_resource=r4 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd5 ocf:linbit:drbd \
        params drbd_resource=r5 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd5_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \
        op monitor interval=60s
primitive drbd6 ocf:linbit:drbd \
        params drbd_resource=r6 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd8 ocf:linbit:drbd \
        params drbd_resource=r8 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd8_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \
        op monitor interval=60s
primitive drbd9 ocf:linbit:drbd \
        params drbd_resource=r9 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive drbd9_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd9 directory=/exports fstype=ext4 \
        op monitor interval=60s
primitive nfs-common ocf:heartbeat:nfs-common \
        op monitor interval=60s
primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \
        op monitor interval=60s
primitive target ocf:heartbeat:target \
        op monitor interval=60s
group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip astorage_16_ip target \
        meta target-role=Started
ms ma-ms-drbd10 drbd10 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd3 drbd3 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd4 drbd4 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd5 drbd5 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd6 drbd6 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd8 drbd8 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd9 drbd9 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
location astorage1-fencing-placement astorage1-fencing -inf: astorage1
location astorage2-fencing-placement astorage2-fencing -inf: astorage2
location cli-standby-astorage
[Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducibly when putting it into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Hello,

over the last couple of days I set up an active/passive NFS server and iSCSI storage using drbd, pacemaker, heartbeat, LIO and the NFS kernel server. While testing the cluster I was often setting it to unmanaged using:

crm configure property maintenance-mode=true

Sometimes when I did that, both nodes, or just the standby node, suicided because /usr/lib/heartbeat/crmd was crashing. I can reproduce the problem easily. It even happened to me with a two node cluster having no resources at all. If you need more information, drop me an e-mail. Highlights of the log:

Jun 6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_POLICY_ENGINE - S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
Jun 6 10:17:37 astorage1 heartbeat: [2863]: WARN: Managed /usr/lib/heartbeat/crmd process 2947 killed by signal 11 [SIGSEGV - Segmentation violation].
Jun 6 10:17:37 astorage1 ccm: [2942]: info: client (pid=2947) removed from ccm
Jun 6 10:17:37 astorage1 heartbeat: [2863]: ERROR: Managed /usr/lib/heartbeat/crmd process 2947 dumped core
Jun 6 10:17:37 astorage1 heartbeat: [2863]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd

See the log:

Jun 6 10:17:22 astorage1 crmd: [2947]: info: do_election_count_vote: Election 4 (owner: 56adf229-a1a7-4484-8f18-742ddce19db8) lost: vote from astorage2 (Uptime)
Jun 6 10:17:22 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_NOT_DC - S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Jun 6 10:17:27 astorage1 crmd: [2947]: info: update_dc: Set DC to astorage2 (3.0.6)
Jun 6 10:17:28 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=astorage2/crmd/210, version=0.9.18): ok (rc=0)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 6 10:17:28 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_PENDING - S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd3:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd10:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd8:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd6:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd5:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd9:0 (1)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd4:0 (1)
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[35] on astorage2-fencing for client 2947, its parameters: hostname=[astorage2] userid=[ADMIN] CRM_meta_timeout=[2] CRM_meta_name=[monitor] passwd=[ADMIN] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[6] cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation astorage2-fencing_monitor_6 (call=35, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[36] on drbd10:0 for client 2947, its parameters: drbd_resource=[r10] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd10:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[2] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd10:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resour cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd10:0_monitor_31000 (call=36, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[37] on drbd3:0 for client 2947, its parameters: drbd_resource=[r3] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd3:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[2] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd3:0 ]
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
Hello Emmanuel,

* emmanuel segura emi2f...@gmail.com [2013-06-06 11:12]:
order drbd_fs_after_drbd inf: ma-ms-drbd5:promote ma-ms-drbd8:promote astorage:start

I can see that you promote multiple drbds in one line. My config where I promote them individually also works. However, my question was why it is not possible to promote on a per-filesystem basis when there are multiple drbd promotions in one group.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How to fix ERROR: Cannot chdir to [/var/lib/heartbeat/cores/hacluster]: Permission denied?
Hello Shuwen,

What functionality of dir /var/lib/heartbeat/cores/hacluster?

If a component of heartbeat crashes, the core files are kept in this directory so a post mortem analysis of the problem can be done.

How to fix this error print? What is your advice?

Fix the permissions. The directory should be owned by the heartbeat user:

chown <heartbeat user> /var/lib/heartbeat/cores/hacluster

For me on Debian that is:

chown hacluster /var/lib/heartbeat/cores/hacluster

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failed actions
Hello Andrew,

In this case, it is the initial monitor (the one that tells pacemaker what state the service is in before we try to start anything) that is failing. For the ones returning rc=1, it looks like something was wrong but the cluster was able to clean them up (by running stop) and start them again.

I see, thanks.

crm resource cleanup all

That should work.

I'll try that and report back.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello,

ipv6addr=2600:3c00::0034:c007

From the manpage of ocf_heartbeat_IPv6addr it looks like you have to specify the netmask, so try:

ipv6addr=2600:3c00::0034:c007/64

assuming that you're in a /64.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello Nick, Thanks for the tip, however, it did not work. That's actually a /116. So I put in 2600:3c00::0034:c007/116 and am getting the same error. I requested that it restart the resource as well, just to make sure it wasn't the previous error. now, I had to try it: node $id=9d9b62d2-405d-459a-a724-cb2643d7d9a1 node-62 primitive ipv6test ocf:heartbeat:IPv6addr \ params ipv6addr=2a01:4f8:bb:400::2/64 \ op monitor interval=15 timeout=15 \ meta target-role=Started property $id=cib-bootstrap-options \ dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \ cluster-infrastructure=Heartbeat \ stonith-enabled=false And it works: (node-62) [~] ifconfig eth0 Link encap:Ethernet HWaddr 00:25:90:97:db:b0 inet addr:10.100.4.62 Bcast:10.100.255.255 Mask:255.255.0.0 inet6 addr: 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 Scope:Global inet6 addr: fe80::225:90ff:fe97:dbb0/64 Scope:Link inet6 addr: 2a01:4f8:bb:400::2/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:40345 errors:0 dropped:0 overruns:0 frame:0 TX packets:10270 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:52540127 (50.1 MiB) TX bytes:1127817 (1.0 MiB) Memory:fb58-fb60 (infra) [~] traceroute 2a01:4f8:bb:400::2 traceroute to 2a01:4f8:bb:400::2 (2a01:4f8:bb:400::2), 30 hops max, 80 byte packets 1 merlin.glanzmann.de (2a01:4f8:bb:4ff::1) 1.413 ms 1.550 ms 1.791 ms 2 2a01:4f8:bb:400::2 (2a01:4f8:bb:400::2) 0.204 ms 0.202 ms 0.270 ms Cheers, Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello Nick, Anything I need to do to allow IPv6... or something? I agree with Greg here. Have you tried setting the address manually? ip -6 addr add ip/cidr dev eth0 ip -6 addr show dev eth0 ip -6 addr del ip/cidr dev eth0 ip -6 addr show dev eth0 (node-62) [~] ip -6 addr add 2a01:4f8:bb:400::3/64 dev eth0 (node-62) [~] ip -6 addr show dev eth0 2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000 inet6 2a01:4f8:bb:400::3/64 scope global valid_lft forever preferred_lft forever inet6 2a01:4f8:bb:400::2/64 scope global valid_lft forever preferred_lft forever inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic valid_lft 2591998sec preferred_lft 604798sec inet6 fe80::225:90ff:fe97:dbb0/64 scope link valid_lft forever preferred_lft forever (node-62) [~] ip -6 addr del 2a01:4f8:bb:400::3/64 dev eth0 (node-62) [~] ip -6 addr show dev eth0 2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000 inet6 2a01:4f8:bb:400::2/64 scope global valid_lft forever preferred_lft forever inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic valid_lft 2591990sec preferred_lft 604790sec inet6 fe80::225:90ff:fe97:dbb0/64 scope link valid_lft forever preferred_lft forever Do you see a link local address on your eth0? A link local address is one that starts with fe80:: otherwise try loading the ipv6 module: modprobe ipv6 # Don't know if that is the right module name, all my # kernels have ipv6 build in (Debian wheezy / squeeze / backports) Cheers, Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello Nick,

I shouldn't be able to do that if the IPv6 module wasn't loaded, correct?

That is correct. I tried modifying my netmask to copy yours, and I get the same error you do:

ipv6test_start_0 (node=node-62, call=6, rc=1, status=complete): unknown error

So probably a bug in the resource agent. Manually adding and removing works:

(node-62) [~] ip -6 addr add 2a01:4f8:bb:400::2/116 dev eth0
(node-62) [~] ip -6 addr show dev eth0
2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000
    inet6 2a01:4f8:bb:400::2/116 scope global
       valid_lft forever preferred_lft forever
    inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic
       valid_lft 2591887sec preferred_lft 604687sec
    inet6 fe80::225:90ff:fe97:dbb0/64 scope link
       valid_lft forever preferred_lft forever
(node-62) [~] ip -6 addr del 2a01:4f8:bb:400::2/116 dev eth0

Nick, you can do the following things to resolve this:

- Hunt down the bug and fix it, or let someone else do it for you (see the sketch below for running the agent by hand)
- Use another netmask, if possible (fighting the symptoms instead of resolving the root cause)
- Write your own resource agent (fighting the symptoms instead of resolving the root cause)

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
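To hunt down the bug it usually helps to run the resource agent by hand with the same parameters the cluster passes in. A minimal sketch; the address is Nick's and is only an example, and ocf-tester is only available if your resource-agents/cluster-glue package ships it:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_ipv6addr=2600:3c00::0034:c007/116
/usr/lib/ocf/resource.d/heartbeat/IPv6addr start; echo $?    # should print 0 on success
/usr/lib/ocf/resource.d/heartbeat/IPv6addr stop; echo $?
# or, if available:
# ocf-tester -n ipv6test -o ipv6addr=2600:3c00::0034:c007/116 /usr/lib/ocf/resource.d/heartbeat/IPv6addr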
[Linux-HA] Failed actions
Hello,

I have an openais installation on CentOS which has logged failed actions, but the services appear to be 'started'. As far as I know heartbeat/pacemaker, if an action fails the service should not be started. I also have a system on Debian squeeze that stops the service when a monitor action for IPMI has failed. But I also remember that a few years back the ability was added to retry failed actions after a configurable time; I never used that.

From the output of crm_mon -i 1 -r I assume that the fence_ipmilan agents are running but that there are some failed actions. Can I clean them up the old way using

crm_resource -C -r fence-astore1 -H astorage2
crm_resource -C -r fence-astore2 -H astorage1

or

crm resource cleanup all

?

The output of crm_mon is here: http://pbot.rmdir.de/Qux4BaurFOUOYLfzqJNcfQ
The crm config is here: http://thomas.glanzmann.de/tmp/crm_config.txt
DRBD config is here: http://thomas.glanzmann.de/.www/tmp/drbd.txt

Also I would like to get some feedback on the config. I think the following configuration errors were made:

- Stonith and quorum are disabled
- Promote and colocation constraints for the drbd resources and fs-storage are missing
- The peer outdater for drbd is missing, and suicide is the wrong approach for the task at hand.

Cheers, Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] stonith failed to start
Hello Terry,

What would cause the stonith 'start' operation to fail after it initially had succeeded?

If my understanding is correct (I wrote a stonith agent for vSphere yesterday), it runs the status command of the stonith agent and looks at the exit status, like this:

(ha-01) [~] VI_SERVER=esx-03.glanzmann.de VI_USERNAME=root /usr/lib/stonith/plugins/external/vsphere status; echo $?
Enter password:
0

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
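For reference, an external stonith plugin is just a script that answers a handful of subcommands with exit status 0 on success. Roughly, from memory; details may vary between cluster-glue versions, and the parameter names are whatever your plugin defines:

#!/bin/sh
# Rough skeleton of an external/* stonith plugin. Configured parameters
# (e.g. hostname, ipaddr, userid, passwd) arrive as environment variables;
# for on/off/reset the victim node name is passed as the second argument.
case "$1" in
    gethosts)        echo "$hostname"; exit 0 ;;            # nodes this device can fence
    status)          exit 0 ;;                              # placeholder: can we reach the power device?
    on|off|reset)    exit 0 ;;                              # placeholder: actually switch the victim ($2) here
    getconfignames)  echo "hostname ipaddr userid passwd"; exit 0 ;;
    getinfo-devid|getinfo-devname)   echo "example stonith device"; exit 0 ;;
    getinfo-devdescr|getinfo-devurl) echo "example"; exit 0 ;;
    getinfo-xml)     echo "<parameters/>"; exit 0 ;;
    *)               exit 1 ;;
esac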
[Linux-HA] vsphere stonith, squid3 agent for debian lenny and example configuration
Hello, find attached, a vsphere (works with esx-server 3/4 virual center 2.X and 4) stonith plugin, a squid3 resource agent for debian lenny and a example configuration. Thomas use_logd yes bcast eth0 node ha-01 ha-02 watchdog /dev/watchdog crm on #!/bin/sh if [ -z ${OCF_ROOT} ]; then export OCF_ROOT=/usr/lib/ocf/ fi . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs SQUID_PORT=3128 INIT_SCRIPT=/etc/init.d/squid3 PID=/var/run/squid3.pid CHECK_URLS=http://www.google.de/ http://www.glanzmann.de/ http://www.uni-erlangen.de; case $1 in start) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit 0; else rm -f ${PID} fi ${INIT_SCRIPT} start /dev/null 21 exit || exit 1 ;; stop) if [ -f ${PID} ]; then ${INIT_SCRIPT} stop /dev/null 21 exit || exit 1 fi exit 0; ;; status) if [ -f ${PID} ]; then kill -0 `cat ${PID}` exit; fi exit 1; ;; monitor) if [ -f ${PID} ]; then kill -0 `cat ${PID}` || exit 7 else exit 7; fi for URL in ${CHECK_URLS}; do http_proxy=http://localhost:${SQUID_PORT}/ wget -o /dev/null -O /dev/null -T 1 -t 1 ${URL} exit done exit 1; ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=squid version1.0/version longdesc lang=en OCF Ressource Agent on top of squid init script shipped with debian. /longdesc shortdesc lang=enOCF Ressource Agent on top of squid init script shipped with debian./shortdesc actions action name=start timeout=90 / action name=stop timeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END ;; esac #!/usr/bin/perl use strict; use warnings FATAL = 'all'; # Thomas Glanzmann 10:28 09-08-19 # apt-get install libarchive-zip-perl libclass-methodmaker-perl libcompress-raw-zlib-perl libcompress-zlib-perl libcompress-zlib-perl libdata-dump-perl libio-compress-base-perl libio-compress-zlib-perl libsoap-lite-perl liburi-perl libuuid-perl libxml-libxml-perl libxml-libxml-common-perl libxml-namespacesupport-perl libwww-perl # tar xfz ~/VMware-vSphere-SDK-for-Perl-4.0.0-161974.i386.tar.gz # answer all questins with no use lib '/usr/lib/vmware-vcli/apps/'; use VMware::VIRuntime; use AppUtil::VMUtil; sub connect { Opts::parse(); Opts::validate(); Util::connect(); } my $vm_views = undef; sub poweron_vm { foreach (@$vm_views) { my $mor_host = $_-runtime-host; my $hostname = Vim::get_view(mo_ref = $mor_host)-name; eval { $_-PowerOnVM(); Util::trace(0, \nvirtual machine ' . $_-name . ' under host $hostname powered on \n); }; if ($@) { if (ref($@) eq 'SoapFault') { Util::trace (0, \nError in ' . $_-name . ' under host $hostname: ); if (ref($...@-detail) eq 'NotSupported') { Util::trace(0,Virtual machine is marked as a template ); } elsif (ref($...@-detail) eq 'InvalidPowerState') { Util::trace(0, The attempted operation. cannot be performed in the current state ); } elsif (ref($...@-detail) eq 'InvalidState') { Util::trace(0,Current State of the . virtual machine is not supported for this operation); } else { Util::trace(0, VM ' .$_-name. ' can't be powered on \n . $@ . ); } } else { Util::trace(0, VM ' .$_-name. ' can't be powered on \n . $@ . ); } Util::disconnect(); exit 1; } } } sub poweroff_vm { foreach (@$vm_views) { my $mor_host = $_-runtime-host; my $hostname = Vim::get_view(mo_ref = $mor_host
Re: [Linux-HA] Automatic Cleanup of certain resources
Hello Andrew,

* Andrew Beekhof [EMAIL PROTECTED] [080117 09:13]:
On Jan 17, 2008, at 7:34 AM, Thomas Glanzmann wrote:
I use Linux HA to monitor some services on a dial-in machine, a so-called single node cluster. For example, sometimes my dial-in connection, openvpn connection or IPv6 connectivity does not come up. Is there a way to tell Linux-HA to retry a failed resource after a certain amount of time?

not yet but soon only in the last few days has the lrmd started exposing the timing data required in order to do this

Is this possible today? Can someone give me a short walk-through? Yesterday I had a problem where one of my tomcats didn't come up because of heavy load, and I had to clean up the resource manually. A retry every 10 minutes or every minute would have been sufficient. What do I have to do to get recent linux-ha packages for Debian Etch?

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Automatic Cleanup of certain resources
Hello Andrew,

is this possible today?

yes but only with pacemaker 0.7

Thanks a lot, I found the configuration option failure-timeout=60s (see the sketch below).

Can someone give me a short walk-through?

Look for Migrating Due to Failure in http://clusterlabs.org/mw/Image:Configuration_Explained_1.0.pdf

http://download.opensuse.org/repositories/server:/ha-clustering:/UNSTABLE/Debian_Etch/ (same heartbeat package as http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch/ but also has pacemaker 0.7)

Thanks a lot! I'll try them out and come back with the result.

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
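A minimal sketch of how failure-timeout would be used on a resource, in crm shell syntax. The resource name and timings are only examples, and the exact syntax may differ slightly between pacemaker versions:

primitive tomcat lsb:tomcat5 \
        op monitor interval=30s \
        meta migration-threshold=3 failure-timeout=600s
# failure-timeout lets the recorded failures expire after 10 minutes,
# so the resource can be retried on that node without a manual cleanup.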
Re: [Linux-HA] Announcement: heartbeat/pacemaker documentation in hg
Hello Dejan,

http://hg.clusterlabs.org/pacemaker/doc/archive/tip.tar.gz

I am unable to build this:

(ad027088pc) [/var/tmp/Pacemaker-Docs-80da5f68a837] make
/usr/lib/ocf/resource.d/heartbeat/AudibleAlarm: line 19: /resource.d/heartbeat/.ocf-shellfuncs: No such file or directory
-:1: parser error : Document is empty
^
-:1: parser error : Start tag expected, '<' not found
^
I/O error : Invalid seek
unable to parse -
xml/ra-AudibleAlarm.xml:1: parser error : Document is empty
^
xml/ra-AudibleAlarm.xml:1: parser error : Start tag expected, '<' not found
...

Are there any precompiled PDFs around?

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] heartbeat failover not working on hard drive error
Hello Coach-X (what a strange name),

This has happened several times. Nothing shows up in either log file, and a hard reboot brings the master back online. Is this caused by the serial link still being active? Is there a way to have this type of issue cause the slave to become active?

Exactly. I personally use 3ware RAID controllers with a RAID-1 (mirror) configured. I monitor these controllers with nagios and swap disks within 2 days if one dies. But you could also use a Linux software RAID and _sata_ not _pata_ disks to obtain the above.

Another way to detect disk failures would be a resource agent that does something like invalidating the buffer cache and running a find or ls on the filesystem, and to put that resource agent into the group that contains exim. The monitor action would be something like this:

if [ -f /var/run/resource-agent ]; then
        # resource is supposed to be running
        sync
        echo 3 > /proc/sys/vm/drop_caches
        ls / > /dev/null && exit 0 || exit 1
else
        exit 7
fi

See also http://linux-mm.org/Drop_Caches

I assume you use Linux; if you don't, find a reasonably supported RAID controller for your hardware architecture / OS.

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] HA maintenance mode
Hello Danny,

Would be really nice to have that as cluster command in HA or as hb_gui feature already available. Or just a switch to enable/disable failover for maintenance purposes.

It is already there; it is the default policy. I just can't be bothered to look it up in the manual right now, but maybe you are lucky and someone else will raise the word; otherwise you have to look it up yourself.

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] VLAN Trunk, IPaddr2, and static routes...
Hello Chris,

there is no need to put the VLAN logic into the resource agent. Just configure the interface _before_ and use it _afterwards_ (see the sketch below). I have had it running for ages on two different machines and it just works.

Thomas
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
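A minimal sketch of what I mean, assuming Debian with the vlan package, physical interface eth0 and VLAN 100 (interface names are placeholders):

# /etc/network/interfaces - bring the tagged interface up at boot, without an address
auto eth0.100
iface eth0.100 inet manual
        vlan-raw-device eth0
        up ip link set dev eth0.100 up

The IPaddr2 resource then simply points at the already existing VLAN interface with nic=eth0.100 (plus ip and cidr_netmask as usual), and the static routes can be attached to that interface as well.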
Re: [Linux-HA] external/ipmi example configuration
Hello Martin, it is pure luck that I am so bored that I read this list, next time CC me. :-) I have read several postings in the mail archive about the external/ipmi configuration but there are still some questions that bother me. The last posting from Thomas: did this cib-configuration worked with your 2-node cluster? I have to configure also 2 nodes and would like to use the ipmi-plugin but I am unsure if I understand what the plugin really does. I have the following configuration on two systems and I verified that this configuration works as it should be. Someone on this list told me that I can drop the location constraints, however I decided to keep them until I verified that. configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair name=stonith-enabled value=true id=stonith-enabled/ nvpair name=stonith-action value=reboot id=stonith-action/ /attributes /cluster_property_set /crm_config resources primitive id=apache-01-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=apache-01-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=apache-01-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=ia-apache-01-fencing attributes nvpair id=apache-01-fencing-hostname name=hostname value=apache-01/ nvpair id=apache-01-fencing-ipaddr name=ipaddr value=172.18.0.101/ nvpair id=apache-01-fencing-userid name=userid value=Administrator/ nvpair id=apache-01-fencing-passwd name=passwd value=whatever/ /attributes /instance_attributes /primitive primitive id=apache-02-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=apache-02-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=apache-02-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=ia-apache-02-fencing attributes nvpair id=apache-02-fencing-hostname name=hostname value=apache-02/ nvpair id=apache-02-fencing-ipaddr name=ipaddr value=172.18.0.102/ nvpair id=apache-02-fencing-userid name=userid value=Administrator/ nvpair id=apache-02-fencing-passwd name=passwd value=whatever/ /attributes /instance_attributes /primitive /resources constraints rsc_location id=apache-01-fencing-placement rsc=apache-01-fencing rule id=apache-01-fencing-placement-rule-1 score=-INFINITY expression id=apache-01-fencing-placement-exp-02 value=apache-02 attribute=#uname operation=ne/ /rule /rsc_location rsc_location id=apache-02-fencing-placement rsc=apache-02-fencing rule id=apache-02-fencing-placement-rule-1 score=-INFINITY expression id=apache-02-fencing-placement-exp-02 value=apache-01 attribute=#uname operation=ne/ /rule /rsc_location /constraints /configuration I killed heartbeat with -9 to simulate a node failure. To configure the plugin, I will create a resource for every node. This means, two additional resources in my cib.xml because I have two cluster-nodes. Correct. The attributes (nvpair) define variables for the ipmi-script, e.g. hostname... But what does the constraints tell me? If #uname is not equal tovalue then the score ist -INFINITY, i.e. the resource will never be started on that node? you pin ,,apache-01-fencing'' on apache-02 and ,,apache-02-fencing'' on apache-01. So that the resource that can stonith apache-01 runs on apache-02 and vice versa. Someone stated that heartbeat is able to do a suicide (stonith itself) but that isn't true at least not via stonith and not in version 2.1.3. 
The location constraints seem to be unnecessary, because if the fencing resource is running on the wrong node and that node misbehaves, it is restarted on the remaining node and then shoots the misbehaving one. However
Re: [Linux-HA] Compiling Heartbeat on Solaris10
Hello Ken, I am having trouble compiling Heartbeat 2.0.7 on a Solaris 10 system. I have tried SunStudio11 and gcc 3.3 and 4.0. Is there any information I can read that might help? It's complaining about Gmain_timeout_funcs in lib/clplumbing/GSource.c, if anyone has seen that before. First of all, use version 2.1.3. I am going to compile heartbeat for Solaris myself, but it might take some time. Once I am done, I will publish my Solaris packages. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Removing a node from cluster
Hello Franck, Suppose I have a 3-node cluster: node1, node2, node3. I want to remove node2 from the cluster to be able to perform various operations on node2 without any risk of resources moving to node2. I tried to figure it out with cibadmin or crm_resource but I don't get it.
# Put the node into standby mode
crm_standby -U node2 -v on
# Make the node active again (the two commands have the same effect)
crm_standby -U node2 -v off
crm_standby -D -U node2
Or you go to that node and type:
/etc/init.d/heartbeat stop
Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] resource script question (runlevel config)
Hello Amy, What about something like monit to make sure ssh is up and running and restart it if it crashes? Thanks for the pointer, a very interesting tool. I was looking for something like that but decided to write something myself; still, it sounds great, so maybe I will give it a try. http://www.tildeslash.com/monit/ Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ClusterIP
Hello, I would like to do a Cluster-IP setup with SLES 10. A few things are unclear to me. With ClusterIP you have one IP address that is shared by two or more nodes. It usually uses a multicast MAC address, and both nodes see all the traffic. But when one node goes down, how does the other node learn that it now has to handle all of the traffic and not only a part of it? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ClusterIP
Hello, thank you a lot for the feedback! Now I understand how the failover works. Does someone have a ready-to-use cib.xml that I could use for testing? I am going to try my luck right now and come back in an hour or so with my findings. It would be nice if someone could comment on them. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] propagate value similliar to pingd
Hello, I would like to write a script similar to pingd that is spawned and populates a value in the cib that I can build a rule on. What do I have to do to achieve this? Concrete questions are: - What do I have to put in the cib to spawn such an 'agent'? - How do I propagate the value into the cib? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
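(For illustration, a rough, untested sketch of such an 'agent', assuming the attrd_updater utility that ships with heartbeat 2.x - the same mechanism pingd uses to push a node attribute into the CIB status section. The attribute name, ping target, and the -n/-v/-d flags are assumptions to verify against your version.)

#!/bin/sh
# Hypothetical example: measure something and push the result as a node
# attribute so that rsc_location rules can match on it.
ATTRIBUTE=myvalue
while true; do
        # replace this test with whatever you actually want to measure
        if ping -c 1 -w 1 172.17.0.254 >/dev/null 2>&1; then
                VALUE=100
        else
                VALUE=0
        fi
        # -n attribute name, -v value, -d dampening interval
        attrd_updater -n $ATTRIBUTE -v $VALUE -d 30s
        sleep 10
done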
Re: [Linux-HA] ClusterIP
Hello again, here comes by cib.xml for a clusterip. But the ressource stickiness is not working for me. When I shoutdown ha-2, the two clone instances stay on ha-1. Any ideas? Before sending this e-mail I used the following command to set some location constraints: crm_resource -M -r ip0:0 -H ha-1 of course you can't do that when ip0:0 is already on ha-1 because crm_resource goes smart ass on me: (ha-1) [~] crm_resource -M -r ip0:0 -H ha-1 Error performing operation: ip0:0 is already active on ha-1 However it seems that location constraints or preferences are totally fine with cloned ressources. So it doesn't seem that I do need the ressource_stickiness. It doesn't work for me anyway. And that answers my other question, I guess. And the ressources come back on the right host after I simulate a power failure. Nice feature. I like it very much. (ha-1) [~] crm_mon -1 -r Last updated: Thu Feb 7 22:25:12 2008 Current DC: ha-1 (330da1b6-5f99-480a-b071-a144a98e1248) 2 Nodes configured. 1 Resources configured. Node: ha-2 (095256ab-361c-4b1e-9a8b-8bed74c4a7fb): online Node: ha-1 (330da1b6-5f99-480a-b071-a144a98e1248): online Full list of resources: Clone Set: clusterip-clone ip0:0 (heartbeat::ocf:IPaddr2): Started ha-1 ip0:1 (heartbeat::ocf:IPaddr2): Started ha-1 configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair name=ressource_stickiness value=0 id=ressource-stickiness/ /attributes /cluster_property_set /crm_config resources clone id=clusterip-clone meta_attributes id=clusterip-clone-ma attributes nvpair id=clusterip-clone-1 name=globally_unique value=false/ nvpair id=clusterip-clone-2 name=clone_max value=2/ nvpair id=clusterip-clone-3 name=clone_node_max value=2/ /attributes /meta_attributes primitive class=ocf provider=heartbeat type=IPaddr2 id=ip0 instance_attributes id=ia-ip0 attributes nvpair id=ia-ip0-1 name=ip value=157.163.248.193/ nvpair id=ia-ip0-2 name=cidr_netmask value=25/ nvpair id=ia-ip0-3 name=nic value=eth0/ nvpair id=ia-ip0-4 name=mac value=01:02:03:04:05:06/ nvpair id=ia-ip0-5 name=clusterip_hash value=sourceip-sourceport/ /attributes /instance_attributes operations op id=ip0-monitor0 name=monitor interval=60s timeout=120s start_delay=1m/ /operations /primitive /clone /resources /configuration Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ClusterIP
Hello Lars, Uhm, what do you think should happen when you shut down ha-2 - of course they stay on ha-1 in that case? I meant that I shut it down temporarily, and when it comes back again both clones stay on one node instead of one going back. I don't know what you're saying here ;-) I said that ressource_stickiness=0 does not work for me, so I used a location constraint to put ip:0 on ha-1 and ip:1 on ha-2 to get that behaviour. But as I write this e-mail I realize that I misspelled resource_stickiness: nvpair name=ressource_stickiness value=0 id=ressource-stickiness/ With resource stickiness, this should be spread across two nodes? Sure thing, if I manage to write it correctly. :-) This setting is wrong. globally_unique must be true for the cluster ip. Your configuration doesn't really work ;-) Okay. This is the right moment to ask what globally_unique is about anyway? I never got it, I just copied and pasted it. nvpair id=clusterip-clone-2 name=clone_max value=2/ You can drop this line, it defaults to the number of nodes anyway - unless, of course, you want to make it larger so you can do more fine-grained load control later. Thanks, I will do that. nvpair id=ia-ip0-4 name=mac value=01:02:03:04:05:06/ That's not a valid multicast MAC. I see. I thought every MAC address with the first bit set to one is a multicast MAC address. However, I used an autogenerated one, too, and got it working - but only on the same network. It seems that I have to set a static entry on the default router to really get it working. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
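(For illustration, the two corrected nvpairs discussed above - a sketch only. Note that when the stickiness is set cluster-wide in crm_config, the option may be called default_resource_stickiness in your heartbeat version, so double-check the property name.)

<nvpair id="resource-stickiness" name="resource_stickiness" value="0"/>
<nvpair id="clusterip-clone-1" name="globally_unique" value="true"/>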
Re: [Linux-HA] Samba and High Availability
Hello Christopher, Everything I have read about samba and HA made it seem like this was not possible. Are others doing this too? Can you think of some good tests to try to stress it (short of accessing a database or something). I imagine a fail-over during a large copy operation would fail, and I'll test that tomorrow. But for the moment, I'm just so psyched I had to tell somebody, and the dog couldn't care less. ;) well I am not a dog, but I do in fact care. So could you please elaborate a bit and post your cib.xml configuration and your ra for samba? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD 8.0 under Debian Etch?
Hello Fabiano, Short question: Does anyone here have DRBD8 running with heartbeat under Etch? I do and it works like a charm. Search the archives for the complete config, or drop me an e-mail and I will resend it to you together with a few things you should obey to get a perfect drbd setup. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] About configuring DRBD v8 on HA v2
Hello Stefano, it is not possible to configure drbd as a master/slave resource through the GUI. For a walkthrough use one of the following: http://article.gmane.org/gmane.linux.highavailability.user/22132 Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and Pingd
Hello Dominik, You can also start pingd from ha.cf with a respawn directive. Just as Steve did it. Works fine here and imho has the advantage of a pingd value being calculated when the constraints are applied (because pingd starts right away and not just when the crm comes alive). I see. My bad. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
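(For illustration, the ha.cf respawn line for pingd would look roughly like the sketch below - flags quoted from memory, so verify them against the pingd documentation for your release; -m is the score multiplier and -d the dampening delay.)

respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s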
Re: [Linux-HA] DRBD and Pingd
Hello Steve, attached is a working example for a postgres cluster. Put your filesystem, ip, database thing in a ressource group and drop the colocation and order constraints or you have to define your order rules on two directions. See also this thread: http://article.gmane.org/gmane.linux.highavailability.user/21811 Thomas use_logd yes bcast eth1 mcast eth0.2 239.0.0.2 694 1 0 node postgres-01 postgres-02 respawn hacluster /usr/lib/heartbeat/dopd apiauth dopd uid=hacluster gid=haclient watchdog /dev/watchdog ping 172.17.0.254 crm on postgres.xml Description: XML document ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and Pingd
Hi Steve, your cib.xml isn't working because you forgot to propagate the pingd values: you forgot to add the pingd clone resource to your cib.xml (a common mistake, I made it once myself), so your scores don't get propagated. Put this in the resources section of your cib: clone id=pingd-clone meta_attributes id=pingd-clone-ma attributes nvpair id=pingd-clone-1 name=globally_unique value=false/ /attributes /meta_attributes primitive id=pingd-child provider=heartbeat class=ocf type=pingd operations op id=pingd-child-monitor name=monitor interval=20s timeout=60s prereq=nothing/ op id=pingd-child-start name=start prereq=nothing/ /operations instance_attributes id=pingd_inst_attr attributes nvpair id=pingd-1 name=dampen value=60s/ nvpair id=pingd-2 name=multiplier value=100/ /attributes /instance_attributes /primitive /clone Double-check that the value gets propagated: (apache-01) [~] cibadmin -Q | grep name=\"pingd\" | grep value nvpair id=status-f5707ca9-2673-4edb-80e6-d7700efbd7f3-pingd name=pingd value=100/ nvpair id=status-47923b94-150d-45d5-a7f4-01f1aa607484-pingd name=pingd value=100/ Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
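(For illustration, a constraint that consumes the propagated value might look like the sketch below - a minimal example in the same CIB 1.0 rule syntax used elsewhere in this archive; resource and id names are placeholders, so adapt them. It keeps the resource off any node whose pingd attribute is missing or zero.)

<rsc_location id="my-resource-needs-connectivity" rsc="my-resource">
  <rule id="pingd-rule" score="-INFINITY" boolean_op="or">
    <expression id="pingd-undefined" attribute="pingd" operation="not_defined"/>
    <expression id="pingd-zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>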
Re: [Linux-HA] what to do on loss of network
Hello Kettunen, I have a SLES10 SP1 HA 2.0.8 split-site two-node cluster and I've configured a pingd clone resource to build resource location constraints. It works very well. My ping node is an iSCSI server in a third site from where the cluster node mounts its resource disk. If I disconnect all communication paths between the nodes, the active node correctly stops the resource because it also loses the ping node connection. But there is definitely a split brain going on (both think they are DC). I think that is okay. If you have a two-node cluster and the two nodes can't talk to each other for 90 seconds (or whatever the default timeout is), they both assume DC status. The only way around this is to configure quorum, but to be honest I never found out how to configure quorum; maybe someone could give me a walkthrough. I won't run into the split-brain scenario, because I have two redundant communication links between my nodes (switch and crosslink cable); maybe I am going to add a serial line, too. But for the time being it is more than fine. I also monitor my whole setup using nagios: - drbd - whether a DC is chosen and all resources are online - heartbeat links Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
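(For illustration only: in a CIB of this vintage the closest knob I know of is the no_quorum_policy cluster option; the sketch below shows how such an nvpair would sit in crm_config. The exact property name and accepted values vary between heartbeat/pacemaker releases, so treat this as an assumption and check your documentation.)

<cluster_property_set id="cib-bootstrap-options">
  <attributes>
    <!-- what the cluster should do when it loses quorum: ignore/stop/freeze;
         two-node clusters often have to use "ignore" -->
    <nvpair id="no-quorum-policy" name="no_quorum_policy" value="ignore"/>
  </attributes>
</cluster_property_set>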
Re: FW: [Linux-HA] what to do on loss of network
Hello Kettunen, Correction. I meant to say that split-brain detection should be done when the nodes see each other again (even at network level). CRM status messages do move when the connection between the nodes is back, but one node doesn't accept messages from the other node. I agree with you. They should stop their resources and renegotiate, or the other way around. But to be honest, I have never tried such a situation, though I easily could. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ordering constraints and node crash
Hello Marc, If I kill the node hosting postgresr2, postgresr2 migrates to another node, but applisr1 and applisr3 aren't restarted. Is it normal ? What could I do to solve this ? the answer to your question is 'resource group'. A resource group is a container for resources. Every resource in a resource group is started and stopped in order and they always have to run on the same host. If you want to build a resource group by yourself, you need order and colocation constraints but more than the obvious. See this thread: http://article.gmane.org/gmane.linux.highavailability.user/21811 Example (from my postgres database): group id=postgres-cluster primitive class=ocf provider=heartbeat type=Filesystem id=fs0 instance_attributes id=ia-fs0 attributes nvpair id=ia-fs0-1 name=fstype value=ext3/ nvpair name=directory id=ia-fs0-2 value=/srv/postgres/ nvpair id=ia-fs0-3 name=device value=/dev/drbd0/ /attributes /instance_attributes operations op id=fs0-monitor0 name=monitor interval=60s timeout=120s start_delay=1m/ /operations /primitive primitive class=ocf provider=heartbeat type=IPaddr2 id=ip0 instance_attributes id=ia-ip0 attributes nvpair id=ia-ip0-1 name=ip value=172.17.0.20/ nvpair id=ia-ip0-2 name=cidr_netmask value=24/ nvpair id=ia-ip0-3 name=nic value=eth0.2/ /attributes /instance_attributes operations op id=ip0-monitor0 name=monitor interval=60s timeout=120s start_delay=1m/ /operations /primitive primitive class=ocf provider=heartbeat type=pgsql id=pgsql0 instance_attributes id=ia-pgsql0 attributes nvpair id=ia-pgsql0-1 name=pgctl value=/usr/lib/postgresql/8.1/bin/pg_ctl/ nvpair id=ia-pgsql0-2 name=start_opt value=--config_file=/srv/postgres/etc/postgresql.conf/ nvpair id=ia-pgsql0-3 name=pgdata value=/srv/postgres/data/ nvpair id=ia-pgsql0-4 name=logfile value=/srv/postgres/postgresql.log/ /attributes /instance_attributes operations op id=pgsql0-monitor0 name=monitor interval=60s timeout=120s start_delay=1m/ op id=pgsql0-start0 name=start timeout=120s prereq=nothing/ /operations /primitive /group Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
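(For illustration, the 'more than the obvious' part - roughly what an order plus colocation pair for two of the primitives above would look like in this CIB 1.0 syntax. This is a sketch from memory: the exact attribute names of rsc_order/rsc_colocation changed between heartbeat releases, so verify them against your crm DTD before use.)

<constraints>
  <!-- start ip0 only after fs0 is up, and keep both on the same node -->
  <rsc_order id="order-fs0-before-ip0" from="ip0" action="start" type="after" to="fs0"/>
  <rsc_colocation id="colocate-ip0-with-fs0" from="ip0" to="fs0" score="INFINITY"/>
</constraints>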
Re: [Linux-HA] HOWTO: Build a high available iscsi Target using heartbeat, drbd and ietd for ESX Server 3.5
Hello Dejan, Nice effort. Thanks for sharing it. Perhaps you'd like to put this into the wiki.linux-ha.org. If you do, don't forget to pepper the doc with YMMV. I am going to do that. 1. cib is a bit too lean. There are no attributes set for the ietd resource. Well, I have a default resource agent which I use for all kinds of scenarios and I just adapt it. The only time I used attributes is when I had to. That was openvpn (I have two instances running and they use different config files). But I am going to fix that. 2. ietd RA is Linux specific. If it has to be then you should check if it runs on Linux and if not bail out with an appropriate message. It is Linux specific (at least to my knowledge). I will add that error message. 3. I understand that fixing various memory parameters is important for ietd performance, but that has no place in the RA. Placing those settings in the XML info as comment should suffice. The admins may choose different settings anyway. Actually I don't have a clue. I just looked at the example init script that was provided in the ietd distribution and adapted the information given in there. 4. There are various modprobe statements. Is that necessary? It should be better to assume that ietd init script has been run and then just add/remove new targets using the RA. I am unaware if that is possible. But most of the time (apache, nfs server) I work by shutting the whole thing down and starting it elsewhere, even if it would be possible to do it on a per-LUN basis. 5. Is this RA just an example? Yes, a working example. But to be honest I have three or four of these RAs in production. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
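(For illustration, the Linux check mentioned under point 2 could look roughly like the sketch below at the top of the RA - a hedged example, assuming the usual OCF return codes and the ocf_log helper from .ocf-shellfuncs are available.)

# bail out early on anything that is not Linux
case `uname -s` in
        Linux)
                ;;
        *)
                ocf_log err "the ietd RA only works on Linux"
                exit 5   # OCF_ERR_INSTALLED
                ;;
esac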
Re: [Linux-HA] HOWTO: Build a high available iscsi Target using heartbeat, drbd and ietd for ESX Server 3.5
Trent, I just did a very similar thing, except in my case I am using shared storage (MD3000 - SAS) and there's a bit more fun to that part of it (multipath, stonith, etc) - also I set up heartbeat in v1 mode, not CRM mode. Nice, I never had an MD3000 in my hands. I plan to post a walkthrough at some point in the future (I also set up SMB and NFS based storage) if anyone is interested. Well, I am for sure interested. I plan to provide some education on linux-ha myself, because for me it wasn't that easy to understand the existing resources. Interesting about the ScsiSN thing - I didn't see that readme file, and what I did to solve that problem was use a different LUN number on each target.. but this solution makes much more sense - thanks for the tip. I would always use different LUN numbers, too. However, in the future I am going to use different LUN numbers and different ScsiSNs. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] OCF test script (ocf-tester)
Hello Jeff, Please find attached the Nagios OCF script I wrote. Thank you for sharing.
monitor_nagios(){
        case ${NAGIOSRUNNING} in
                yes)
                        if [ -f ${OCF_RESKEY_pid} ]; then
                                echo ${0} MONITOR: running
                                exit 0
                        fi
                        ;;
                no)
                        if [ -f ${OCF_RESKEY_pid} ]; then
                                echo ${0} MONITOR: failed
                                exit 7
                        else
                                echo ${0} MONITOR: stopped
                                exit 7
                        fi
                        ;;
                *)
                        echo ${0} MONITOR: unknown status
                        exit 1
                        ;;
        esac
}
Something that caught my eye: the monitor action should return 0 if the resource is running, 7 if it is stopped, and anything else if it has failed. Source: http://www.linux-ha.org/OCFResourceAgent But your resource agent returns 7 when it has failed. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
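(For illustration, a corrected monitor function following those return codes might look like the sketch below - an untested example that keeps Jeff's NAGIOSRUNNING and OCF_RESKEY_pid variables as used above.)

monitor_nagios(){
        case ${NAGIOSRUNNING} in
                yes)
                        if [ -f ${OCF_RESKEY_pid} ]; then
                                echo "${0} MONITOR: running"
                                exit 0          # OCF_SUCCESS
                        fi
                        # pid file missing although nagios should be running
                        echo "${0} MONITOR: failed"
                        exit 1                  # OCF_ERR_GENERIC
                        ;;
                no)
                        echo "${0} MONITOR: stopped"
                        exit 7                  # OCF_NOT_RUNNING
                        ;;
                *)
                        echo "${0} MONITOR: unknown status"
                        exit 1                  # OCF_ERR_GENERIC
                        ;;
        esac
}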
Re: [Linux-HA] OCF test script (ocf-tester)
Hello Jeff, I am attempting to write an OCF compliant script for nagios. I have followed the documentation here: I attached the one I am using. Keep me posted if you do something different. Thomas #!/bin/bash . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs export PID=/var/run/nagios2/nagios2.pid export CONFIGFILE=/etc/nagios2/nagios.cfg export EXECUTABLE=/usr/sbin/nagios2 case $1 in start) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit 0; else rm -f ${PID} fi find /var/lib/nagios2/ -type f -print0 | xargs -0 rm ${EXECUTABLE} -d ${CONFIGFILE} ;; stop) if [ -f ${PID} ]; then kill `cat ${PID}` /dev/null fi rm -f ${PID} exit 0; ;; status) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit 0; fi exit 1; ;; monitor) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit 0; fi exit 7; ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=nagios version1.0/version longdesc lang=en OCF Ressource Agent for Nagios. /longdesc shortdesc lang=enOCF Ressource Agent for Nagios./shortdesc actions action name=start timeout=90 / action name=stop timeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END ;; esac ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Hello, You don't need location constraints. Okay. Could you elaborate, please? Does the stonith subsystem automatically know where to put them? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Hello Lars, If the node fails, and the other side needs STONITH, the resource will be started in that partition automatically. The location constraints don't hurt, but you don't need them. STONITH resources get started before any STONITH operation is performed, which has roughly the same effect. I see. Okay, then I will remove them. I only added them because I did not understand what happens and had merely scratched the surface. :-) And yes, on a stop failure, a node might decide to fence itself too. As stonithd is network aware, it doesn't matter where exactly in the cluster the STONITH resource runs. Good point. During my dozens of test setups at the beginning I did in fact have heartbeat instances that locked themselves up on a 'reboot', but I have never seen this problem again. Thanks for the elaboration on this topic. I hope that I have a good stonith implementation, but I thought about making stonith highly available itself: [ASCII diagram: Switch 01 and Switch 02 at the top, Stonith 1 and Stonith 2 below them, apache 01 and apache 02 at the bottom, with each apache node cross-connected to both stonith devices.] A stonith device would look like this: Atmel + Ethernet controller + a few optocouplers. It would receive a broadcast or multicast UDP frame, reset a component and send an acknowledge back. On the heartbeat side there would be an application written in C which opens a UDP socket, sends the request and waits 5 seconds for an answer. If it receives one, the stonith worked, otherwise not. A friend of mine could build the hardware in 5 days from scratch and I could write the software in 2 hours or so. Just a thought. Let's see if it becomes reality. One stonith device has ~ 10 reset lines. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Lars, Assuming that the fencing device can be reached from all nodes, it doesn't matter where they are put. Only if you have, say, a serial power switch which is only reachable from one node do you need location constraints. I have a two-node cluster. I use external/ipmi, which needs one instance per node. A node that is misbehaving can't stonith itself, can it? Is linux-ha smart enough to see that the one stonith resource has to run on the one node and the other on the other node? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Hello Dejan, http://developerbugs.linux-foundation.org/show_bug.cgi?id=1752 According to this, it does matter. There really is a check in stonithd which prevents a node from stonithing itself. So, I'd say that there should be a location constraint which says not to run a stonith resource on the same node which is to be fenced by that stonith resource. Otherwise, the stonith resource is going to be started, but it won't do its job should the need arise. Thank you a lot for the clarification. So I will leave my setup as it is. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Supervise but don't stop a resource
Hello, is it possible with linux-ha to supervise (monitor) a resource and restart it when it fails, but not stop it when heartbeat is stopped? I am thinking about the syslog daemon and sshd. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Colocations and orders
Hello Jochen, [ RESEND: Previous CIB was crap ] ipaddr drbd filesystem (for mounting drbd) apache tomcat find a cib.xml attached. I also attached two resource agents that I wrote myself and run on Debian Etch. Adapt for your need (hostnames and ip address; mountmount; drbd ressource). I hope that gets you going. Oh and a few more information: ha.cf: use_logd yes bcast eth1 mcast eth0.2 239.0.0.1 694 1 0 mcast eth0.3 239.0.0.1 694 1 0 node apache-01 apache-02 watchdog /dev/watchdog respawn hacluster /usr/lib/heartbeat/dopd apiauth dopd uid=hacluster gid=haclient ping 62.146.78.1 crm on drbd.conf: global { usage-count no; } common { syncer { rate 100M; } handlers { outdate-peer /usr/lib/heartbeat/drbd-peer-outdater; } } resource gcl { protocol C; startup { degr-wfc-timeout 120; } disk { on-io-error pass_on; fencing resource-only; } on apache-01 { device /dev/drbd0; disk /dev/sda3; address172.17.0.1:7788; meta-disk internal; } on apache-02 { device /dev/drbd0; disk /dev/sda3; address172.17.0.2:7788; meta-disk internal; } } /var/cfengine/inputs/update: ... if [ -x /sbin/drbdsetup ]; then chown root:haclient /sbin/drbdsetup /sbin/drbdmeta chmod 750 /sbin/drbdsetup /sbin/drbdmeta chmod u+s /sbin/drbdsetup /sbin/drbdmeta fi Thomas #!/bin/bash # . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs export PID=/var/run/apache2.pid export EXECUTABLE=/usr/sbin/apache2ctl case $1 in start) ${EXECUTABLE} start exit || exit 1; ;; stop) ${EXECUTABLE} stop exit || exit 1; ;; status) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit; fi exit 1; ;; monitor) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null { wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost/ exit || exit 1 } fi exit 7; ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=apachetg version1.0/version longdesc lang=en OCF Ressource Agent for Apache. /longdesc shortdesc lang=enOCF Ressource Agent for Apache./shortdesc actions action name=start timeout=90 / action name=stop timeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END ;; esac #!/bin/sh # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # OCF Ressource Agent on top of tomcat init script shipped with debian. # # Thomas Glanzmann --tg 21:22 07-12-30 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # This script manages a Heartbeat Tomcat instance # usage: $0 {start|stop|status|monitor|meta-data} # OCF exit codes are defined via ocf-shellfuncs . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs case $1 in start) /etc/init.d/tomcat5.5 start /dev/null 21 exit || exit 1 ;; stop) /etc/init.d/tomcat5.5 stop /dev/null 21 exit || exit 1 ;; status) /etc/init.d/tomcat5.5 status /dev/null 21 exit || exit 1 ;; monitor) # Check if Ressource is stopped /etc/init.d/tomcat5.5 status /dev/null 21 || exit 7 # Otherwise check services (XXX: Maybe loosen retry / timeout) wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:8180/eccar/ exit || exit 1 ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=tomcattg version1.0/version longdesc lang=en OCF Ressource Agent on top of tomcat init script shipped with debian. 
/longdesc shortdesc lang=enOCF Ressource Agent on top of tomcat init script shipped with debian./shortdesc actions action name=start timeout=90 / action name=stoptimeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END
Re: [Linux-HA] Colocations and orders
Hello Jochen, ipaddr drbd filesystem (for mounting drbd) apache tomcat find a cib.xml attached. I also attached two resource agents that I wrote myself and run on Debian Etch. Adapt for your need (hostnames and ip address). I hope that gets you going. Thomas ub-freiburg.xml Description: XML document #!/bin/bash # . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs export PID=/var/run/apache2.pid export EXECUTABLE=/usr/sbin/apache2ctl case $1 in start) ${EXECUTABLE} start exit || exit 1; ;; stop) ${EXECUTABLE} stop exit || exit 1; ;; status) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null exit; fi exit 1; ;; monitor) if [ -f ${PID} ]; then kill -0 `cat ${PID}` /dev/null { wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost/ exit || exit 1 } fi exit 7; ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=apachetg version1.0/version longdesc lang=en OCF Ressource Agent for Apache. /longdesc shortdesc lang=enOCF Ressource Agent for Apache./shortdesc actions action name=start timeout=90 / action name=stop timeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END ;; esac #!/bin/sh # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # OCF Ressource Agent on top of tomcat init script shipped with debian. # # Thomas Glanzmann --tg 21:22 07-12-30 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # This script manages a Heartbeat Tomcat instance # usage: $0 {start|stop|status|monitor|meta-data} # OCF exit codes are defined via ocf-shellfuncs . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs case $1 in start) /etc/init.d/tomcat5.5 start /dev/null 21 exit || exit 1 ;; stop) /etc/init.d/tomcat5.5 stop /dev/null 21 exit || exit 1 ;; status) /etc/init.d/tomcat5.5 status /dev/null 21 exit || exit 1 ;; monitor) # Check if Ressource is stopped /etc/init.d/tomcat5.5 status /dev/null 21 || exit 7 # Otherwise check services (XXX: Maybe loosen retry / timeout) wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:8180/eccar/ exit || exit 1 ;; meta-data) cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=tomcattg version1.0/version longdesc lang=en OCF Ressource Agent on top of tomcat init script shipped with debian. /longdesc shortdesc lang=enOCF Ressource Agent on top of tomcat init script shipped with debian./shortdesc actions action name=start timeout=90 / action name=stoptimeout=100 / action name=status timeout=60 / action name=monitor depth=0 timeout=30s interval=10s start-delay=10s / action name=meta-data timeout=5s / action name=validate-all timeout=20s / /actions /resource-agent END ;; esac ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] external/ipmi example configuration
Hello, the previous extern/ipmi configuration worked, but I don't know why. However here is one that seems to be follow standard practice: resources primitive id=postgres-01-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=postgres-01-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=postgres-01-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=postgres-01-fencing-ia attributes nvpair id=postgres-01-fencing-hostname name=hostname value=postgres-01/ nvpair id=postgres-01-fencing-ipaddr name=ipaddr value=172.18.0.121/ nvpair id=postgres-01-fencing-userid name=userid value=Administrator/ nvpair id=postgres-01-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive primitive id=postgres-02-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=postgres-02-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=postgres-02-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=postgres-02-fencing-ia attributes nvpair id=postgres-02-fencing-hostname name=hostname value=postgres-02/ nvpair id=postgres-02-fencing-ipaddr name=ipaddr value=172.18.0.122/ nvpair id=postgres-02-fencing-userid name=userid value=Administrator/ nvpair id=postgres-02-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive /resources constraints rsc_location id=postgres-01-fencing-placement rsc=postgres-01-fencing rule id=postgres-01-fencing-placement-rule-1 score=-INFINITY expression id=postgres-01-fencing-placement-exp-02 value=postgres-02 attribute=#uname operation=ne/ /rule /rsc_location rsc_location id=postgres-02-fencing-placement rsc=postgres-02-fencing rule id=postgres-02-fencing-placement-rule-1 score=-INFINITY expression id=postgres-02-fencing-placement-exp-02 value=postgres-01 attribute=#uname operation=ne/ /rule /rsc_location /constraints Last updated: Wed Jan 16 14:35:50 2008 Current DC: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef) 2 Nodes configured. 4 Resources configured. Node: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef): online Node: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018): online Full list of resources: Master/Slave Set: ms-drbd0 drbd0:0 (heartbeat::ocf:drbd): Master postgres-02 drbd0:1 (heartbeat::ocf:drbd): Started postgres-01 Resource Group: postgres-cluster fs0 (heartbeat::ocf:Filesystem):Started postgres-02 ip0 (heartbeat::ocf:IPaddr2): Started postgres-02 pgsql0 (heartbeat::ocf:pgsql): Started postgres-02 postgres-01-fencing (stonith:external/ipmi):Started postgres-02 postgres-02-fencing (stonith:external/ipmi):Started postgres-01 Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] detecting network isolation
Hello, I have a two-node test cluster. I added a ping statement to each of the nodes to ping the default network. The two nodes are connected to the same network segment and have a crosslink cable between them. When I unplug the cable of the node that is running the service, I see the following in the logs, but the services are not migrated over to the node that still has a good connection: Jan 17 05:50:56 ha-2 heartbeat: [4452]: WARN: node 10.0.0.1: is dead Jan 17 05:50:56 ha-2 heartbeat: [4452]: info: Link 10.0.0.1:10.0.0.1 dead. Jan 17 05:50:56 ha-2 crmd: [4470]: notice: crmd_ha_status_callback: Status update: Node 10.0.0.1 now has status [dead] Jan 17 05:50:56 ha-2 crmd: [4470]: WARN: get_uuid: Could not calculate UUID for 10.0.0.1 So what do I have to do to configure a 'failover on network isolation' scenario? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Automatic Clenaup of certain resources
Hello, I use Linux-HA to monitor some services on a dial-in machine, a so-called single-node cluster. For example, sometimes my dial-in connection, openvpn connection, or IPv6 connectivity does not come up. Is there a way to tell Linux-HA to retry a failed resource again after a certain amount of time? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
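(For illustration only - I am not aware of a built-in retry timer in heartbeat of this vintage, but the crm_resource -C cleanup command used elsewhere in this archive can be run periodically so that the CRM tries the failed resource again. A rough cron sketch with made-up resource and node names:)

# /etc/cron.d/ha-retry (hypothetical): clear the failed state every 15 minutes
*/15 * * * * root crm_resource -C -r openvpn0 -H dialin-01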
Re: [Linux-HA] detecting network isolation
Hello, Jan 17 05:50:56 ha-2 heartbeat: [4452]: WARN: node 10.0.0.1: is dead Jan 17 05:50:56 ha-2 heartbeat: [4452]: info: Link 10.0.0.1:10.0.0.1 dead. Jan 17 05:50:56 ha-2 crmd: [4470]: notice: crmd_ha_status_callback: Status update: Node 10.0.0.1 now has status [dead] Jan 17 05:50:56 ha-2 crmd: [4470]: WARN: get_uuid: Could not calculate UUID for 10.0.0.1 Okay, now I've got it. You need pingd to make that work, and you set the scores as described on this website: http://www.linux-ha.org/pingd Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Restart a Resource controlled by Heartbeat
Hello Boroczki, I'd rather use kill -HUP `pidof nagios` (or something similar) to reload the configuration of nagios. this is what I ended up doing. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Lars, Yes. You have more than one primitive within the clone, which doesn't work. Why do you do that? Because there is no documentation, the maintainer doesn't answer to e-mail and this was the only example that I found in the archives. And it seemed to work. But I guess I was just lucky. You could either clone a group, or just not clone the two; it's not needed. So could you please say that in plain xml? I still don't get it. Is that what you have in mind? - Don't use clone or group - One primitive per ipmi device - Location constraints primitive id=postgres-01-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=postgres-01-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=postgres-01-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes attributes nvpair id=postgres-01-fencing-hostname name=hostname value=postgres-01/ nvpair id=postgres-01-fencing-ipaddr name=ipaddr value=172.18.0.121/ nvpair id=postgres-01-fencing-userid name=userid value=Administrator/ nvpair id=postgres-01-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive primitive id=postgres-02-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=postgres-02-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=postgres-02-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes attributes nvpair id=postgres-02-fencing-hostname name=hostname value=postgres-02/ nvpair id=postgres-02-fencing-ipaddr name=ipaddr value=172.18.0.122/ nvpair id=postgres-02-fencing-userid name=userid value=Administrator/ nvpair id=postgres-02-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive constraints rsc_location id=postgres-01-fencing-placement rsc=postgres-01-fencing rule id=postgres-01-fencing-placement-1 score=INFINITY expression attribute=#uname operation=eq value=postgres-02/ /rule /rsc_location rsc_location id=postgres-02-fencing-placement rsc=postgres-02-fencing rule id=postgres-02-fencing-placement-2 score=INFINITY expression attribute=#uname operation=eq value=postgres-01/ /rule /rsc_location /constraints Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Running Linux-HA on a single node cluster
Hello, I have 9 machines configured as 6 clusters: ~ and I can't count. But I have a ninth server that does SMTP. It will soon go away and get an HA resource. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Running Linux-HA on a single node cluster
Hello Andrew, looks sane enough - though linux-ha is slightly heavy for just monitoring processes in a cluster-of-one. Any reason not to make it a four-node cluster? I have 9 machines configured as 6 clusters: - 2x apache (ha resources: router; openvpn; nagios; apache + mod_jk; drbd + nfs; fencing) - 4x tomcat (ha resources: tomcat; one cluster per node) - 2x postgres (ha resources: drbd + postgres) I decided to keep them as separate clusters so that they don't interfere with each other. If I turned the tomcats into a four-node cluster, what would it look like in XML? Four tomcats, each with a strong affinity to one node? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
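(For illustration, pinning each tomcat to its preferred node could look roughly like the sketch below, using the same rsc_location/rule syntax that appears elsewhere in this archive; ids, resource and node names are examples.)

<rsc_location id="tomcat-01-prefers-node-01" rsc="tomcat-01">
  <rule id="tomcat-01-prefers-node-01-rule" score="INFINITY">
    <expression id="tomcat-01-prefers-node-01-exp" attribute="#uname" operation="eq" value="tomcat-01"/>
  </rule>
</rsc_location>
<!-- ...and one such constraint per tomcat resource/node pair -->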
Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Hello Andrew, does that help? Yes, it does. I have a test cluster. I will write a pseudo plugin or use the ssh one to simulate the behaviour and come back to you when I have something to work with. I am still not sure how it works, but maybe I should simply start to read the source code. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.
Hello, could someone tell me what is wrong with that fencing configuration: Jan 13 11:38:48 apache-02 pengine: [13769]: ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned. Jan 13 11:38:48 apache-02 pengine: [13769]: info: process_pe_message: Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues. (apache-01) [/var/adm/syslog/2008/01/13] / crm_verify -L -V crm_verify[5917]: 2008/01/13_11:44:57 ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned. crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with equal score (+INFINITY) for running the listed resources (chose apache-02): crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with equal score (+INFINITY) for running the listed resources (chose apache-01): crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with equal score (+INFINITY) for running the listed resources (chose apache-01): Errors found during check: config not valid Here is my current fencing policy. But everything seems to work. I use the 2.1.3 version: configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair name=stonith-enabled value=true id=stonith-enabled/ nvpair name=stonith-action value=reboot id=stonith-action/ /attributes /cluster_property_set /crm_config resources clone id=fencing instance_attributes id=ia-fencing-01 attributes nvpair id=fencing-01 name=clone_max value=2/ nvpair id=fencing-02 name=clone_node_max value=1/ /attributes /instance_attributes primitive id=apache-01-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=apache-01-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=apache-01-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=ia-apache-01-fencing attributes nvpair id=apache-01-fencing-hostname name=hostname value=apache-01/ nvpair id=apache-01-fencing-ipaddr name=ipaddr value=172.18.0.101/ nvpair id=apache-01-fencing-userid name=userid value=Administrator/ nvpair id=apache-01-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive primitive id=apache-02-fencing class=stonith type=external/ipmi provider=heartbeat operations op id=apache-02-fencing-monitor name=monitor interval=60s timeout=20s prereq=nothing/ op id=apache-02-fencing-start name=start timeout=20s prereq=nothing/ /operations instance_attributes id=ia-apache-02-fencing attributes nvpair id=apache-02-fencing-hostname name=hostname value=apache-02/ nvpair id=apache-02-fencing-ipaddr name=ipaddr value=172.18.0.102/ nvpair id=apache-02-fencing-userid name=userid value=Administrator/ nvpair id=apache-02-fencing-passwd name=passwd value=password/ /attributes /instance_attributes /primitive /clone Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] MailTo Resource specified wrong?
Hello Kirby, WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect I guess if I had to guess, I'd probably delete the 'MailTo_6_mon' line... But I don't know if that'll affect the mail I get when heartbeat switches things around If you have a look at the resource agent /usr/lib/ocf/resource.d/heartbeat/MailTo you see that the status/monitor section is not needed to notify you. You may delete that monitor operation. However, if I look at the MailTo RA that comes with version 2.1.3, I see that the warning message is commented out:
MailToStatus () {
        # ocf_log warn "Don't stat/monitor me! MailTo is a pseudo resource agent, so the status reported may be incorrect"
        if ha_pseudo_resource MailTo_${OCF_RESOURCE_INSTANCE} monitor
        then
                echo running
                return $OCF_SUCCESS
        else
                echo stopped
                return $OCF_NOT_RUNNING
        fi
}
So you have more than one choice (the first one is the best):
- Update to a recent version (2.1.3)
- Kill the operation section including the monitor statement from your MailTo configuration
- Go to the resource agent and comment that warning out (as it is by default in more recent versions)
- Leave everything as it is and live with the warnings.
Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello, Honestly, I would not use this repository for my upgrades as - at least in the past - major changes have been introduced during the heartbeat 2.1.3 development; for example the constraints were heavily modified. I wouldn't use it for production either. But my point still stands: this repository is hard to use if there isn't a ready-to-go apt line to work with. And for new users ... and I was a new user two weeks ago ... it just gets in the way (first impressions count). Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Monitoring Apache (v2.0.8)
Hello Alon, I would update to 2.1.3 (though I am not sure if that is your problem). And increase the interval of the monitor operation; at the moment it seems to be scheduled every second. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello Michael, http://www.ultramonkey.org/download/heartbeat/2.1.3/ which Debian Release do you use? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello Michael, etch. debian_version is 4.0, apt-get update and upgrade done. The packages you are trying to use are for Debian Sid. You can do one of the following things:
- Put deb http://131.188.30.102/~sithglan/linux-ha-die2te/ ./ into /etc/apt/sources.list and call apt-get update; apt-get install heartbeat
- Build the packages yourself:
wget http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3-2.dsc
wget http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3-2.diff.gz
wget http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3.orig.tar.gz
dpkg-source -x heartbeat_2.1.3-2.dsc
cd heartbeat-2.1.3/
fakeroot debian/rules binary
And put them into a directory. I always use the following Makefile to create a Debian package repository:
TARGET=.
all:
	@touch Release
	@apt-ftparchive packages $(TARGET) | tee $(TARGET)/Packages | gzip -c > $(TARGET)/Packages.gz
Remember: the indentation before the '@'s consists of tabs, not spaces.
Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello, Is this a regularly updated repository with the heartbeat ldirectord packages (and only those packages)? yes, it is. But in the future the path will be deb http://131.188.30.102/~sithglan/ha/ Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello, btw. the problem was that I built the packages on a machine that had a sarge gnutls-dev installed. I upgraded the package and just rolled it out on 9 machines; everything is up and running. :-) Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello Andrew, http://download.opensuse.org/repositories/server:/ha-clustering/ Do you have an apt line to use that location? I tried to make something up but failed. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] auto_failback off, but the resource group still fails back.
Jason, just to make sure that we're on the same page here: - You have a two-node cluster - You have a resource that is running on only one node - When you run the resource on node b, node a reports a failed monitor for that resource? If that is the case then something is horribly wrong, because the monitor operation for a resource should only run on a node that is currently running the resource. What I thought before was the following: you start your resource, heartbeat tries to start it on node a, and the monitor that heartbeat runs right after the start attempt to verify that everything is all right fails. So heartbeat decides to run the resource on node b, calls monitor afterwards to verify, and everything is fine. And you end up with the failed monitor on node a reported in crm_mon -1 -r. So might it be possible that your monitor action _always_ fails on node a, even if the service is correctly started on that node? Try to start your service (with heartbeat stopped on both machines) and then call your resource agent with the monitor argument and see whether it does the right thing or not. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
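(For illustration, a quick manual check along those lines might look like the sketch below - the agent path is a placeholder; 0 means running, 7 stopped, anything else a failure.)

# with heartbeat stopped on both nodes, on node a:
/path/to/your/resource_agent start
/path/to/your/resource_agent monitor; echo $?   # expect 0 (running)
/path/to/your/resource_agent stop
/path/to/your/resource_agent monitor; echo $?   # expect 7 (stopped)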
Re: [Linux-HA] Re: problems with ha, drbd and filesystems resource
Hello Stephan, No ideas about the problem? I think the question was already answered by someone on the list: heartbeat doesn't support drbd-0.8 at the moment. E.g. you can run a primary/secondary cluster but not a primary/primary cluster. So someone who understands what he is doing has to adapt the drbd resource agent to handle the primary/primary scenario. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] auto_failback off, but the resource group still fails back.
Hello Jason, 1) It nominates a node as DC (in this case, node2, though I've seen both) 2) The 'failed actions' block gets these lines almost immediately: resource_samba_storage_monitor_0 (node=node2.domain.com, call=3, rc=9): Error resource_samba_storage_monitor_0 (node=node1.domain.com, call=3, rc=9): Error 3) then the resource group starts in order on node2. (IP, then storage, then daemon) Strange. I have the picture now, but I am still unsure where it is coming from. Can you update to version 2.1.3 and try again with the exact same configuration? Could you also send me the syslog of the daemon facility of one or both nodes (if possible only from a complete restart (heartbeat stopped on both nodes, then start them both ...) until the problem pops up)? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] debian and heartbeat
Hello, http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch/Packages the debian folks are good, but not quite that good.. see: http://ccrma.stanford.edu/planetccrma/man/man5/sources.list.5.html for details on how to set up a custom apt source. I read the manpage. I am still looking for a line that I can put in my /etc/apt/sources.list. Does someone have such a line? Does someone use that repository? If so, could they be so kind as to simply post that apt line and maybe publish it on the Download Heartbeat page? Then, if a new heartbeat user comes along, he just adds that line to his /etc/apt/sources.list, types: apt-get update apt-get install heartbeat and has an up-to-date heartbeat package? Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
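(For illustration only - an educated guess, not a confirmed answer from the list: since the Packages file sits directly under the Debian_Etch directory, the repository looks like a flat archive, so a line along these lines might work; verify it before relying on it.)

deb http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch ./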
Re: [Linux-HA] Can I use different interfaces in different nodes?
Hello, I want to set up a two-node httpd cluster with heartbeat, with the configuration listed below: That shouldn't be a problem, just adapt the ha.cf on each node to reflect the network card configuration. And one more question, can I use bcast in a VLAN environment? You can. I have it running on one:
(postgres-01) [~] cat /etc/ha.d/ha.cf
use_logd yes
bcast eth1
mcast eth0.2 239.0.0.2 694 1 0
node postgres-01 postgres-02
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd uid=hacluster gid=haclient
crm on
eth0.2 is a tagged VLAN interface which I use with a multicast statement. But broadcast is also possible. I use multicast because I have 6 heartbeat clusters on that subnet. Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] hb2: making xml manageable
Hello,

> 1. Without restarting or shutting down the cluster, and without editing the cib.xml file, how can I make a change to the cluster configuration (i.e. how can I use haresources2cib.py to generate an updated cib.xml and get the cluster to use it without a restart)?

I use 8-space indenting; with that it gets very readable. You can use cibadmin -Q to dump the configuration. Kick out the cib tags and the status section and you have a template to work with. Make sure that you have a unique identifier specified for each section. I use cibadmin -U -x /path/to/file.xml to update my configuration, which works quite well as long as you don't put resources into a resource group that were previously outside, or vice versa. If you want a fresh start you can always bring all cluster nodes down, run rm /var/lib/heartbeat/crm/*, bring them up again and run the above command to get things going again.

> 2. How can I override the defaults (such as timeout) in the resulting cib.xml file?

You just add a parameter and call cibadmin -U -x /path/to/file.

Here is my cheat sheet; you may find it helpful:

# linux ha
# Status:
crm_mon -1 -r
# Dump the XML tree
cibadmin -Q
# Add a single resource
cibadmin -o resources -C -x 01_drbd
# Update a single resource
cibadmin -o resources -U -x 02_filesystem
# Add a single constraint
cibadmin -o constraints -C -x 03_constraint_run_on
# Update using input produced from 'cibadmin -Q' minus the cib tags and without the status section
cibadmin -U -x postgres.xml
# Use the cluster default to determine if a resource should get started
crm_resource -r ms-drbd0 -v '#default' --meta -p target_role
# Migrate a resource to a host:
crm_resource -M -r postgres-cluster -H postgres-01
# A nice man page with many examples at the end
man crm_resource
# Check if a node is in standby
crm_standby -G -U postgres-01
crm_mon -1 -r
# Put a node into standby mode
crm_standby -U postgres-01 -v on
# Make a node active again (the two commands have the same effect)
crm_standby -U postgres-01 -v off
crm_standby -D -U postgres-01
# Cleanup a resource (retry to start after manual intervention)
crm_resource -C -r tomcat-02 -H tomcat-02
# Remove a statement by id if it happens to be in there twice (should be fixed upstream)
cibadmin -o resources -D -X '<op id="0a71bc1a-b460-49bb-9d0d-2fe3ada169b9" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>'
http://fghaas.wordpress.com/2007/10/04/checking-your-secondarys-integrity/
# Reload the CIB completely
# - remove the cib and status tags (including content) from your file
# - wipe the old content (Attention: don't do that in production, it takes your services down): cibadmin -E
cibadmin -U -x /path/to/profile.xml
http://www.mail-archive.com/linux-ha@lists.linux-ha.org/msg03187.html
http://www.linux-ha.org/HaNFS
http://www.linux-ha.org/DRBD/NFS
http://www.linux-ha.org/DRBD/HowTov2
drbdsetup /dev/drbd0 primary -o
# Split brain manual recovery
# Primary node:
drbdadm connect all
# Attach the split-brain node as secondary
drbdadm -- --discard-my-data connect
# Force the split-brain secondary to become primary:
drbdadm -- --overwrite-data-of-peer primary all
# Reload configuration
/etc/init.d/heartbeat reload
drbdadm adjust all
# Heartbeat broadcast link went missing (Attention: all services get stopped on the node)
/etc/init.d/heartbeat restart
http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/
# List configured OCF agents and do operations on them (backend command)
/usr/lib/heartbeat/lrmadmin
# XML template generator:
python /usr/lib/heartbeat/crm_primitive.py
# Delete the cluster configuration (Attention: not for production. The command has
# to be issued on all nodes and all HA services must be stopped on all nodes
# in the cluster)
rm /var/lib/heartbeat/crm/*

I attached an example postgres cluster configuration (postgres.xml).

Thomas
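P.S. A worked example of the dump/edit/update cycle described above, in case it helps; the file names and the timeout value are just placeholders:

# Dump the live configuration
cibadmin -Q > /tmp/cib-dump.xml
# Copy it to /tmp/update.xml, drop the <cib> wrapper and the <status> section,
# then change whatever you need, e.g. a monitor timeout from 120s to 180s.
# Push the change back into the running cluster
cibadmin -U -x /tmp/update.xml
# Verify
cibadmin -Q -o resources
crm_mon -1 -r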
Re: [Linux-HA] Howto list all available agents and their possible attributes
Hello Simon,

> I double checked and 2.1.3-2 does include both /usr/lib/stonith/plugins/external/ipmi and /usr/sbin/ciblint

I can confirm this. I used your diff, dsc and orig files to build a package for Debian Etch (4.0). I am going to roll out this version tonight on my production cluster (9 nodes). Thanks for fixing the issues.

Thomas
Re: [Linux-HA] coding bugfix for lib/plugins/stonith/ipmilan.c
Hello Dejan,

> Configuration is comparable to the external/ipmi. Just check the parameter names and adjust the stonith type.

I see. So there is no need to touch ha.cf? Just add the ipmilan primitive to the cib.xml and that's it?

> Thomas, if you could also do additional testing, that'd be great.

I will, but the thing is that I have to test it in a production environment, so I have to take it slow. I am happy that my production system does what it is supposed to do right now. :-) But on the weekends I can take two of the four tomcats offline and give ipmilan a try.

Thomas
Re: [Linux-HA] auto_failback off, but the resource group still fails back.
Hello Jason,

> 1) For the monitor action, I might suggest the docs be updated slightly. According to http://www.linux-ha.org/OCFResourceAgent, 0 for 'running', 7 for 'stopped', and anything else is valid, but indicates an error. I have modified my script to only return '1' on error. However, the same issue persists (Error in resource_samba_storage_monitor_0).

Can you try to run it manually? Your resource agent doesn't seem to need any arguments, so you should be able to do that:

- stop heartbeat on node2
- wait a while until you see the state again
- call: /path/to/resource_samba_storage monitor && echo Okay || echo Failed

I am pretty sure that this gives you a Failed. Then track it down and make sure that it returns Okay.

> 2) crm_resource -C does not seem to have any effect. I could not be looking in the right place though.

Make sure that you didn't misspell the name of the service or the name of the host. If you have your service running on node3 and call the cleanup, the failed monitor operation has to vanish from the output of crm_mon -1 -r.

> 3) The cibadmin -Q results are attached. Thanks.

Looks fine to me. What you could always try is the following; it wipes all states for sure:

- stop heartbeat on both nodes
- run rm /var/lib/heartbeat/crm/* on both nodes (that wipes your Linux-HA config including any logged states)
- start heartbeat on both nodes
- run:

cibadmin -U -X '
<configuration>
 <resources>
  <group id="group_samba">
   <primitive id="resource_samba_ip" class="ocf" type="IPaddr" provider="heartbeat">
    <instance_attributes id="resource_samba_ip_instance_attrs">
     <attributes>
      <nvpair id="a572b727-1aaf-43b5-b2c3-826305b6d533" name="ip" value="10.31.11.114"/>
      <nvpair id="3669be20-7782-432e-9504-ba65668186ca" name="nic" value="eth0"/>
      <nvpair id="14fbbd65-3019-43e9-8eb5-d9fb88deced2" name="cidr_netmask" value="255.255.255.0"/>
      <nvpair id="941cbbab-490d-48e7-88a8-3d4d62ff79a5" name="broadcast" value="10.31.11.255"/>
      <nvpair id="e44afc64-42cb-470e-b418-9a981991ee02" name="iflabel" value="vtqfs"/>
     </attributes>
    </instance_attributes>
    <operations>
     <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-ip"/>
    </operations>
   </primitive>
   <primitive id="resource_samba_storage" class="lsb" type="hb-vxvol" provider="heartbeat">
    <operations>
     <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-storage"/>
    </operations>
   </primitive>
   <primitive id="resource_samba_daemon" class="lsb" type="hb-samba" provider="heartbeat">
    <operations>
     <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-daemon"/>
    </operations>
   </primitive>
  </group>
 </resources>
</configuration>'

Oh, and what I don't get at all is the following: in your CIB you did not have a monitor action defined at all. At least with 2.1.3 that means your resource isn't monitored at all, with one exception: right after it is started. Could it be that your samba_storage RA forks off another process in the background and returns immediately? If that is the case, adapt the RA so it doesn't do that. It should only return when the service is running, and running means it should be in a state where monitor returns 0. The Postgres RA is a good example of this, I guess. I added monitor operations to your configuration. Nevertheless you should upgrade to 2.1.3 as soon as possible.

Thomas
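P.S. If the one-liner is not conclusive, a slightly more verbose check (same placeholder path; the codes are the ones from the OCF page quoted above):

/path/to/resource_samba_storage monitor
rc=$?
echo "monitor returned $rc"
# 0 = running, 7 = stopped, anything else = error.
# While the service is supposed to be up, anything other than 0 here is
# exactly what makes crm_mon show the failed monitor operation.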
Re: [Linux-HA] Failover iscsi SAN
Hello Michael,

Could you please send me your OCF resource agent for ietd and the output of cibadmin -Q without the status section? I would like to build such a setup myself. Have you tested any initiators with that setup? I would like to use it with ESX Server version 3.5, and would like to know whether the ESX 3.5 server keeps working when the service is switched over.

Thomas
Re: [Linux-HA] Howto list all available agents and their possible attributes
Hello Andrew,

> I believe it was considered too broken to continue shipping. None of us have the required hardware to test/fix/maintain the relevant code.

I think that belief is wrong. The external/ipmi plugin works out of the box and perfectly fine, at least for me. Only the documentation is missing, but once you get the idea of how to configure it, it is straightforward. When you build with the debian/rules file that is shipped with 2.1.3, the checklint tool (or whatever it is called) is missing and the version number is wrong, but the external/ipmi plugin is packaged. When I build from the dsc file from the website, the first two things are okay but external/ipmi is missing.

Thomas
Re: [Linux-HA] Failover iscsi SAN
Hello Niels,

> My personal experience with ietd is that it really doesn't like to be stopped if it is in use (i.e. kernel panics, kernel hangs etc.). I would do some careful testing before trying to use this in a heartbeat environment.

I just want a proof of concept, not a production system. I have also seen ietd panicking its host system, but only when I tried to access a block device using a raw device mapping from a VMware ESX 3.0.1 server. That was reproducible. As soon as I have something that works I will report back. I never had problems with starting and stopping, though. Btw, did you compile ietd yourself or did you use a distribution, for example SLES10 (that one ships an ietd)?

Thomas
[Linux-HA] drbd + ocfs2
Hello,

I would like to know how to set up a drbd + ocfs2 installation with two masters. Which OCF agent do I have to use for that? Does someone have a working example configuration? I would like to use heartbeat-2.1.3.

Thomas
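P.S. What I have so far is only the DRBD side. I assume the net section needs allow-two-primaries before OCFS2 can be mounted on both nodes at once; an untested sketch (device, disk and addresses are placeholders):

resource r0 {
  protocol C;
  net {
    allow-two-primaries;                 # needed for a dual-primary setup
    after-sb-0pri discard-zero-changes;  # split-brain policies, adjust to taste
    after-sb-1pri discard-secondary;
  }
  on node-a {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on node-b {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

After drbdadm adjust r0 and an initial sync, both nodes should be able to run drbdadm primary r0; the open question for me is still which OCF agent to put on top for the OCFS2 mount.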
Re: [Linux-HA] Jan 2 23:25:01 postgres-02 tengine: [8736]: ERROR: te_graph_trigger: Transition failed: terminated
Hello Andrew,

> Thanks - the PE is now smart enough to at least filter out the duplicates :-)

thanks a lot for getting rid of this annoying bug. :-)

Thomas
Re: [Linux-HA] Howto list all available agents and their possible attributes
Hello Simon,

> Thanks, I'll look into this. Though I was under the impression that the ipmi module was broken. Has it been fixed?

There are two ipmi modules:

- ipmilan (a C implementation that is not built by default)
- external/ipmi (a shell script)

The first one was indeed broken because it did not compile at the time 2.1.3 was released, but I saw a simple patch on the list (two brackets were missing). The shell implementation just works, at least in the tests I did (play dead fish in the water, wait 10 seconds, and the node gets power-cycled by the other one). And as you already found out, the Makefile in the external directory was from an older revision. Sorry, I should have mentioned it, since I had already found the problem. Anyway, thanks for fixing this. I am going to pull the new ones and build a package.

Thomas
Re: [Linux-HA] coding bugfix for lib/plugins/stonith/ipmilan.c
Hello,

> But it stops. If you have the machine with IPMI interface, could you test my patch?

Do you have a configuration for me? I have a machine with IPMI and the external/ipmi stonith works for me. If you can walk me through configuring ipmilan I can give it a spin.

Thomas
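P.S. If it helps, this is roughly how I would poke at it from the shell before touching the CIB; I have not used ipmilan yet, so treat the lines below as a sketch and take the parameter names from what the plugin itself reports:

# Ask the plugin which configuration names it expects
stonith -t ipmilan -n
# Check that the device answers, using the names printed above (values are placeholders)
stonith -t ipmilan -p "<name=value pairs from above>" -S
# Then a real reset of a test victim (this power-cycles the node!)
stonith -t ipmilan -p "<name=value pairs from above>" -T reset <nodename>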
Re: [Linux-HA] problems with ha, drbd and filesystems resource
Hello Stephan,

Could you please attach your config? cibadmin -Q output, with the status section dropped.

Thomas
Re: [Linux-HA] Linux-HA Service Monitoring
Hello Jayaprakash,

> I place the new script in /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs and execute the following commands.

Hopefully you did not do that literally, but from the output I can tell that you didn't: you put it where it belongs.

> If possible come online, we discuss in detail. My Yahoo/Gmail/MSN id: jp.aspm

I don't do instant messaging, except for e-mail. :-)

> I'm using Fedora 7 and Heartbeat version 2.1.2

I see. Okay, the problem is that the Fedora 7 init script is horribly broken. Could you please send me /etc/init.d/squid from your machine via e-mail? Send it directly to me, otherwise people are going to scream at us. :-) Then I will write you an OCF agent that doesn't rely on the init script shipped with Fedora 7.

Thomas
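P.S. The agent I have in mind would be a small self-contained script rather than a wrapper around /etc/init.d/squid; a rough sketch only (binary and pidfile paths are guesses until I see your init script, and meta-data is left out):

#!/bin/sh
# Minimal OCF-style agent sketch for squid -- not the final version.
SQUID=/usr/sbin/squid
PIDFILE=/var/run/squid.pid

squid_monitor() {
    # 0 = OCF_SUCCESS (running), 7 = OCF_NOT_RUNNING
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return 0
    return 7
}

case "$1" in
    start)
        squid_monitor && exit 0          # already running
        "$SQUID" && sleep 2
        squid_monitor; exit $?
        ;;
    stop)
        squid_monitor || exit 0          # already stopped
        "$SQUID" -k shutdown
        sleep 5
        squid_monitor && exit 1 || exit 0
        ;;
    monitor)
        squid_monitor; exit $?
        ;;
    *)
        exit 3                           # OCF_ERR_UNIMPLEMENTED
        ;;
esac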
Re: [Linux-HA] external/ipmi example configuration
Hello Dominik,

>> How can I test the stonith plugin eg. tell heartbeat to shoot someone?
> iptables -I INPUT -j DROP

Okay. That is obvious. Play dead fish in the water. Lucky me that I don't have a serial heartbeat. Thanks.

Thomas
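P.S. Spelled out, the test I am planning looks like this (the OUTPUT rule is my own addition so the victim really goes quiet; the log path depends on your setup):

# On the victim node: make it look dead without actually crashing it
iptables -I INPUT -j DROP
iptables -I OUTPUT -j DROP
# On the surviving node: watch heartbeat notice the dead peer and fence it
tail -f /var/log/ha-log
# Clean up on the victim afterwards (a successful power cycle clears the rules anyway)
iptables -F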
Re: [Linux-HA] external/ipmi example configuration
Hello Dejan,

I searched the archive, but I only looked for ipmi in the subject; now that you mentioned it I searched for external stonith and found an example.

> See http://linux-ha.org/ExternalStonithPlugins for an example. You can also search the archive of this list for more examples.

I read that page over and over but did not get it. Now I think I do: I need one primitive per IPMI device. That was the information I was missing. So this should do the job, shouldn't it?

<clone id="DoFencing">
 <instance_attributes>
  <attributes>
   <nvpair name="clone_max" value="2"/>
   <nvpair name="clone_node_max" value="1"/>
  </attributes>
 </instance_attributes>
 <primitive id="postgres-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
  <operations>
   <op id="postgres-01-fencing-monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
   <op id="postgres-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
  </operations>
  <instance_attributes>
   <attributes>
    <nvpair id="postgres-01-fencing-hostname" name="hostname" value="postgres-01"/>
    <nvpair id="postgres-01-fencing-ipaddr" name="ipaddr" value="172.18.0.121"/>
    <nvpair id="postgres-01-fencing-userid" name="userid" value="Administrator"/>
    <nvpair id="postgres-01-fencing-passwd" name="passwd" value="whatever"/>
   </attributes>
  </instance_attributes>
 </primitive>
 <primitive id="postgres-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
  <operations>
   <op id="postgres-02-fencing-monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
   <op id="postgres-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
  </operations>
  <instance_attributes>
   <attributes>
    <nvpair id="postgres-02-fencing-hostname" name="hostname" value="postgres-02"/>
    <nvpair id="postgres-02-fencing-ipaddr" name="ipaddr" value="172.18.0.122"/>
    <nvpair id="postgres-02-fencing-userid" name="userid" value="Administrator"/>
    <nvpair id="postgres-02-fencing-passwd" name="passwd" value="whatever"/>
   </attributes>
  </instance_attributes>
 </primitive>
</clone>

How can I test the stonith plugin, e.g. tell heartbeat to shoot someone?

Thomas
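P.S. Before putting the clone into the CIB I will probably test each instance from the shell. I am not sure which parameter syntax the 2.1.x stonith binary expects (direct name=value pairs vs. -p), so treat this only as a sketch reusing the values from above:

# Show which parameters external/ipmi expects
stonith -t external/ipmi -n
# Check that the BMC of postgres-02 answers
stonith -t external/ipmi hostname=postgres-02 ipaddr=172.18.0.122 userid=Administrator passwd=whatever -S
# And the real thing -- this power-cycles postgres-02!
stonith -t external/ipmi hostname=postgres-02 ipaddr=172.18.0.122 userid=Administrator passwd=whatever -T reset postgres-02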