Hello
We have a corosync-pacemaker cluster with two nodes (mutual takeover).
All resource locations on a node depend on that node's network
connectivity: pingd-dependent rules for the IP interfaces in a base
group, and then colocations for all other groups.
System:
Pacemaker 1.0.5-4.1 with heartbeat-3.0.0-33.2
OS: Scientific Linux release 5.1 (Red Hat)
Until yesterday the ping OCF resource agent ($OCF_ROOT/pacemaker/ping)
was running fine. The /var/log/.../messages file lists each call:
Apr 8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Apr 8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Apr 8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Then we had a strange problem: the kernel module for our external
storage did a rescan, and afterwards the devices configured in pacemaker
were different. The takeover failed and the cluster crashed. We solved the
device problem, but then we had a completely different problem: the
$OCF_ROOT/pacemaker/ping OCF script could no longer calculate the value
for our "gateway_reachable" variable. From then on the .../messages
output was:
Apr 8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60
Apr 8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60
Apr 8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60
attrd_updater took the string "-d" as the value of the -v parameter
instead of a numeric value!
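The shift of "-d" into the value position can be reproduced without the cluster. The following is only a sketch: count_args and value_after_v are hypothetical helpers that stand in for attrd_updater, and the empty score is assumed to match what ping_update produced.

```shell
#!/bin/sh
# Hypothetical stand-in for attrd_updater: report how many arguments arrived.
count_args() {
    printf '%s\n' "$#"
}

# Hypothetical helper: print the argument that directly follows -v.
value_after_v() {
    while [ "$#" -gt 1 ]; do
        if [ "$1" = "-v" ]; then
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1
}

score=""    # assumption: the score calculation produced an empty string

# Unquoted: the empty expansion disappears, so "-d" moves next to "-v".
unquoted_count=$(count_args -n gateway_reachable -v $score -d 60)
unquoted_value=$(value_after_v -n gateway_reachable -v $score -d 60)

# Quoted: the empty string is still passed as its own argument.
quoted_count=$(count_args -n gateway_reachable -v "$score" -d 60)
quoted_value=$(value_after_v -n gateway_reachable -v "$score" -d 60)

echo "unquoted: $unquoted_count args, -v is followed by '$unquoted_value'"
echo "quoted:   $quoted_count args, -v is followed by '$quoted_value'"
```

With the unquoted form the command sees one argument fewer, which matches the "-v -d 60" seen in the log lines above.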
Output of cibadmin -Q -o status (the ping is configured as a clone
resource):
<lrm_resource id="pingd_gfadb_04_05_gateway_reachable:0" type="ping"
class="ocf" provider="pacemaker">
<lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:0_monitor_0"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="23:60:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;23:60:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="100" rc-code="0" op-status="0" interval="0"
last-run="1270797464" last-rc-change="1270797464" exec-time="1030"
queue-time="0" op-digest="edb8d3f7f7ce26e41a6b480cd270b8a6"/>
<lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:0_monitor_5000"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="80:61:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;80:61:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="101" rc-code="0" op-status="0" interval="5000"
last-run="1270797465" last-rc-change="1270797465" exec-time="1020"
queue-time="0" op-digest="30d77769dbf037ed6c3c9dc19c035a5e"/>
<nvpair
id="status-2f8438aa-c106-48c7-8422-2473bbf3edfd-gateway_reachable"
name="gateway_reachable" value="-d"/>
<lrm_resource id="pingd_gfadb_04_05_gateway_reachable:1" type="ping"
class="ocf" provider="pacemaker">
<lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:1_monitor_0"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="24:58:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;24:58:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="78" rc-code="0" op-status="0" interval="0"
last-run="1270797454" last-rc-change="1270797454" exec-time="1020"
queue-time="0" op-digest="edb8d3f7f7ce26e41a6b480cd270b8a6"/>
<lrm_rsc_op
id="pingd_gfadb_04_05_gateway_reachable:1_monitor_5000"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="82:59:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;82:59:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="79" rc-code="0" op-status="0" interval="5000"
last-run="1270797462" last-rc-change="1270797456" exec-time="1020"
queue-time="0" op-digest="30d77769dbf037ed6c3c9dc19c035a5e"/>
<nvpair
id="status-0f581a48-88e0-45ff-9beb-6895c7fab14d-gateway_reachable"
name="gateway_reachable" value="100"/>
The relevant section of the $OCF_ROOT/pacemaker/ping is:
221 ping_update() {
222 active=0
223 for host in $OCF_RESKEY_host_list; do
224 p_exe=ping
225
226 case `uname` in
227 Linux) p_args="-n -q -W $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts";;
228 Darwin) p_args="-n -q -t $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts -o";;
229 *) ocf_log err "Unknown host type: `uname`"; exit $OCF_ERR_INSTALLED;;
230 esac
231
232 case $host in
233 *:*) p_exe=ping6
234 esac
235
236 p_out=`$p_exe $p_args $OCF_RESKEY_options $host 2>&1`; rc=$?
237
238 case $rc in
239 0) active=`expr $active + 1`;;
240 1) ocf_log debug "$host is inactive: $p_out";;
241 *) ocf_log err "Unexpected result for '$p_exe $p_args $OCF_RESKEY_options $host' $rc: $p_out";;
242 esac
243 done
244 score=`expr $active * $OCF_RESKEY_multiplier`
245 attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
246
247 }
Because the "*" in line 244 is not escaped, the shell can expand it as a
filename pattern before expr runs, and the score ends up as an empty
string. We corrected this to:
244 score=`expr $active \* $OCF_RESKEY_multiplier`
After that the script calculated the correct value, and syslog again
shows:
Apr 8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Apr 8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Apr 8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60
Since this fix, the variable "gateway_reachable" contains the correct
value and the cluster behaves as before.
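For reference, the effect of an unescaped "*" in an expr call can be reproduced in a scratch directory. This is a minimal sketch under assumptions: the file names are made up, and multiplier stands in for $OCF_RESKEY_multiplier. It does not show which directory the resource agent was actually running in.

```shell
#!/bin/sh
# Scratch directory with two files, so that * has something to match.
workdir=$(mktemp -d)
touch "$workdir/file_a" "$workdir/file_b"

active=1
multiplier=100    # assumption: stands in for $OCF_RESKEY_multiplier

# Unescaped: the shell replaces * with "file_a file_b" before expr runs,
# so expr fails and prints nothing on stdout.
unescaped=$(cd "$workdir" && expr $active * $multiplier 2>/dev/null)

# Escaped: expr receives a literal * and multiplies.
escaped=$(cd "$workdir" && expr $active \* $multiplier)

# POSIX arithmetic expansion needs no escaping at all.
arith=$((active * multiplier))

rm -rf "$workdir"

echo "unescaped='$unescaped' escaped='$escaped' arith='$arith'"
```

In a directory where the pattern matches nothing, the shell passes "*" through unchanged, so the unescaped form can appear to work for a long time. That might be why the problem only showed up after the crash, but this is a guess.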
Does anybody have an idea what could have happened?