Hello 

We have a corosync-pacemaker cluster with two nodes (mutual takeover).
All resource locations on a node depend on that node's network
connectivity: pingd-dependent rules for the IP interfaces in a base
group, and then colocations for all other groups.
System:
Pacemaker 1.0.5-4.1 with heartbeat-3.0.0-33.2
OS: Scientific Linux release 5.1 (Red Hat)

Until yesterday the ping OCF agent ($OCF_ROOT/pacemaker/ping) was running fine.
The /var/log/.../messages file lists each call:
Apr  8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 
Apr  8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 
Apr  8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 

Then we had a strange problem: the kernel module for our external
storage did a rescan, and afterwards the devices configured in pacemaker
were different. The takeover failed and the cluster crashed. We solved the
device problem, but then we had a completely different problem: the
$OCF_ROOT/pacemaker/ping OCF script could no longer calculate the value for
our "gateway_reachable" variable. Until now the .../messages output was:
Apr  8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60 
Apr  8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60 
Apr  8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v -d 60 

attrd_updater took the string "-d" as the value of the -v parameter
instead of a numeric value!
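
To illustrate how an empty, unquoted $score can lead to exactly this, here is a minimal sketch. fake_updater is a hypothetical stand-in that parses -n/-v/-d with the shell's getopts; it is not the real attrd_updater, and the assumption is only that the real tool consumes the next argument as the option value in the same way.

```sh
#!/bin/sh
# Hypothetical stand-in for attrd_updater: it only parses -n/-v/-d with
# getopts and prints what it understood. It is NOT the real tool.
fake_updater() {
    OPTIND=1
    name= value= dampen=
    while getopts "n:v:d:" opt; do
        case $opt in
            n) name=$OPTARG ;;
            v) value=$OPTARG ;;
            d) dampen=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    echo "name=$name value=$value dampen=$dampen"
}

score=""    # what the agent ends up with when the expr call prints nothing

# Unquoted: the empty $score vanishes, so -v takes "-d" as its argument.
unquoted=`fake_updater -n gateway_reachable -v $score -d 60`

# Quoted: -v receives an (empty) argument of its own and -d keeps its value.
quoted=`fake_updater -n gateway_reachable -v "$score" -d 60`

echo "$unquoted"
echo "$quoted"
```

With the unquoted call the stand-in reports the value "-d" and no dampen setting, which matches the value="-d" seen in the CIB status.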
Output of cibadmin -Q -o status (the ping is configured as a clone
resource):

<lrm_resource id="pingd_gfadb_04_05_gateway_reachable:0" type="ping"
class="ocf" provider="pacemaker">
    <lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:0_monitor_0"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="23:60:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;23:60:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="100" rc-code="0" op-status="0" interval="0"
last-run="1270797464" last-rc-change="1270797464" exec-time="1030"
queue-time="0" op-digest="edb8d3f7f7ce26e41a6b480cd270b8a6"/>
    <lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:0_monitor_5000"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="80:61:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;80:61:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="101" rc-code="0" op-status="0" interval="5000"
last-run="1270797465" last-rc-change="1270797465" exec-time="1020"
queue-time="0" op-digest="30d77769dbf037ed6c3c9dc19c035a5e"/>
    <nvpair
id="status-2f8438aa-c106-48c7-8422-2473bbf3edfd-gateway_reachable"
name="gateway_reachable" value="-d"/>
    <lrm_resource id="pingd_gfadb_04_05_gateway_reachable:1" type="ping"
class="ocf" provider="pacemaker">
        <lrm_rsc_op id="pingd_gfadb_04_05_gateway_reachable:1_monitor_0"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="24:58:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;24:58:7:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="78" rc-code="0" op-status="0" interval="0"
last-run="1270797454" last-rc-change="1270797454" exec-time="1020"
queue-time="0" op-digest="edb8d3f7f7ce26e41a6b480cd270b8a6"/>
        <lrm_rsc_op
id="pingd_gfadb_04_05_gateway_reachable:1_monitor_5000"
operation="monitor" crm-debug-origin="build_active_RAs"
crm_feature_set="3.0.1"
transition-key="82:59:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
transition-magic="0:0;82:59:0:7f9abdfe-07d4-4874-a52c-0453b26c9ce1"
call-id="79" rc-code="0" op-status="0" interval="5000"
last-run="1270797462" last-rc-change="1270797456" exec-time="1020"
queue-time="0" op-digest="30d77769dbf037ed6c3c9dc19c035a5e"/>
        <nvpair
id="status-0f581a48-88e0-45ff-9beb-6895c7fab14d-gateway_reachable"
name="gateway_reachable" value="100"/>


The relevant section of the $OCF_ROOT/pacemaker/ping is:
221 ping_update() {
222     active=0
223     for host in $OCF_RESKEY_host_list; do
224         p_exe=ping
225
226         case `uname` in
227             Linux) p_args="-n -q -W $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts";;
228             Darwin) p_args="-n -q -t $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts -o";;
229             *) ocf_log err "Unknown host type: `uname`"; exit $OCF_ERR_INSTALLED;;
230         esac
231
232         case $host in
233             *:*) p_exe=ping6
234         esac
235
236         p_out=`$p_exe $p_args $OCF_RESKEY_options $host 2>&1`; rc=$?
237
238         case $rc in
239             0) active=`expr $active + 1`;;
240             1) ocf_log debug "$host is inactive: $p_out";;
241             *) ocf_log err "Unexpected result for '$p_exe $p_args $OCF_RESKEY_options $host' $rc: $p_out";;
242         esac
243     done
244     score=`expr $active * $OCF_RESKEY_multiplier`
245     attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
246     
247 }

Because the "*" in line 244 is not escaped, the score ends up as an empty
string when $active is "0". We corrected this to:
244     score=`expr $active \* $OCF_RESKEY_multiplier` 
After that the script calculated the correct value, and syslog again
shows:
Apr  8 21:42:19 gfadb05 attrd_updater: [29208]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 
Apr  8 21:42:25 gfadb05 attrd_updater: [29233]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 
Apr  8 21:42:31 gfadb05 attrd_updater: [29315]: info: Invoked:
attrd_updater -n gateway_reachable -v 100 -d 60 

Since this fix, the variable "gateway_reachable" contains the correct
value and the cluster behaves as before.
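
One possible explanation, offered as an assumption and not a confirmed diagnosis: an unescaped "*" is subject to the shell's pathname expansion, so the result of the expr call can depend on which files exist in the agent's current working directory. The sketch below uses made-up file names and a temporary directory (created with mktemp -d, which is assumed to be available) to show the effect.

```sh
#!/bin/sh
# Sketch: how the shell treats an unescaped "*" in an expr call.
active=1
multiplier=100

workdir=`mktemp -d`
cd "$workdir" || exit 1

# Empty directory: the glob matches nothing, so "*" is passed literally.
empty_dir=`expr $active * $multiplier 2>/dev/null`

# Two files present: "*" expands to their names, expr reports a syntax
# error on stderr and prints nothing on stdout.
touch file_a file_b
with_files=`expr $active * $multiplier 2>/dev/null`

# Escaped: expr always receives a literal "*".
escaped=`expr $active \* $multiplier`

echo "empty dir:  '$empty_dir'"
echo "with files: '$with_files'"
echo "escaped:    '$escaped'"

cd / && rm -rf "$workdir"
```

If this is what happened here, the crash and restart may simply have left the agent running in a directory that contains files, which would explain why the same script worked before.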

Does anybody have an idea what could have happened?
