Hi,
I've been happily running a cluster of eight SLES10 machines using the
standard SLES10 Service Pack 1 heartbeat-2.0.8-0.19 RPMs. But after
adding two more machines, I'm now running into problems with clone
resources. (I get the same behaviour with the latest
heartbeat-2.1.2-18.1 from the openSUSE Build Service.)
The simplest test case I can manage is to prepare a basic cluster of 10
nodes, add evms to ha.cf:
apiauth evms uid=root gid=haclient
respawn root /sbin/evmsd
then inject this clone definition into the CIB:
<clone id="evmscloneset" notify="true" globally_unique="false">
  <instance_attributes id="evmscloneset">
    <attributes>
      <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="evmsclone" class="ocf" type="EvmsSCC"
             provider="heartbeat"/>
</clone>
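(For anyone reproducing this: I loaded the fragment with cibadmin, roughly as below. This assumes the XML is saved as /tmp/evmsclone.xml; check cibadmin's man page for your version, as the options have moved around between releases.)

```
cibadmin -C -o resources -x /tmp/evmsclone.xml
```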
As expected, in no time at all, a bunch of evmsclone resources appear and
start themselves on each machine:
evmsclone:0 (heartbeat::ocf:EvmsSCC): Started xp1tbkec2
evmsclone:1 (heartbeat::ocf:EvmsSCC): Started xp1tbkec1
evmsclone:2 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb1
evmsclone:3 (heartbeat::ocf:EvmsSCC): Started xp1tbkea2
evmsclone:4 (heartbeat::ocf:EvmsSCC): Started xp1tbkea1
evmsclone:5 (heartbeat::ocf:EvmsSCC): Started xp1tdbma1
evmsclone:6 (heartbeat::ocf:EvmsSCC): Started xp1tfrea1
evmsclone:7 (heartbeat::ocf:EvmsSCC): Started xp1tdbma2
evmsclone:8 (heartbeat::ocf:EvmsSCC): Started xp1tfrea2
evmsclone:9 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb2
But if I now bounce one node (or stop and restart heartbeat), then as
soon as it starts to come up again, I get this in the log from the DC:
tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
evmsclone:2_start_0 (32) confirmed on xp1tbkeb2 (rc=0)
tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action
34 fired and confirmed
tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action
37 fired and confirmed
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 48: evmsclone:0_post_notify_start_0 on xp1tbkec2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 52: evmsclone:1_post_notify_start_0 on xp1tbkec1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 56: evmsclone:3_post_notify_start_0 on xp1tbkeb1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 60: evmsclone:4_post_notify_start_0 on xp1tbkea2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 64: evmsclone:5_post_notify_start_0 on xp1tbkea1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 68: evmsclone:6_post_notify_start_0 on xp1tdbma2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 72: evmsclone:7_post_notify_start_0 on xp1tdbma1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 76: evmsclone:8_post_notify_start_0 on xp1tfrea2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 80: evmsclone:9_post_notify_start_0 on xp1tfrea1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating
action 83: evmsclone:2_post_notify_start_0 on xp1tbkeb2
crmd[7374]: 2007/11/08_10:04:46 info: do_lrm_rsc_op: Performing
op=evmsclone:9_notify_0 key=80:10:823cef09-7dbd-4819-baa8-f91910dd8f35)
lrmd[7371]: 2007/11/08_10:04:46 info: rsc:evmsclone:9: notify
crmd[7374]: 2007/11/08_10:04:46 info: process_lrm_event: LRM operation
evmsclone:9_notify_0 (call=8, rc=0) complete
tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
evmsclone:9_post_notify_start_0 (80) confirmed on xp1tfrea1 (rc=0)
tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action
evmsclone:1_post_notify_start_0 (52) confirmed on xp1tbkec1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:4_post_notify_start_0 (60) confirmed on xp1tbkea2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:0_post_notify_start_0 (48) confirmed on xp1tbkec2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:8_post_notify_start_0 (76) confirmed on xp1tfrea2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:3_post_notify_start_0 (56) confirmed on xp1tbkeb1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:7_post_notify_start_0 (72) confirmed on xp1tdbma1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:5_post_notify_start_0 (64) confirmed on xp1tbkea1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:6_post_notify_start_0 (68) confirmed on xp1tdbma2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action
evmsclone:2_post_notify_start_0 (83) confirmed on xp1tbkeb2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: te_pseudo_action: Pseudo action
38 fired and confirmed
tengine[7406]: 2007/11/08_10:04:47 info: run_graph: Transition 10:
(Complete=29, Pending=0, Fired=0, Skipped=0, Incomplete=0)
crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_IPC_MESSAGE origin=route_message ]
crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: All 10 cluster
nodes are eligible to run resources.
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tdbma2 is online
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkec1 is online
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkec2 is online
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tfrea1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tfrea1 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkeb1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tbkeb1 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tfrea2 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tfrea2 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkeb2 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tbkeb2 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkea1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tbkea1 to evmsclone:4
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tbkea2 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally
renamed evmsclone:0 on xp1tbkea2 to evmsclone:7
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node
xp1tdbma1 is online
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected
char: : (9)
That last line repeats forever (until I kill things and clean out the
CIB and orphaned resources).
I think I can sort of see what's going wrong: for whatever reason,
something wants to increment the clone number in "evmsclone:9". It finds
the "9", wraps it to "0", and then moves left in the string to carry
into the tens digit. Unfortunately, that digit doesn't exist, so it
finds the colon instead and complains (a lot!). Is there some way I can
get it to use double digits for these clones, e.g. "evmsclone:00"?
As a workaround, I can add clone_nodes="11" to the cloneset definition,
giving me resources evmsclone:0 to evmsclone:10. This seems to work, but
I'm not totally convinced: for some reason, evmsclone:10 runs in
preference to evmsclone:9, though no amount of stopping and starting
seems to cause problems (yet).
evmsclone:0 (heartbeat::ocf:EvmsSCC): Started xp1tbkec2
evmsclone:1 (heartbeat::ocf:EvmsSCC): Started xp1tbkec1
evmsclone:2 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb1
evmsclone:3 (heartbeat::ocf:EvmsSCC): Started xp1tbkea2
evmsclone:4 (heartbeat::ocf:EvmsSCC): Started xp1tbkea1
evmsclone:5 (heartbeat::ocf:EvmsSCC): Started xp1tdbma1
evmsclone:6 (heartbeat::ocf:EvmsSCC): Started xp1tfrea1
evmsclone:7 (heartbeat::ocf:EvmsSCC): Started xp1tdbma2
evmsclone:8 (heartbeat::ocf:EvmsSCC): Started xp1tfrea2
evmsclone:9 (heartbeat::ocf:EvmsSCC): Stopped
evmsclone:10 (heartbeat::ocf:EvmsSCC): Started xp1tbkeb2
--
Iain.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems