Hi,

I've been happily running a cluster of eight SLES10 machines on the 
standard SLES10 Service Pack 1 heartbeat-2.0.8-0.19 RPMs.  But after 
adding two more machines, I'm now running into problems with clone 
resources.  (I get the same behaviour with the latest 
heartbeat-2.1.2-18.1 from the openSUSE Build Service.)

The simplest testcase I can manage is to prepare a basic cluster of 10 
nodes, add evms to ha.cf,

  apiauth evms uid=root gid=haclient 
  respawn root /sbin/evmsd

then inject this clone definition into the CIB:

   <clone id="evmscloneset" notify="true" globally_unique="false">
     <instance_attributes id="evmscloneset"> 
       <attributes>
         <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
       </attributes>
     </instance_attributes>
     <primitive id="evmsclone" class="ocf" type="EvmsSCC" 
provider="heartbeat"/> 
   </clone>
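
(For reference, this can be injected with something along the lines of

    cibadmin -C -o resources -x evmsclone.xml

where evmsclone.xml is just the snippet above saved to a file.)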

As expected, in no time at all, the ten evmsclone instances appear and 
start, one on each machine:

    evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec2
    evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec1 
    evmsclone:2 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb1
    evmsclone:3 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea2
    evmsclone:4 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea1
    evmsclone:5 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma1 
    evmsclone:6 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea1
    evmsclone:7 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma2
    evmsclone:8 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea2
    evmsclone:9 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb2 

But if I now bounce one node (or stop and start heartbeat on it), then 
as soon as it starts to come up again, I get this in the log from the DC:

tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action 
evmsclone:2_start_0 (32) confirmed on xp1tbkeb2 (rc=0) 
tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action 
34 fired and confirmed
tengine[7406]: 2007/11/08_10:04:46 info: te_pseudo_action: Pseudo action 
37 fired and confirmed
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 48: evmsclone:0_post_notify_start_0 on xp1tbkec2 
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 52: evmsclone:1_post_notify_start_0 on xp1tbkec1
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 56: evmsclone:3_post_notify_start_0 on xp1tbkeb1 
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 60: evmsclone:4_post_notify_start_0 on xp1tbkea2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 64: evmsclone:5_post_notify_start_0 on xp1tbkea1 
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 68: evmsclone:6_post_notify_start_0 on xp1tdbma2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 72: evmsclone:7_post_notify_start_0 on xp1tdbma1 
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 76: evmsclone:8_post_notify_start_0 on xp1tfrea2
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 80: evmsclone:9_post_notify_start_0 on xp1tfrea1 
tengine[7406]: 2007/11/08_10:04:46 info: send_rsc_command: Initiating 
action 83: evmsclone:2_post_notify_start_0 on xp1tbkeb2
crmd[7374]: 2007/11/08_10:04:46 info: do_lrm_rsc_op: Performing 
op=evmsclone:9_notify_0 key=80:10:823cef09-7dbd-4819-baa8-f91910dd8f35) 
lrmd[7371]: 2007/11/08_10:04:46 info: rsc:evmsclone:9: notify
crmd[7374]: 2007/11/08_10:04:46 info: process_lrm_event: LRM operation 
evmsclone:9_notify_0 (call=8, rc=0) complete
tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action 
evmsclone:9_post_notify_start_0 (80) confirmed on xp1tfrea1 (rc=0) 
tengine[7406]: 2007/11/08_10:04:46 info: match_graph_event: Action 
evmsclone:1_post_notify_start_0 (52) confirmed on xp1tbkec1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:4_post_notify_start_0 (60) confirmed on xp1tbkea2 (rc=0) 
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:0_post_notify_start_0 (48) confirmed on xp1tbkec2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:8_post_notify_start_0 (76) confirmed on xp1tfrea2 (rc=0) 
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:3_post_notify_start_0 (56) confirmed on xp1tbkeb1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:7_post_notify_start_0 (72) confirmed on xp1tdbma1 (rc=0) 
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:5_post_notify_start_0 (64) confirmed on xp1tbkea1 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:6_post_notify_start_0 (68) confirmed on xp1tdbma2 (rc=0) 
tengine[7406]: 2007/11/08_10:04:47 info: match_graph_event: Action 
evmsclone:2_post_notify_start_0 (83) confirmed on xp1tbkeb2 (rc=0)
tengine[7406]: 2007/11/08_10:04:47 info: te_pseudo_action: Pseudo action 
38 fired and confirmed 
tengine[7406]: 2007/11/08_10:04:47 info: run_graph: Transition 10: 
(Complete=29, Pending=0, Fired=0, Skipped=0, Incomplete=0)
crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_IPC_MESSAGE origin=route_message ] 
crmd[7374]: 2007/11/08_10:04:47 info: do_state_transition: All 10 cluster 
nodes are eligible to run resources.
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tdbma2 is online
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkec1 is online 
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkec2 is online
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tfrea1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tfrea1 to evmsclone:2 
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkeb1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tbkeb1 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tfrea2 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tfrea2 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkeb2 is online 
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tbkeb2 to evmsclone:2
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkea1 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tbkea1 to evmsclone:4
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tbkea2 is online
pengine[7407]: 2007/11/08_10:04:47 info: unpack_find_resource: Internally 
renamed evmsclone:0 on xp1tbkea2 to evmsclone:7 
pengine[7407]: 2007/11/08_10:04:47 info: determine_online_status: Node 
xp1tdbma1 is online
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9) 
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9) 
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9) 
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)
pengine[7407]: 2007/11/08_10:04:47 ERROR: increment_clone: Unexpected 
char: : (9)

The last ERROR line then repeats forever (until I kill things and clean 
the orphaned resources out of the CIB). 

I think I can sort of see what's going wrong: for whatever reason, 
something wants to increment the clone number in "evmsclone:9", finds 
that the last character is "9", wraps it to "0", and then moves left in 
the string to increment the tens digit.  Unfortunately, that digit 
doesn't exist, so it finds the colon instead and complains (a lot!).  Is 
there some way I can get it to use double-digit clone numbers, e.g. 
"evmsclone:00"?
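
For illustration, here is a minimal C sketch of the per-character carry 
I'm imagining.  This is purely my guess at the logic, not the actual 
pengine source, but note that the colon sits at offset 9 in 
"evmsclone:9", which would match the "(9)" in the errors above:

  #include <stdio.h>
  #include <string.h>

  /* Guessed logic only: bump the numeric suffix of a clone name such
   * as "evmsclone:9" by incrementing characters right-to-left. */
  static void increment_clone_name(char *name)
  {
      int i = (int)strlen(name) - 1;

      while (i >= 0) {
          if (name[i] >= '0' && name[i] < '9') {
              name[i]++;           /* 0..8: just bump it, done */
              return;
          } else if (name[i] == '9') {
              name[i] = '0';       /* wrap this digit ...          */
              i--;                 /* ... and carry one place left */
          } else {
              /* with a single-digit suffix, the carry lands on ':' */
              fprintf(stderr, "ERROR: increment_clone: "
                      "Unexpected char: %c (%d)\n", name[i], i);
              return;
          }
      }
  }

  int main(void)
  {
      char name[] = "evmsclone:9";
      increment_clone_name(name);  /* prints the "Unexpected char" error */
      printf("%s\n", name);        /* left mangled as "evmsclone:0"      */
      return 0;
  }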

As a workaround, I can add clone_max="11" to the cloneset definition, 
giving me resources evmsclone:0 to evmsclone:10.  That seems to work, 
but I'm not totally convinced: for some reason, evmsclone:10 runs in 
preference to evmsclone:9, although no amount of stopping/starting seems 
to cause problems (yet). 
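
That is, the attributes block gains an extra nvpair, something like this 
(the nvpair id is my own choice):

       <attributes>
         <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
         <nvpair id="evmscloneset-02" name="clone_max" value="11"/>
       </attributes>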

    evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec2
    evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started xp1tbkec1
    evmsclone:2 (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb1
    evmsclone:3 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea2 
    evmsclone:4 (heartbeat::ocf:EvmsSCC):       Started xp1tbkea1
    evmsclone:5 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma1
    evmsclone:6 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea1
    evmsclone:7 (heartbeat::ocf:EvmsSCC):       Started xp1tdbma2 
    evmsclone:8 (heartbeat::ocf:EvmsSCC):       Started xp1tfrea2
    evmsclone:9 (heartbeat::ocf:EvmsSCC):       Stopped 
    evmsclone:10        (heartbeat::ocf:EvmsSCC):       Started xp1tbkeb2


-- 
Iain. 
