> Date: Tue, 29 Jul 2008 17:27:38 -0400
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: [Linux-HA] resources don't migrate when node is declared dead?!?
>
> My cluster contains 2 active/passive nodes with one drbd master/slave
> resource and one group resource which itself contains 7 resources. I
> want the m/s and group to be colocated, and when the master loses its
> ping the slave should be promoted, but nothing happened when I
> pulled the ethernet cable... Here's what the constraints look like in
> the cib right now:
>
> <constraints>
> <rsc_order id="drbd-before-group_id" from="group_id"
> action="start" to="ms-drbd_id" to_action="promote"/>
This order constraint reads: action=start, from=group_id, type=after,
to_action=promote, to=ms-drbd_id, i.e. group_id is started after
ms-drbd_id is promoted. This is OK.
> <rsc_colocation id="group-on-drbd_id" to="ms-drbd_id"
> to_role="master" from="group_id" score="infinity"/>
A colocation constraint makes resource 'from' run on the same machine as
resource 'to': group_id runs on the same machine as ms-drbd_id in the
master role. This is OK.
> <rsc_location id="drbd_id:connected" rsc="ms-drbd_id">
> <rule role="master" id="drbd_id:connected:rule"
> score_attribute="pingd">
> <expression id="drbd_id:connected-rule-1" attribute="pingd"
> operation="defined"/>
> </rule>
With score_attribute="pingd" and a pingd scaling factor of 100 (your
pingd is started with -m 100), connectivity to one ping node is worth
100, to two nodes 200, and so on. If you have no connectivity, the pingd
attribute is 0, so this rule contributes no score at all.
> </rsc_location>
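Note that this rule only adds a positive preference while pingd is
defined; when connectivity is lost, pingd is set to 0 (as your log shows
with "pingd=0"), the attribute is still defined, and nothing pushes the
master away. A commonly used companion rule (just a sketch, the ids are
made up, not taken from your cib) forbids the master role on a node
without connectivity:

```xml
<rsc_location id="drbd_id:no-connectivity" rsc="ms-drbd_id">
  <!-- -INFINITY keeps the master role off a node where pingd is
       missing or zero; boolean_op="or" covers both cases -->
  <rule id="drbd_id:no-connectivity:rule" role="master"
        score="-INFINITY" boolean_op="or">
    <expression id="drbd_id:no-connectivity:undef"
                attribute="pingd" operation="not_defined"/>
    <expression id="drbd_id:no-connectivity:zero"
                attribute="pingd" operation="lte" value="0" type="number"/>
  </rule>
</rsc_location>
```

With a rule like this, losing the ping node should demote the master
instead of leaving it in place.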
> <rsc_location id="cli-prefer-mysql_id" rsc="mysql_id">
> <rule id="cli-prefer-rule-mysql_id" score="INFINITY">
> <expression id="cli-prefer-expr-mysql_id" attribute="#uname"
> operation="eq" value="feeble-1" type="string"/>
> </rule>
If group_id must run on the same machine as ms-drbd_id, you cannot also
pin mysql_id to feeble-1 with an INFINITY score, because mysql_id
belongs to the group group_id: group_id follows ms-drbd_id, and
ms-drbd_id is supposed to run on the node with the best connectivity.
This location constraint conflicts with that.
> </rsc_location>
> <rsc_location id="cli-prefer-drbd_id:0" rsc="drbd_id:0">
> <rule id="cli-prefer-rule-drbd_id:0" score="INFINITY">
> <expression id="cli-prefer-expr-drbd_id:0"
> attribute="#uname" operation="eq" value="feeble-0" type="string"/>
> </rule>
> </rsc_location>
> <rsc_location id="cli-prefer-drbd_id:1" rsc="drbd_id:1">
> <rule id="cli-prefer-rule-drbd_id:1" score="INFINITY">
> <expression id="cli-prefer-expr-drbd_id:1"
> attribute="#uname" operation="eq" value="feeble-0" type="string"/>
> </rule>
> </rsc_location>
> </constraints>
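The cli-prefer-* constraints above look like leftovers from
"crm_resource --migrate" (that is what creates constraints with the
cli-prefer- prefix). Assuming that is where they came from, something
like the following should clear them (a sketch; check the ids against
your own cib first):

```shell
# Un-migrate removes the cli-prefer constraint for the resource:
crm_resource -U -r mysql_id
# Or delete the constraints from the cib directly:
cibadmin -D -X '<rsc_location id="cli-prefer-drbd_id:0"/>'
cibadmin -D -X '<rsc_location id="cli-prefer-drbd_id:1"/>'
```

Once those INFINITY location constraints are gone, the pingd-based rule
can actually decide where the master runs.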
>
>
> heartbeat ha.cf config file:
>
> mcast eth0 239.0.0.1 694 1 0
> bcast eth1
> deadping 20
> deadtime 10
> ping 132.206.178.1
> baud 115200
> serial /dev/ttyS0
> node feeble-0 feeble-1
> auto_failback off
> use_logd on
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
> respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
> apiauth mgmtd uid=root
> respawn root /usr/lib/heartbeat/mgmtd -v
>
>
> After reconnecting I see in the ha.log:
>
>
> heartbeat[2371]: 2008/07/29_16:10:12 info: Link 132.206.178.1:132.206.178.1
> dead.
> pingd[2523]: 2008/07/29_16:10:12 notice: pingd_lstatus_callback: Status
> update: Ping node 132.206.178.1 now has status [dead]
> pingd[2523]: 2008/07/29_16:10:12 notice: pingd_nstatus_callback: Status
> update: Ping node 132.206.178.1 now has status [dead]
> pingd[2523]: 2008/07/29_16:10:12 info: send_update: 0 active ping nodes
> heartbeat[2371]: 2008/07/29_16:10:12 info: Link feeble-0:eth0 dead.
> pingd[2523]: 2008/07/29_16:10:12 notice: pingd_lstatus_callback: Status
> update: Ping node feeble-0 now has status [dead]
> pingd[2523]: 2008/07/29_16:10:12 notice: pingd_nstatus_callback: Status
> update: Ping node feeble-0 now has status [dead]
> attrd[2529]: 2008/07/29_16:10:17 info: attrd_trigger_update: Sending flush op
> to all hosts for: pingd
> attrd[2529]: 2008/07/29_16:10:17 info: attrd_ha_callback: flush message from
> feeble-1
> attrd[2529]: 2008/07/29_16:10:17 info: attrd_perform_update: Sent update 13:
> pingd=0
> tengine[2543]: 2008/07/29_16:10:17 info: extract_event: Aborting on
> transient_attributes changes for d7fb07f0-a857-446d-98e6-fce91c1b6094
> tengine[2543]: 2008/07/29_16:10:17 info: update_abort_priority: Abort
> priority upgraded to 1000000
> tengine[2543]: 2008/07/29_16:10:17 info: te_update_diff: Aborting on
> transient_attributes deletions
> crmd[2530]: 2008/07/29_16:10:17 info: do_state_transition: State transition
> S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE
> origin=route_message ]
> crmd[2530]: 2008/07/29_16:10:17 info: do_state_transition: All 2 cluster
> nodes are eligible to run resources.
> pengine[2544]: 2008/07/29_16:10:17 info: determine_online_status: Node
> feeble-1 is online
> pengine[2544]: 2008/07/29_16:10:17 info: determine_online_status: Node
> feeble-0 is online
> pengine[2544]: 2008/07/29_16:10:17 info: unpack_find_resource: Internally
> renamed drbd_id:0 on feeble-0 to drbd_id:1
> pengine[2544]: 2008/07/29_16:10:17 notice: clone_print: Master/Slave Set:
> ms-drbd_id
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: drbd_id:0
> (ocf::heartbeat:drbd): Master feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: drbd_id:1
> (ocf::heartbeat:drbd): Slave feeble-0
> pengine[2544]: 2008/07/29_16:10:17 notice: group_print: Resource Group:
> group_id
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: fs_id
> (ocf::heartbeat:Filesystem): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: nfs_kernel-id
> (ocf::bic:nfs-kernel): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: nfs_common-id
> (ocf::bic:nfs-common): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: ip_id
> (ocf::heartbeat:IPaddr): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: mysql_id
> (ocf::heartbeat:mysql): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: apache_id
> (ocf::heartbeat:apache): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 notice: native_print: email_id
> (ocf::heartbeat:MailTo): Started feeble-1
> pengine[2544]: 2008/07/29_16:10:17 info: master_color: Promoting drbd_id:0
> (Master feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 info: master_color: ms-drbd_id: Promoted 1
> instances of a possible 1 to master
> pengine[2544]: 2008/07/29_16:10:17 info: master_color: ms-drbd_id: Promoted 1
> instances of a possible 1 to master
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> drbd_id:0 (Master feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> drbd_id:1 (Slave feeble-0)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> drbd_id:0 (Master feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> drbd_id:1 (Slave feeble-0)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource fs_id
> (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> nfs_kernel-id (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> nfs_common-id (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource ip_id
> (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> mysql_id (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> apache_id (Started feeble-1)
> pengine[2544]: 2008/07/29_16:10:17 notice: NoRoleChange: Leave resource
> email_id (Started feeble-1)
> crmd[2530]: 2008/07/29_16:10:17 info: do_state_transition: State transition
> S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=route_messag
> e ]
> tengine[2543]: 2008/07/29_16:10:17 info: process_te_message: Processing graph
> derived from /var/lib/heartbeat/pengine/pe-input-7.bz2
> tengine[2543]: 2008/07/29_16:10:17 info: unpack_graph: Unpacked transition
> 29: 0 actions in 0 synapses
> tengine[2543]: 2008/07/29_16:10:17 info: run_graph: Transition 29:
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)
> tengine[2543]: 2008/07/29_16:10:17 info: notify_crmd: Transition 29 status:
> te_complete - <null>
> crmd[2530]: 2008/07/29_16:10:17 info: do_state_transition: State transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE
> origin=route_message ]
> pengine[2544]: 2008/07/29_16:10:17 info: process_pe_message: Transition 29:
> PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-7.bz2
> heartbeat[2371]: 2008/07/29_16:10:22 WARN: node 132.206.178.1: is dead
> crmd[2530]: 2008/07/29_16:10:22 notice: crmd_ha_status_callback: Status
> update: Node 132.206.178.1 now has status [dead]
> pingd[2523]: 2008/07/29_16:10:22 notice: pingd_nstatus_callback: Status
> update: Ping node 132.206.178.1 now has status [dead]
> pingd[2523]: 2008/07/29_16:10:22 info: send_update: 0 active ping nodes
> crmd[2530]: 2008/07/29_16:10:22 WARN: get_uuid: Could not calculate UUID for
> 132.206.178.1
> crmd[2530]: 2008/07/29_16:10:22 info: crm_update_peer: Creating entry for
> node 132.206.178.1/0/0
> crmd[2530]: 2008/07/29_16:10:22 ERROR: crm_abort: crm_update_peer: Triggered
> assert at membership.c:161 : uuid != NULL
> crmd[2530]: 2008/07/29_16:10:22 ERROR: crm_abort: crm_update_peer_proc:
> Triggered assert at membership.c:263 : node != NULL
> crmd[2530]: 2008/07/29_16:10:22 WARN: get_uuid: Could not calculate UUID for
> 132.206.178.1
> heartbeat[2371]: 2008/07/29_16:14:50 info: Link feeble-0:eth0 up.
> pingd[2523]: 2008/07/29_16:14:50 notice: pingd_lstatus_callback: Status
> update: Ping node feeble-0 now has status [up]
> pingd[2523]: 2008/07/29_16:14:50 notice: pingd_nstatus_callback: Status
> update: Ping node feeble-0 now has status [up]
> heartbeat[2371]: 2008/07/29_16:14:51 info: Link 132.206.178.1:132.206.178.1
> up.
> heartbeat[2371]: 2008/07/29_16:14:51 WARN: Late heartbeat: Node
> 132.206.178.1: interval 290020 ms
> heartbeat[2371]: 2008/07/29_16:14:51 info: Status update for node
> 132.206.178.1: status ping
> crmd[2530]: 2008/07/29_16:14:51 notice: crmd_ha_status_callback: Status
> update: Node 132.206.178.1 now has status [ping]
> pingd[2523]: 2008/07/29_16:14:51 notice: pingd_lstatus_callback: Status
> update: Ping node 132.206.178.1 now has status [up]
> pingd[2523]: 2008/07/29_16:14:51 notice: pingd_nstatus_callback: Status
> update: Ping node 132.206.178.1 now has status [up]
>
>
> Must be something obvious but I can't find it.
>
> There used to be a neat script to calculate scores and stickiness but
> I upgraded heartbeat to 2.1.3-18 with the deb packages on opensuse.org
> and with pacemaker-0.6.5-1 and now showscores.sh gives me
>
> ~# showscores.sh
> Resource Score Node Stickiness #Fail
> Fail-Stickiness
> -1000000_(master) -INFINITY ptest[11262]: 100 -1001
> 1000000_(master) INFINITY ptest[11262]: 100 -1001
> 100_(master) 100 ptest[11262]: 100 -1001
> 175_(master) 175 ptest[11262]: 100 -1001
> 76_(master) 76 ptest[11262]: 100 -1001
I don't understand this output: where are the resource and node names?
>
> Any ideas?
Before pulling the ethernet cable, try:
cibadmin -Q -o nodes
Can you see all the nodes of your cluster? Today in my cluster only one
of the two nodes was recognized.
I don't think I have helped a lot, since I don't have much experience
with the cib.xml either, but I hope the comments are at least useful to
you.
>
> regards,
> jf
> --
> <° ><
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems