Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Sep 06, 2007 at 06:47:16PM +0200, FG wrote:
>   
>> Hi,
>>
>> I use heartbeat 2.1.1 in an active/passive configuration.
>>
>> I'm testing different failover scenarios and need some explanations:
>>
>> My nodes are castor (active) and pollux (standby).
>>
>> I'm testing process failover with monitoring. My configuration uses
>> default_stickiness = "200", default_failure_stickiness = "-200", and a
>> rsc_location constraint on castor with a score of "200".
>> With these options, I can have 5 process failures before all services
>> fail over to pollux.
>>
>> It works like a charm... :-)
>>
>> The score on castor decreases from 1000 (4 resources x 200 + the
>> constraint score of 200) to 0, and with the sixth failure, failover occurs.
>> The scores after failover are: castor (-1000) and pollux (800).
>> [EMAIL PROTECTED] crm]# ptest -L -VVVVVVVVVVVVVVVVVVVVV 2>&1|grep assign
>> ptest[31985]: 2007/09/06_15:57:25 debug: debug5: do_calculations: assign
>> nodes to colors
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> IPaddr_147_210_36_7, Node[0] pollux: 800
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> IPaddr_147_210_36_7, Node[1] castor: -1000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> pollux to IPaddr_147_210_36_7
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> Filesystem_2, Node[0] pollux: 1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> Filesystem_2, Node[1] castor: -1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> pollux to Filesystem_2
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> cyrus-imapd_3, Node[0] pollux: 1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> cyrus-imapd_3, Node[1] castor: -1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> pollux to cyrus-imapd_3
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> saslauthd_4, Node[0] pollux: 1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> saslauthd_4, Node[1] castor: -1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> pollux to saslauthd_4
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> pingd-child:0, Node[0] castor: 1
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> pingd-child:0, Node[1] pollux: 0
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> castor to pingd-child:0
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> pingd-child:1, Node[0] pollux: 1
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Color
>> pingd-child:1, Node[1] castor: -1000000
>> ptest[31985]: 2007/09/06_15:57:25 debug: native_assign_node: Assigning
>> pollux to pingd-child:1
>>
>> Now, to test, I unplugged the network card on pollux. I then expected a
>> new failover back to the first node (castor), but nothing happened...
>> So I watched my scores and my logs:
>>
>> [EMAIL PROTECTED] crm]# ptest -L -VVVVVVVVVVVVVVVVVVVVV 2>&1|grep assign
>> ptest[32467]: 2007/09/06_16:17:11 debug: debug5: do_calculations: assign
>> nodes to colors
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> IPaddr_147_210_36_7, Node[0] castor: -1000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> IPaddr_147_210_36_7, Node[1] pollux: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
>> for resource IPaddr_147_210_36_7 are unavailable, unclean or shutting down
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> Filesystem_2, Node[0] castor: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> Filesystem_2, Node[1] pollux: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
>> for resource Filesystem_2 are unavailable, unclean or shutting down
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> cyrus-imapd_3, Node[0] castor: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> cyrus-imapd_3, Node[1] pollux: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
>> for resource cyrus-imapd_3 are unavailable, unclean or shutting down
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> saslauthd_4, Node[0] castor: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> saslauthd_4, Node[1] pollux: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: All nodes
>> for resource saslauthd_4 are unavailable, unclean or shutting down
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> pingd-child:0, Node[0] castor: 1
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> pingd-child:0, Node[1] pollux: 0
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Assigning
>> castor to pingd-child:0
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> pingd-child:1, Node[0] pollux: 1
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Color
>> pingd-child:1, Node[1] castor: -1000000
>> ptest[32467]: 2007/09/06_16:17:12 debug: native_assign_node: Assigning
>> pollux to pingd-child:1
>>
>> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
>> IPaddr_147_210_36_7 cannot run anywhere
>> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
>> Filesystem_2 cannot run anywhere
>> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
>> cyrus-imapd_3 cannot run anywhere
>> pengine[20890]: 2007/09/06_16:00:23 WARN: native_color: Resource
>> saslauthd_4 cannot run anywhere
>>
>> Could someone explain to me what's happening? Is that split-brain?
>>     
>
> Yes, it is.
>
>   
>> Because pingd failed, and my rule sets score="-INFINITY", I think the
>> scores on pollux are logical, aren't they? And in the end we have the
>> same score for the resources on both nodes.
>> How can I avoid this behavior?
>>     
>
> The cluster won't try to run the resources on a node which has a
> negative score, i.e. one on which the resource failed too many
> times. That seems to be your case. Try resetting the failcount and
> see if that helps.
>
> Thanks.
>
> Dejan
>
>   
OK, I see, so I tried:

# crm_failcount -D -U castor -r theprocess

and the new scores are: pollux -> 800, castor -> 200 (the old score of
-1000, plus 1200, the accumulated failure_stickiness of theprocess that
was cleared)
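For what it's worth, a small arithmetic sketch (Python; the variable names are my own, not anything read from the CIB) of why the score lands on 200 after the failcount reset:

```python
# Sketch of the score arithmetic after clearing the failcount with
# "crm_failcount -D" (values assumed from this thread, not a live cluster).
failure_stickiness = -200   # default_failure_stickiness
failures_cleared = 6        # failcount accumulated before the failover

old_castor_score = -1000    # score reported before the reset
# Deleting the failcount removes the accumulated failure penalty:
new_castor_score = old_castor_score - failures_cleared * failure_stickiness
print(new_castor_score)     # 200, matching the score quoted above
```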

Then I unplugged eth0 on pollux, and I got a new failover to castor
(scores: castor (1000), pollux (-INF)).
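As a sanity check, here is a minimal toy model (my own sketch, with assumed names; not heartbeat's actual policy engine code) that reproduces every score quoted in this thread:

```python
# Toy model of a node's placement score, using the values from this thread.
STICKINESS = 200           # default_stickiness, per resource running on the node
FAILURE_STICKINESS = -200  # default_failure_stickiness, per recorded failure
CONSTRAINT = 200           # rsc_location preference score for castor

def node_score(running_resources, failures, has_constraint):
    score = running_resources * STICKINESS + failures * FAILURE_STICKINESS
    if has_constraint:
        score += CONSTRAINT
    return score

# Steady state: castor runs 4 resources, no failures.
print(node_score(4, 0, True))   # 1000
# After the 6th failure forces a failover, castor runs nothing.
print(node_score(0, 6, True))   # -1000
# pollux now runs the 4 resources and has no location constraint.
print(node_score(4, 0, False))  # 800
# After the failcount reset and pollux's eth0 failure, castor again runs all 4.
print(node_score(4, 0, True))   # 1000
```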

Thanks a lot for your explanations.

Fabrice


>> I attach my settings (cibadmin -Q in a normal state); would you please
>> help me verify it?
>>
>> Thanks, regards
>>
>> Fabrice
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>     
