Hi,

I added a heuristic that checks network status and helps in network failure
scenarios.
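
A minimal sketch of that kind of heuristic (not my exact script, and the
gateway address is only a placeholder) would be:

#!/bin/sh
# Network heuristic for qdiskd: exit 0 (pass) only if the default
# gateway answers a ping. 192.168.1.1 is a placeholder address.
ping -c 1 -w 2 192.168.1.1 >/dev/null 2>&1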

However, I still face the same problem as soon as I stop the services in an
orderly way on the node holding the qdisk master role, or reboot it.

If I execute the following on the qdisk master node:

# service rgmanager stop
# service clvmd stop
# service qdiskd stop
# service cman stop

As Red Hat describes, quorum is lost on the other node until it takes over the
master role (a few seconds later), and the services are stopped.

I'm working around that by adding a sleep after stopping qdiskd, long enough
for the other node to become master, before stopping cman (see the sketch
below).
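
Roughly like this (the 30 second value is only an example of "long enough for
the other node to become master", not a tuned number):

service rgmanager stop
service clvmd stop
service qdiskd stop
# give the surviving node time to take over the qdisk master role;
# 30 seconds is only an example value, not a tuned number
sleep 30
service cman stop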

I understand this is a bug.

My cluster.conf file:

<?xml version="1.0"?>
<cluster alias="clueng" config_version="13" name="clueng">
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="10"/>
        <clusternodes>
                <clusternode name="rmamseslab05" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="iLO_NODE1"/>
                                </method>
                                <method name="2">
                                        <device name="manual_fencing" 
nodename="rmamseslab05"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rmamseslab07" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="iLO_NODE2"/>
                                </method>
                                <method name="2">
                                        <device name="manual_fencing" 
nodename="rmamseslab07"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <totem token="45000"/>
        <quorumd device="/dev/mapper/mpathquorump1" interval="5" status_file="/tmp/qdisk" tko="3" votes="1">
                <heuristic program="/usr/local/cmcluster/conf/admin/test_hb.sh" score="1" interval="3"/>
        </quorumd>
        <fencedevices>
                <fencedevice agent="fence_manual" name="manual_fencing"/>
                <fencedevice agent="fence_ilo" hostname="rbrmamseslab05" 
login="LANO" name="iLO_NODE1" passwd="**"/>
                <fencedevice agent="fence_ilo" hostname="rbrmamseslab07" 
login="LANO" name="iLO_NODE2" passwd="**"/>
        </fencedevices>
        <rm>
                <!-- Configuration of the resource group manager -->
                <failoverdomains>
                </failoverdomains>
                <service autostart="1" exclusive="0" max_restarts="1" name="pkg_test" recovery="restart" restart_expire_time="900">
                        <script file="/etc/cluster/pkg_test/startstop.sh" name="pkg_test"/>
                </service>
                <resources>
                    <nfsexport name="nfs_export"/>
                </resources>
        </rm>
</cluster>

Best regards,

Alfredo


________________________________
From: [email protected] 
[mailto:[email protected]] On Behalf Of Juan Ramon Martin Blanco
Sent: Tuesday, July 07, 2009 12:21 PM
To: linux clustering
Subject: Re: [Linux-cluster] cman + qdisk timeouts....


On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo 
<[email protected]> wrote:

Hi,



I'm having what I think is a timeout issue in my cluster.



I have a two-node cluster using qdisk. Every time the node holding the qdisk 
master role goes down (because of a failure, or even just by stopping qdiskd 
manually), the packages on the healthy node are stopped for lack of quorum, 
because qdiskd becomes unresponsive until the second node takes over as master 
and starts working properly. Once qdiskd is working again (usually 5-6 
seconds), the packages are started again.



I've read the cluster manual section on the "CMAN membership timeout value" and 
I think this is my case. I'm using RHEL 5.3, and I understood that this 
parameter is the totem token, which I have set much longer than needed:



<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">

        <totem token="50000"/>

...



        <quorumd device="/dev/mapper/mpathquorump1" interval="3" status_file="/tmp/qdisk" tko="3" votes="5" log_level="7" log_facility="local4"/>
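
If I read the qdisk documentation correctly, the qdisk membership timeout is 
roughly interval * tko, so with these values (assuming that is the right way to 
compute it) the numbers work out as:

    qdisk membership timeout ~ interval * tko = 3 s * 3 = 9 s
    totem token              = 50000 ms = 50 s   (well over 2 * 9 s)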





The totem token is well over double the qdisk timeout, so I guess it should be 
enough, but every time qdisk dies on the master node I get the same result, 
with services restarted on the healthy node:



Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (2/3)
Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (3/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (4/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master
Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing /etc/init.d/watchdog status
Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (5/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (6/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role

Message from sysl...@rmamseslab07 at Jun 15 16:11:53 ...
 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved

Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change Event
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Cluster_test_2
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab05-ic
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab07-ic
Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Logical volume 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (7/3)
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction notice for node 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill the node
Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity



I've just logged a support case, but... any ideas?



Regards,

Hi!

Have you set two_node="0" in the cman section?
Why don't you use any heuristics within the quorumd configuration? E.g., 
pinging a router...
Could you paste us your cluster.conf?

Greetings,
Juanra






Alfredo Moralejo
Business Platforms Engineering - OS Servers - UNIX Senior Specialist

F. Hoffmann-La Roche Ltd.

Global Informatics Group Infrastructure
Josefa Valcárcel, 40
28027 Madrid SPAIN

Phone: +34 91 305 97 87

[email protected]<mailto:[email protected]>




--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
