OK. I'll try fence_manual and change clean_start to 1. I'll report the results to you ASAP.
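For reference, this is a rough sketch of the cluster.conf change I intend to try, based on your suggestion (the method and device names below are only placeholders I made up; the fence_daemon line and the fence_manual agent are the parts that come from this thread, so correct me if I've misread the syntax):

    <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
    <clusternodes>
        <clusternode name="node1-hb" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="manual" nodename="node1-hb"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node2-hb" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="manual" nodename="node2-hb"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice agent="fence_manual" name="manual"/>
    </fencedevices>

If I've understood it correctly, whenever a fence is triggered I will have to acknowledge it by hand with fence_ack_manual on the surviving node.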
Thank you for the feedback.

2012/6/20 emmanuel segura <emi2f...@gmail.com>

> Ok Javier
>
> So now I know you don't want fencing, and the reason :-)
>
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>
> and use the fence_manual agent
>
> 2012/6/20 Javier Vela <jvdi...@gmail.com>
>
>> I don't use fencing because with HA-LVM I thought that I didn't need it.
>> But also because both nodes are VMs in VMware. I know that there is a
>> module to do fencing with VMware, but I prefer to avoid it. I'm not in
>> control of the VMware infrastructure and the VMware admins probably
>> won't give me the tools to use that module.
>>
>> Regards, Javi
>>
>>> Fencing is critical, and running a cluster without fencing, even with
>>> qdisk, is not supported. Manual fencing is also not supported. The
>>> *only* way to have a reliable cluster, testing or production, is to use
>>> fencing.
>>>
>>> Why do you not wish to use it?
>>>
>>> On 06/20/2012 09:43 AM, Javier Vela wrote:
>>> > As I have read, if you use HA-LVM you don't need fencing because of
>>> > VG tagging. Is it absolutely mandatory to use fencing with qdisk?
>>> >
>>> > If it is, I suppose I can use manual_fence, but in production I also
>>> > won't use fencing.
>>> >
>>> > Regards, Javi.
>>> >
>>> > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>> > From: emi2f...@gmail.com
>>> > To: linux-cluster@redhat.com
>>> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>>> >
>>> > If you don't want to use a real fence device, because you are only
>>> > running some tests, you have to use the fence_manual agent
>>> >
>>> > 2012/6/20 Javier Vela <jvdi...@gmail.com>
>>> >
>>> > Hi, I have a very strange problem, and after searching through lots
>>> > of forums I haven't found the solution. This is the scenario:
>>> >
>>> > Two-node cluster with Red Hat 5.7, HA-LVM, no fencing and a quorum
>>> > disk. I start qdiskd, cman and rgmanager on one node. After 5
>>> > minutes the fencing finally finishes and the cluster gets quorate
>>> > with 2 votes:
>>> >
>>> > [root@node2 ~]# clustat
>>> > Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>> > Member Status: Quorate
>>> >
>>> > Member Name                     ID   Status
>>> > ------ ----                     ---- ------
>>> > node1-hb                         1   Offline
>>> > node2-hb                         2   Online, Local, rgmanager
>>> > /dev/mapper/vg_qdisk-lv_qdisk    0   Online, Quorum Disk
>>> >
>>> > Service Name        Owner (Last)   State
>>> > ------- ----        ----- ------   -----
>>> > service:postgres    node2          started
>>> >
>>> > Now I start the second node. When cman reaches fencing, it hangs
>>> > for about 5 minutes and finally fails. clustat says:
>>> >
>>> > [root@node1 ~]# clustat
>>> > Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>> > Member Status: Inquorate
>>> >
>>> > Member Name                     ID   Status
>>> > ------ ----                     ---- ------
>>> > node1-hb                         1   Online, Local
>>> > node2-hb                         2   Offline
>>> > /dev/mapper/vg_qdisk-lv_qdisk    0   Offline
>>> >
>>> > And in /var/log/messages I can see these errors:
>>> >
>>> > Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>> > Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
>>> > Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
>>> > Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>> > Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
>>> > Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>>> > Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>> > Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> > Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>>> > Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>>> >
>>> > And the quorum disk:
>>> >
>>> > [root@node2 ~]# mkqdisk -L -d
>>> > mkqdisk v0.6.0
>>> > /dev/mapper/vg_qdisk-lv_qdisk:
>>> > /dev/vg_qdisk/lv_qdisk:
>>> > Magic: eb7a62c2
>>> > Label: cluster_qdisk
>>> > Created: Thu Jun 7 09:23:34 2012
>>> > Host: node1
>>> > Kernel Sector Size: 512
>>> > Recorded Sector Size: 512
>>> >
>>> > Status block for node 1
>>> > Last updated by node 2
>>> > Last updated on Wed Jun 20 06:17:23 2012
>>> > State: Evicted
>>> > Flags: 0000
>>> > Score: 0/0
>>> > Average Cycle speed: 0.000500 seconds
>>> > Last Cycle speed: 0.000000 seconds
>>> > Incarnation: 4fe1a06c4fe1a06c
>>> >
>>> > Status block for node 2
>>> > Last updated by node 2
>>> > Last updated on Wed Jun 20 07:09:38 2012
>>> > State: Master
>>> > Flags: 0000
>>> > Score: 0/0
>>> > Average Cycle speed: 0.001000 seconds
>>> > Last Cycle speed: 0.000000 seconds
>>> > Incarnation: 4fe1a06c4fe1a06c
>>> >
>>> > On the other node I don't see any errors in /var/log/messages. One
>>> > strange thing is that if I start cman on both nodes at the same
>>> > time, everything works fine and both nodes become quorate (until I
>>> > reboot one node and the problem appears again). I've checked that
>>> > multicast is working properly: with iperf I can send and receive
>>> > multicast packets. Moreover, with tcpdump I've seen the packets that
>>> > openais sends while cman is trying to start. I've read about a bug
>>> > in RH 5.3 with the same behaviour, but that was fixed in RH 5.4.
>>> >
>>> > I don't have SELinux enabled, and iptables is also disabled. Here
>>> > is the cluster.conf, simplified (with fewer services and resources).
>>> > I want to point out one thing: I have allow_kill="0" in order to
>>> > avoid fencing errors when the quorum daemon tries to fence a failed
>>> > node. As <fence/> is empty, before adding this setting I got a lot
>>> > of messages in /var/log/messages about failed fencing.
>>> >
>>> > <?xml version="1.0"?>
>>> > <cluster alias="test_cluster" config_version="15" name="test_cluster">
>>> >     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
>>> >     <clusternodes>
>>> >         <clusternode name="node1-hb" nodeid="1" votes="1">
>>> >             <fence/>
>>> >         </clusternode>
>>> >         <clusternode name="node2-hb" nodeid="2" votes="1">
>>> >             <fence/>
>>> >         </clusternode>
>>> >     </clusternodes>
>>> >     <cman two_node="0" expected_votes="3"/>
>>> >     <fencedevices/>
>>> >
>>> >     <rm log_facility="local4" log_level="7">
>>> >         <failoverdomains>
>>> >             <failoverdomain name="etest_cluster_fo" nofailback="1" ordered="1" restricted="1">
>>> >                 <failoverdomainnode name="node1-hb" priority="1"/>
>>> >                 <failoverdomainnode name="node2-hb" priority="2"/>
>>> >             </failoverdomain>
>>> >         </failoverdomains>
>>> >         <resources/>
>>> >         <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
>>> >             <ip address="172.24.119.44" monitor_link="1"/>
>>> >             <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>
>>> >             <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>> >             <script file="/etc/init.d/postgresql" name="postgres"/>
>>> >         </service>
>>> >     </rm>
>>> >     <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
>>> >     <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
>>> >         <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
>>> >     </quorumd>
>>> > </cluster>
>>> >
>>> > The /etc/hosts:
>>> >
>>> > 172.24.119.10   node1
>>> > 172.24.119.34   node2
>>> > 15.15.2.10      node1-hb   node1-hb.localdomain
>>> > 15.15.2.11      node2-hb   node2-hb.localdomain
>>> >
>>> > And the versions:
>>> >
>>> > Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>> > cman-2.0.115-85.el5
>>> > rgmanager-2.0.52-21.el5
>>> > openais-0.80.6-30.el5
>>> >
>>> > I don't know what else I should try, so I would be very grateful for
>>> > any ideas.
>>> >
>>> > Regards, Javi.
>>>
>>> --
>>> Digimer
>>>
>>> Papers and Projects: https://alteeve.com
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
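P.S. Once the clean_start/fence_manual change is in place, I plan to confirm membership and votes on both nodes before and after rebooting node1 with the standard tools (cman_tool and clustat ship with the cman and rgmanager packages listed above, and mkqdisk was already shown):

    cman_tool status      # quorum state, expected votes and total votes
    cman_tool nodes       # cluster membership as cman sees it
    clustat               # rgmanager view, including the quorum disk vote
    mkqdisk -L -d         # qdisk status blocks for node 1 and node 2

I'll include that output when I report back.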
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster