Sorry about the partial post; my web browser did that for me.

On Thu, Aug 27, 2009 at 8:58 AM, Alan A <[email protected]> wrote:
> I decided this morning to start checking packages/versions first. Here are
> some details about the system thus far:
>
> CONF:
> <?xml version="1.0" ?>
> <cluster alias="mrcluster" config_version="2" name="mrcluster">
>   <fence_daemon post_fail_delay="0" post_join_delay="30"/>
>   <clusternodes>
>     <clusternode name="clxmrcati12.xxxxxx.com" nodeid="1" votes="1">
>       <fence>
>         <method name="1">
>           <device name="apcps05" option="off" port="3" switch="3"/>
>           <device name="apcps06" option="off" port="3" switch="3"/>
>           <device name="apcps05" option="on" port="3" switch="3"/>
>           <device name="apcps06" option="on" port="3" switch="3"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="clxmrcati11.xxxxxx.com" nodeid="2" votes="1">
>       <fence>
>         <method name="1">
>           <device name="apcps05" option="off" port="4" switch="4"/>
>           <device name="apcps06" option="off" port="4" switch="4"/>
>           <device name="apcps05" option="on" port="4" switch="4"/>
>           <device name="apcps06" option="on" port="4" switch="4"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="clxmrweb20.xxxxxx.com" nodeid="3" votes="1">
>       <fence>
>         <method name="1">
>           <device name="apcps05" option="off" port="2" switch="2"/>
>           <device name="apcps06" option="off" port="2" switch="2"/>
>           <device name="apcps05" option="on" port="2" switch="2"/>
>           <device name="apcps06" option="on" port="2" switch="2"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman/>
>   <fencedevices>
>     <fencedevice agent="fence_apc" ipaddr="172.XX.XX.27" login="apc" name="apcps05" passwd="xxx"/>
>     <fencedevice agent="fence_apc" ipaddr="172.XX.XX..28" login="apc" name="apcps06" passwd="xxx"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
>
> -------------------------------------------------------------------------------------------
> Host Files:
>
> From Luci node clxmrcati11:
> 127.0.0.1       localhost.localdomain localhost
> 172.XX.XX.18    clxmrcati11.xxxxxx.com clxmrcati11
> 172.XX.XX.19    clxmrcati12.xxxxxx.com clxmrcati12
> 172.XX.XX.20    clxmrrpt10.xxxxxx.com clxmrrpt10
> 172.XX.XX.21    clxmrweb20.xxxxxx.com clxmrweb20
>
> From ricci node clxmrcati12:
> 127.0.0.1       localhost.localdomain localhost
> 172.XX.XX.19    clxmrcati12.xxxxxx.com clxmrcati12
> 172.XX.XX.21    clxmrweb20.xxxxxx.com clxmrweb20
> 172.XX.XX.20    clxmrrpt10.xxxxxx.com clxmrrpt10
> 172.XX.XX.18    clxmrcati11.xxxxxx.com clxmrcati11
>
> From ricci node clxmrweb20:
> 127.0.0.1       localhost.localdomain localhost
> 172.XX.XX.21    clxmrweb20.xxxxxx.com clxmrweb20
> 172.XX.XX.20    clxmrrpt10.xxxxxx.com clxmrrpt10
> 172.XX.XX.18    clxmrcati11.xxxxxx.com clxmrcati11
> 172.XX.XX.19    clxmrcati12.xxxxxx.com clxmrcati12
>
> Mostly this in /var/log/messages:
> Aug 25 09:36:12 fenclxmrcati11 dlm_controld[2267]: connect to ccs error -111, check ccsd or cluster status
> Aug 25 09:36:12 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:12 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:12 fenclxmrcati11 gfs_controld[2273]: connect to ccs error -111, check ccsd or cluster status
> Aug 25 09:36:12 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:12 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:13 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:14 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:14 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection refused
> Aug 25 09:36:14 fenclxmrcati11 ccsd[3758]: Cluster is not quorate.  Refusing connection.
> Aug 25 09:36:14 fenclxmrcati11 ccsd[3758]: Error while processing connect: Connection re
>
>
> On Thu, Aug 27, 2009 at 3:27 AM, Jakov Sosic <[email protected]> wrote:
>
>> On Wed, 26 Aug 2009 18:36:26 -0500
>> Alan A <[email protected]> wrote:
>>
>> > I have tried almost everything at this point to try and troubleshoot
>> > this further. I can't create a new cluster with luci.
>> >
>> > I broke and tried to reconfigure the 3-node cluster at least 6 times.
>> >
>> > I have noticed nodes taking exceptionally long to initialize
>> > fencing upon cman start. I tried with defined and undefined fencing;
>> > the amount of time needed is still the same. Even after the fencing
>> > is overcome, in /var/log/messages nodes refuse to join the cluster due
>> > to the state of 'not in quorum' during the joining process. I upped
>> > post_join_delay to as much as 150 but the result is the same.
>> >
>> > Fencing - I use APC PW switches - I can log in to the APC PWS from the
>> > node, I can even fence the other node, but when cman is started it
>> > looks like it is almost timing out on starting fencing.
>> >
>> > If I issue cman_tool nodes it gives me the local node name as the
>> > member of the cluster and the other two with state 'X'. If I try
>> > cman_tool join clustername - it tells me the nodes are already in
>> > that cluster, but the cluster as a whole does not register. Each node
>> > thinks it's the only working member of the cluster.
>> >
>> >
>> > Any pointers?
>>
>> Looks like a network issue to me.
>>
>> Are you sure your network is operational in the sense of multicast /
>> IGMP? Try forcing IGMP v1 in sysctl.conf - and if you have Cisco
>> equipment take a look at the openais FAQ (mode sparse-dense).
>>
>>
>> --
>> |    Jakov Sosic    |    ICQ: 28410271    |   PGP: 0x965CAE2D   |
>> =================================================================
>> | start fighting cancer -> http://www.worldcommunitygrid.org/   |
>>
>> --
>> Linux-cluster mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> --
> Alan A.
>

--
Alan A.
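
A side note on the "Cluster is not quorate" messages: with three nodes at votes="1" each, a node that sees only itself cannot reach a majority, which would fit each node believing it is the only member. Below is a minimal sketch of that arithmetic, assuming the usual simple-majority rule (quorum = expected votes // 2 + 1). The function names are hypothetical and are not part of cman or any cluster tool.

```python
# Illustrative only: majority-quorum arithmetic for the three-node cluster
# in the cluster.conf above. The function names are hypothetical helpers,
# not part of cman, ccsd or any other cluster tool.

def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a simple-majority rule."""
    return expected_votes // 2 + 1


def is_quorate(visible_votes: int, expected_votes: int) -> bool:
    """True when the votes a node can see reach the majority threshold."""
    return visible_votes >= quorum_threshold(expected_votes)


# Votes as configured in cluster.conf: three nodes, votes="1" each.
node_votes = {
    "clxmrcati12": 1,
    "clxmrcati11": 1,
    "clxmrweb20": 1,
}
expected = sum(node_votes.values())

# A node that sees only itself holds 1 of the 3 expected votes.
alone = is_quorate(node_votes["clxmrcati11"], expected)

# Two nodes that can see each other hold 2 of the 3 expected votes.
pair = is_quorate(
    node_votes["clxmrcati11"] + node_votes["clxmrcati12"], expected
)

print(f"threshold={quorum_threshold(expected)} alone={alone} pair={pair}")
```

If this reading is right, the quorum messages are a symptom rather than the cause: the nodes need to see at least one peer (for example over multicast) before any of them can become quorate.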
