Hi again, folks. One update here:

- I removed the bonding for the cluster heartbeat (bond0) and set it
up directly on eth0 on all nodes. This solved the membership issue: I
can now boot all 4 nodes, join the fence domain, and start clvmd on
them. Everything is stable, and I no longer see the random "openais
retransmit" messages.
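For reference, a minimal sketch of what the heartbeat-on-eth0 setup
looks like on CentOS 5; the address below is illustrative (each node
uses its own -priv address on the heartbeat VLAN):

  # /etc/sysconfig/network-scripts/ifcfg-eth0  (illustrative values)
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=172.16.1.1     # this node's -priv address
  NETMASK=255.255.255.0
  ONBOOT=yes

The old ifcfg-bond0 file and the MASTER=bond0/SLAVE=yes lines in the
slave interface configs go away, followed by a "service network
restart".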
Of course, I still have a problem :). I have 1 GFS filesystem and 16
GFS2 filesystems. I can mount all of them on node1 and node2 (same
building/switch), but when I run "service gfs2 start" on node3 or
node4 (the other building/switch), things become unstable and the
whole cluster fails with endless "cpg_mcast_retry RETRY_NUMBER"
messages. The log can be found here: http://pastebin.com/m2f26ab1d

What apparently happened is that, without the bonding setup, the
network layer became "simple" enough to handle membership, but it
still can't handle the GFS/GFS2 heartbeat traffic.

I've set the nodes to talk IGMPv2, as described at:
http://archives.free.net.ph/message/20081001.223026.9cf6d7bf.de.html
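For the record, the knob that message refers to is the
force_igmp_version sysctl; a minimal sketch, assuming eth0 is the
heartbeat interface:

  # force IGMPv2 (the kernel defaults to IGMPv3) on the heartbeat NIC
  sysctl -w net.ipv4.conf.eth0.force_igmp_version=2
  sysctl -w net.ipv4.conf.all.force_igmp_version=2
  # to persist across reboots, add to /etc/sysctl.conf:
  #   net.ipv4.conf.all.force_igmp_version = 2

Whether the membership reports actually go out as IGMPv2 can be
checked on each node with "tcpdump -i eth0 igmp".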
Well... any hints?

Thanks again.

--
Flávio do Carmo Júnior aka waKKu

On Thu, Apr 30, 2009 at 1:41 PM, Flavio Junior <[email protected]> wrote:
> Hi Abraham, thanks for your answer.
>
> I applied your suggestion to cluster.conf but still get the same
> problem.
>
> Here is what I did:
>  * Disabled the cman init script on boot on all nodes
>  * Edited the config file and copied it to all nodes
>  * Rebooted all of them
>  * Started cman on node1 (OK)
>  * Started cman on node2 (OK)
>  * Started cman on node3 (problems becoming a member; it fenced node2)
>
> Here is the log of this process up to the fence:
> http://pastebin.com/f477e7114
>
> PS: node1 and node2 are on the same switch at site1; node3 and node4
> are on the same switch at site2.
>
> Thanks again. Any other suggestions?
>
> I don't know if it would help, but: is corosync a feasible option for
> production use?
>
> --
>
> Flávio do Carmo Júnior aka waKKu
>
> On Wed, Apr 29, 2009 at 10:19 PM, Abraham Alawi <[email protected]> wrote:
> > If not tried already, the following settings in cluster.conf might help,
> > especially "clean_start":
> >
> >   <fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
> >
> > clean_start --> assume the cluster is in a healthy state upon startup
> > post_fail_delay --> seconds to wait before fencing a node that thinks it
> > should be fenced (i.e. one it has lost connection with)
> > post_join_delay --> seconds to wait before fencing any node that should be
> > fenced upon startup (right after joining)
> >
> > On 30/04/2009, at 8:21 AM, Flavio Junior wrote:
> >
> >> Hi folks,
> >>
> >> I've been trying to set up a 4-node RHCS+GFS cluster for a while. I have
> >> another 2-node cluster using CentOS 5.3 without problems.
> >>
> >> Well... My scenario is as follows:
> >>
> >> * System configuration and info: http://pastebin.com/f41d63624
> >>
> >> * Network:
> >> http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
> >>   * The switches in the loop are 3Com 2924 (or 2948)-SFP
> >>   * STP is enabled (RSTP auto)
> >>   * IGMP snooping is disabled, as per comment 32 at:
> >> http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
> >>   * The yellow lines are a 990 ft (330 m) single-mode fiber link
> >>   * I'm using a dedicated tagged VLAN for the cluster heartbeat
> >>   * I'm using 2 NICs with bonding mode=1 (active/backup) for the
> >> heartbeat and 4 NICs for the "public" network
> >>   * Each node has its four public cables plugged into the same switch,
> >> with link aggregation configured on it
> >>   * In the picture, the 2 switches joined by the fiber link at the
> >> bottom are where the nodes are plugged in; 2 nodes per building.
> >>
> >> SAN: http://img139.imageshack.us/img139/642/clusters.jpg
> >>   * Switches: Brocade TotalStorage 16SAN-B
> >>   * Storage: IBM DS4700 72A (using ERM for synchronous replication at
> >> the storage level)
> >>
> >> My problem is:
> >>
> >> I can't get all 4 nodes up. Every time the fourth (sometimes even the
> >> third) node comes online, one or two of them get fenced, and I keep
> >> getting openais/cman cpg_mcast_joined messages very often:
> >> --- snipped ---
> >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
> >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
> >> --- snipped ---
> >>
> >> Only seldom can I get a node to boot up and join the fence domain;
> >> almost every time it hangs, and I need to reboot and try again, or
> >> else reboot, enter single-user mode, disable cman, reboot, and keep
> >> trying "service cman start/stop". Sometimes the other nodes can see
> >> the node in the domain, but its boot hangs on "Starting fenced...":
> >>
> >> ########
> >> [r...@athos ~]# cman_tool services
> >> type    level  name     id        state
> >> fence   0      default  00010001  none
> >> [1 3 4]
> >> dlm     1      clvmd    00020001  none
> >> [1 3 4]
> >> [r...@athos ~]# cman_tool nodes -f
> >> Node  Sts  Inc   Joined               Name
> >>    0  M    0     2009-04-29 15:16:47  /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
> >>    1  M    7556  2009-04-29 15:16:35  athos-priv
> >>       Last fenced: 2009-04-29 15:13:49 by athos-ipmi
> >>    2  X    7820                       porthos-priv
> >>       Last fenced: 2009-04-29 15:31:01 by porthos-ipmi
> >>       Node has not been fenced since it went down
> >>    3  M    7696  2009-04-29 15:27:15  aramis-priv
> >>       Last fenced: 2009-04-29 15:24:17 by aramis-ipmi
> >>    4  M    8232  2009-04-29 16:12:34  dartagnan-priv
> >>       Last fenced: 2009-04-29 16:09:53 by dartagnan-ipmi
> >> [r...@athos ~]# ssh r...@aramis-priv
> >> ssh: connect to host aramis-priv port 22: Connection refused
> >> [r...@athos ~]# ssh r...@dartagnan-priv
> >> ssh: connect to host dartagnan-priv port 22: Connection refused
> >> [r...@athos ~]#
> >> #########
> >>
> >> (I know how unreliable ssh is as a test, but I'm also looking at the
> >> hung console screens; I'm just trying to show the state.)
> >>
> >> The BIG log file: http://pastebin.com/f453c220
> >> Every entry in this log after 16:54 is from when node2 (porthos-priv,
> >> 172.16.1.2) was booting and hung on "Starting fenced...".
> >>
> >> I have no more ideas for solving this problem; any hints are
> >> appreciated. If you need any other info, just tell me how to get it
> >> and I'll post it as soon as I've read your message.
> >>
> >> Many thanks in advance.
> >>
> >> --
> >>
> >> Flávio do Carmo Júnior aka waKKu
> >>
> >> --
> >> Linux-cluster mailing list
> >> [email protected]
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> > Abraham Alawi
> >
> > Unix/Linux Systems Administrator
> > Science IT
> > University of Auckland
> > e: [email protected]
> > p: +64-9-373 7599, ext#: 87572
> >
> > ''''''''''''''''''''''''''''''''''''''''''''''''''''''
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
