Hi,

I have a cluster with 13 working nodes in a dedicated VLAN, using 4 switches: a DELL 10G, a NetExtreme 10G and 2 x Netgear 1G for the nodes with 1G interfaces. We're running the latest Proxmox with all packages updated. There's one difference, though: 3 nodes use the 2.6.32-34-pve kernel, as they had IPv6 issues with the latest kernel (2.6.32-37-pve, which works on the other nodes).

Everything works fine until I try to add a new node. As soon as I do that, the whole GUI breaks (KVM keeps working, luckily) and "all hell breaks loose," as they say.

So far we have ruled out network card issues, as the problem occurs with different network cards. We have ruled out the switches, because all switches work fine before this happens AND we have also tried a 10G->1G GBIC module to connect the new node to a 10G switch. We have now ruled out the Fujitsu hardware entirely as well, because an HP machine also breaks the cluster.
IGMP snooping is disabled, multicast is working on both sides, tested with ssmping.
*clustat* shows that all nodes are online. *pvecm nodes* shows that everything is OK: all nodes have a "join" time and "M" in the Sts column. "Inc" differs, though.
*tcpdump* shows:
12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 1000 addr: ::, length 24
12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
Output from log files:
Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
Apr 9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
Apr 9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
Apr 9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
Apr 9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
Apr 9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
Apr 9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
Apr 9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
Apr 9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050
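In case it helps with comparing nodes, here is a rough sketch of how the pmxcfs messages above could be tallied per log. It only assumes the two message formats quoted in this mail; the function name summarize and the sample lines are made up for illustration, and nothing here is a Proxmox tool.

```python
import re
from collections import Counter

# Patterns for the two pmxcfs message formats quoted in the log excerpt.
JOIN_RETRY = re.compile(r"pmxcfs\[\d+\]: \[(\w+)\] notice: cpg_join retry (\d+)")
SEND_FAIL = re.compile(r"pmxcfs\[\d+\]: \[(\w+)\] crit: cpg_send_message failed: (\d+)")


def summarize(lines):
    """Return (highest cpg_join retry counter seen,
    Counter of cpg_send_message failures keyed by (subsystem, error code))."""
    max_retry = 0
    failures = Counter()
    for line in lines:
        m = JOIN_RETRY.search(line)
        if m:
            max_retry = max(max_retry, int(m.group(2)))
            continue
        m = SEND_FAIL.search(line)
        if m:
            failures[(m.group(1), int(m.group(2)))] += 1
    return max_retry, failures


# Illustrative input: a few lines in the same format as the excerpt.
sample = [
    "Apr 9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000",
    "Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010",
    "Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9",
    "Apr 9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9",
]
max_retry, failures = summarize(sample)
print("highest cpg_join retry:", max_retry)
for (subsystem, code), count in failures.items():
    print(f"[{subsystem}] cpg_send_message failed: {code} x{count}")
```

Running the same function over the log of each node would show whether the retries and send failures appear everywhere at once or only on some nodes.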
I have read that Proxmox tests with 16 working nodes, but there are reports of people using it with more than 16. Either way, I should still have plenty of room to go? Of course, we have had nodes that are no longer in the cluster (deleted), but I assume they don't count. :)

Any ideas where to look next?

All the best,
Sten
_______________________________________________ pve-user mailing list [email protected] http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
