I forgot to say, for 2): try not to put your Proxmox host IP on vmbr0 but directly on ethX; you shouldn't have that ethX attached to a bridge.
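For illustration, a minimal sketch of what 2) could look like (an editorial addition, not from the thread), reusing the addresses from Steffen's interfaces file quoted below. It assumes the cluster/management IP moves onto eth0 and that a spare NIC (a hypothetical eth2) takes over as the VM bridge port, since eth0 can then no longer be enslaved to vmbr0:

# host/cluster IP directly on the NIC, no bridge in between
auto eth0
iface eth0 inet static
        address 172.16.70.214
        netmask 255.255.255.0
        gateway 172.16.70.1

# VM bridge without a host IP of its own; eth2 is hypothetical
auto vmbr0
iface vmbr0 inet manual
        bridge_ports eth2
        bridge_stp off
        bridge_fd 0

With this layout, corosync traffic uses the plain NIC and is not exposed to the bridge's IGMP behaviour described further down in the thread.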
----- Original Message -----
From: [email protected]
To: "Alexandre DERUMIER" <[email protected]>
Cc: [email protected]
Sent: Tuesday, March 12, 2013 19:40:00
Subject: Re: [PVE-User] Unreliable

Hi,

> can you post your /etc/network/interfaces ?

root@kh-proxmox1:~# cat /etc/network/interfaces
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 172.16.70.214
        netmask 255.255.255.0
        gateway 172.16.70.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

auto vmbr1
iface vmbr1 inet static
        address 172.16.60.214
        netmask 255.255.255.0
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0

vmbr0 carries the rest of the network (including host traffic), while vmbr1 is connected only to the nodes and the storage (the storage is reached through this NIC).

> What you can try:
> 1) update to the latest pve-kernel from the pvetest repository
> 2) or try not to put your Proxmox host IP on vmbr0 but directly on ethX
> 3) or disable IGMP snooping on your Cisco switch:
>    # conf t
>    # no ip igmp snooping

Did anything from the above help you? How did you solve the problem? I'll give it a try anyway when it's late at night :-)

Thanks,
Steffen

On 12.03.2013 19:31, Alexandre DERUMIER wrote:
>>> Yes, there were problems with corosync and cman, I can remember that...
>>> something like "Member left membership", blah blah.
>>>
>>> The used Cisco switch is:
>>>
>>> Catalyst 2960-S series

> Good luck; I use a Cisco 2960G and I have noticed these problems too.
>
> What you can try:
>
> 1) update to the latest pve-kernel from the pvetest repository
>
> 2) or try not to put your Proxmox host IP on vmbr0 but directly on ethX
>
> 3) or disable IGMP snooping on your Cisco switch:
>    # conf t
>    # no ip igmp snooping
>
> For 1) and 2), verify that your Cisco switch has "ip igmp snooping querier" enabled.
>
> The problem is that the current Red Hat kernel sends IGMP queries from the Linux bridge
> to the network, which conflicts with Cisco switches.
> This behaviour was changed recently in the 3.5 kernel, but not in the Red Hat
> kernel, so we have patched it.
>
> I hope it'll resolve your problems :)
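Editorial sketch (the quoted thread continues below; neither command block appears in the original mails, so treat both as assumptions). For 1), on a PVE 2.x / Debian Squeeze node the pvetest kernel update would look roughly like this; the repository line is assumed and the kernel version is a placeholder:

# add the pvetest repository (assumed line for PVE 2.x on Squeeze)
echo "deb http://download.proxmox.com/debian squeeze pvetest" > /etc/apt/sources.list.d/pvetest.list
apt-get update
apt-get install pve-kernel-2.6.32-19-pve   # exact version is hypothetical

And checking (or enabling) the snooping querier on the 2960 from the IOS CLI, per Alexandre's note above:

show ip igmp snooping querier
configure terminal
ip igmp snooping querier
end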
>
>>> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum
>>>
>>> Could the problem also be the same problem described in the
>>> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through
>>> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and
>>> to connect them and the hosts to the rest of the network (so "host"
>>> traffic also goes through this VLAN)...

> can you post your /etc/network/interfaces ?
>
> ----- Original Message -----
>
> From: [email protected]
> To: "Alexandre DERUMIER" <[email protected]>
> Cc: [email protected]
> Sent: Tuesday, March 12, 2013 18:58:31
> Subject: Re: [PVE-User] Unreliable
>
> Hi,
>
> yes, there were problems with corosync and cman, I can remember that...
> something like "Member left membership", blah blah.
>
> The used Cisco switch is:
>
> Catalyst 2960-S series
> Product ID: WS-C2960S-24TS-S
> Version ID: V02
> Software: 12.2(55)SE3
>
> Here are some of the logs I found (daemon.log):
>
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_dispatch failed: 2
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_dispatch failed: 2
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch failed: 2
> Feb 12 14:50:01 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch failed: 2
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:06 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_initialize failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_initialize failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster connection
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster connection
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize failed: 6
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 9
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 9
>
> And then it continues with the last line for thousands of lines... (meaning the node has lost quorum in the cluster).
>
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data syncronisation
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data syncronisation
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync request (epoch 1/1579/0000000A)
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync request (epoch 1/1579/0000000A)
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all states
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: leader is 1/1579
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: synced members: 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up to date
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all states
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up to date
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_dispatch failed: 2
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_dispatch failed: 2
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch failed: 2
> Feb 12 16:06:50 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch failed: 2
> Feb 12 16:06:54 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_initialize failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_initialize failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: start cluster connection
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize failed: 6
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 9
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 9
>
> Again you see that first all nodes are synced, and then the node loses quorum again.
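Editorial sketch (the quoted mail continues below): since both these logs and Alexandre's diagnosis point at multicast, a quick cross-check between the nodes is omping, run simultaneously on all of them. This is a suggestion of mine, not part of the thread, and the host names are placeholders inferred from the logs:

# run the same command on every node at roughly the same time;
# kh-proxmox1/2/3 stand in for the real node names
omping -c 600 -i 1 -q kh-proxmox1 kh-proxmox2 kh-proxmox3

A result of 0% multicast loss per peer suggests multicast is healthy; sustained loss, or peers stuck at "waiting for response msg", points at IGMP snooping dropping the traffic on the switch.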
>
> It's like the one described in this post:
>
> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum
>
> Could the problem also be the same problem described in the
> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through
> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and
> to connect them and the hosts to the rest of the network (so "host"
> traffic also goes through this VLAN)...
>
>> pveperf
> CPU BOGOMIPS:      55876.08
> REGEX/SECOND:      1476041
> HD SIZE:           94.49 GB (/dev/mapper/pve-root)
> BUFFERED READS:    144.93 MB/sec
> AVERAGE SEEK TIME: 8.15 ms
> FSYNCS/SECOND:     30.69
> DNS EXT:           58.96 ms
>
> Thanks,
> Steffen Wagner
>
> P.S. Sorry Alexandre, I pressed the wrong button :-)
>
> On 12.03.2013 17:49, Alexandre DERUMIER wrote:
>> Hi Steffen,
>>
>> It seems that you have multicast errors/hangs which cause the corosync errors.
>> Which physical switches do you use? (I ask because we have found a
>> multicast bug involving a feature of the current kernel and Cisco switches.)
>>
>> 2013/3/12 Steffen Wagner <[email protected]>
>>
>> Hi,
>>
>> I had a similar problem with 2.2.
>> I had rgmanager for HA features running on high-end hardware (Dell, QNAP and
>> Cisco). After about three days, one of the nodes (it wasn't always the same!)
>> left the quorum (the log said something like 'node 2 left, x nodes remaining in
>> cluster, fencing node 2'). After that, the node was always successfully
>> fenced... so I disabled fencing and changed it to 'hand'. Then the node
>> didn't shut down anymore. It remained online with all VMs, but the cluster
>> said the node was offline (at reboot the node got stuck at the pve rgmanager
>> service; only a hard reset was possible).
>>
>> In the end I disabled HA and now run the nodes only in cluster mode without
>> fencing... working until now (3 months) without any problems... a pity,
>> because I want to use the HA features, but I don't know what's wrong.
>>
>> My network setup is similar to Fabio's. I'm using VLANs, one for the storage
>> interface and one for the rest...
>>
>> For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on
>> the mailing list is happy :-)
>>
>> Kind regards,
>> Steffen Wagner

--
Steffen Wagner

Im Obersteig 31
D-76879 Hochstadt / Pfalz

M +49 (0) 1523 3544688
F +49 (0) 6347 918475
E [email protected]

_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
