I forgot to say:

For 2): try not putting your Proxmox host IP on vmbr0, but directly on ethX.

In that case ethX should not be attached to a bridge.
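
A sketch of what that change could look like in /etc/network/interfaces (the address, netmask and gateway are copied from the config quoted below; treat this as an illustration only, not a tested setup):

```
# host IP directly on the physical NIC instead of on the bridge
auto eth0
iface eth0 inet static
        address 172.16.70.214
        netmask 255.255.255.0
        gateway 172.16.70.1
```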


----- Mail original ----- 

De: [email protected] 
À: "Alexandre DERUMIER" <[email protected]> 
Cc: [email protected] 
Envoyé: Mardi 12 Mars 2013 19:40:00 
Objet: Re: [PVE-User] Unreliable 

Hi, 

> can you post your /etc/network/interfaces ? 

root@kh-proxmox1:~# cat /etc/network/interfaces 
# network interface settings 
auto lo 
iface lo inet loopback 

iface eth0 inet manual 

iface eth1 inet manual 

auto vmbr0 
iface vmbr0 inet static 
address 172.16.70.214 
netmask 255.255.255.0 
gateway 172.16.70.1 
bridge_ports eth0 
bridge_stp off 
bridge_fd 0 

auto vmbr1 
iface vmbr1 inet static 
address 172.16.60.214 
netmask 255.255.255.0 
bridge_ports eth1 
bridge_stp off 
bridge_fd 0 

Here vmbr0 carries the rest of the network (and the host traffic), while vmbr1 
is connected only to the nodes + storage (the storage is reached through this NIC). 

> What you can try: 

1) update to the latest pve-kernel from the pvetest repository 

2) or try not putting your Proxmox host IP on vmbr0 but directly on ethX 

3) or disable IGMP snooping on your Cisco switch: 
(#conf t 
#no ip igmp snooping 
) 
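
For 3), the full sequence on the switch would look roughly like this (standard IOS syntax; a sketch only, the exact commands can vary with the IOS release, and the final `show` command just verifies the result):

```
switch# configure terminal
switch(config)# no ip igmp snooping
switch(config)# end
switch# show ip igmp snooping
```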

Did anything from the above help you? How did you solve the problem? I'll 
give it a try anyway when it's late at night :-) 


Thanks, 
Steffen 

On 12.03.2013 at 19:31, Alexandre DERUMIER wrote: 
>>> yes, there were problems with corosync and cman, I can remember that.... 
>>> something like "Member left membership"... blah blah. 
>>> 
>>> The used Cisco Switch is: 
>>> 
>>> Catalyst 2960-S Series 
> No luck for you: I use a Cisco 2960G and I have noticed these problems too. 
> 
> What you can try: 
> 
> 1) update to the latest pve-kernel from the pvetest repository 
> 
> 2) or try not putting your Proxmox host IP on vmbr0 but directly on ethX 
> 
> 3) or disable IGMP snooping on your Cisco switch: 
> (#conf t 
> #no ip igmp snooping 
> ) 
> 
> 
> For 1 and 2, verify that your Cisco switch has "ip igmp snooping querier" 
> 
> The problem is that the current Red Hat kernel sends IGMP queries from the Linux bridge 
> to the network, which conflicts with Cisco switches. 
> This behaviour was changed recently in the 3.5 kernel, but not in the Red Hat 
> kernel, so we have patched it. 
> 
> I hope it'll resolve your problems :) 
> 
> 
> 
> 
> 
>>> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum 
>>> 
>>> Could the problem also be the same one described in the 
>>> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through 
>>> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and 
>>> connect them and the hosts to the rest of the network (so "host" 
>>> traffic also goes through this VLAN)... 
> can you post your /etc/network/interfaces ? 
> 
> ----- Mail original ----- 
> 
> De: [email protected] 
> À: "Alexandre DERUMIER" <[email protected]> 
> Cc: [email protected] 
> Envoyé: Mardi 12 Mars 2013 18:58:31 
> Objet: Re: [PVE-User] Unreliable 
> 
> Hi, 
> 
> yes, there were problems with corosync and cman, I can remember that.... 
> something like "Member left membership"... blah blah. 
> 
> The used Cisco Switch is: 
> 
> Catalyst 2960-S Series 
> Product ID: WS-C2960S-24TS-S 
> Version ID: V02 
> Software: 12.2(55)SE3 
> 
> Here are some of the logs I found (daemon.log): 
> 
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_dispatch 
> failed: 2 
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_dispatch 
> failed: 2 
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch 
> failed: 2 
> Feb 12 14:50:01 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2 
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch 
> failed: 2 
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:06 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2 
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: 
> quorum_initialize failed: 6 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't 
> initialize service 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [confdb] crit: 
> confdb_initialize failed: 6 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't 
> initialize service 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster 
> connection 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize 
> failed: 6 
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't 
> initialize service 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster 
> connection 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize 
> failed: 6 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't 
> initialize service 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 9 
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: 
> cpg_send_message failed: 9 
> 
> 
> And then it continues with the last line for thousands of lines... (which 
> means that the node lost quorum in the cluster.) 
> 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 
> 1/1579, 2/1535 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data 
> syncronisation 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 
> 1/1579, 2/1535 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data 
> syncronisation 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 
> 1/1579, 2/1535, 3/398566 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 
> 1/1579, 2/1535, 3/398566 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync 
> request (epoch 1/1579/0000000A) 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync 
> request (epoch 1/1579/0000000A) 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all 
> states 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: leader is 1/1579 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: synced members: 
> 1/1579, 2/1535, 3/398566 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up 
> to date 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all 
> states 
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up 
> to date 
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_dispatch 
> failed: 2 
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_dispatch 
> failed: 2 
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch 
> failed: 2 
> Feb 12 16:06:50 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2 
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch 
> failed: 2 
> Feb 12 16:06:54 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [libqb] warning: 
> epoll_ctl(del): Bad file descriptor (9) 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: 
> quorum_initialize failed: 6 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't 
> initialize service 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [confdb] crit: 
> confdb_initialize failed: 6 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't 
> initialize service 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: start cluster 
> connection 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize 
> failed: 6 
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't 
> initialize service 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 2 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize 
> failed: 6 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't 
> initialize service 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 9 
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: 
> cpg_send_message failed: 9 
> 
> Again you can see that first all the nodes are synced, and then the node loses quorum again. 
> 
> It's just like the problem described in this post: 
> 
> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum 
> 
> Could the problem also be the same one described in the 
> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through 
> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and 
> connect them and the hosts to the rest of the network (so "host" 
> traffic also goes through this VLAN)... 
> 
>> pveperf 
> CPU BOGOMIPS: 55876.08 
> REGEX/SECOND: 1476041 
> HD SIZE: 94.49 GB (/dev/mapper/pve-root) 
> BUFFERED READS: 144.93 MB/sec 
> AVERAGE SEEK TIME: 8.15 ms 
> FSYNCS/SECOND: 30.69 
> DNS EXT: 58.96 ms 
> 
> Thanks, 
> Steffen Wagner 
> 
> P.S. Sorry, Alexandre, I pressed the wrong button :-) 
> 
> On 12.03.2013 at 17:49, Alexandre DERUMIER wrote: 
>> Hi Steffen, 
>> 
>> It seems that you have multicast errors/hangs which cause the corosync errors. 
>> Which physical switches do you use? (I ask this because we have found a 
>> multicast bug with a feature of the current kernel and Cisco switches.) 
>> 
>> 
>> 
>> 
>> 2013/3/12 Steffen Wagner < [email protected] > 
>> 
>> 
>> Hi, 
>> 
>> I had a similar problem with 2.2. 
>> I had rgmanager for HA features running on high-end hardware (Dell, QNAP and 
>> Cisco). After about three days, one of the nodes (it wasn't always the same!) 
>> left the quorum (the log said something like 'node 2 left, x nodes remaining in 
>> cluster, fencing node 2'). After that, the node was always successfully 
>> fenced... so I disabled fencing and changed it to 'hand'. Then the node 
>> didn't shut down anymore. It remained online with all VMs, but the cluster 
>> said the node was offline (at reboot the node got stuck at the pve rgmanager 
>> service; only a hard reset was possible). 
>> 
>> In the end I disabled HA and now run the nodes only in cluster mode without 
>> fencing... working until now (3 months) without any problems... a pity, 
>> because I want to use the HA features, but I don't know what's wrong. 
>> 
>> My network setup is similar to Fabio's. I'm using VLANs, one for the storage 
>> interface and one for the rest..... 
>> 
>> For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on 
>> the mailing list is happy :-) 
>> 
>> 
>> Kind regards, 
>> Steffen Wagner 

-- 
Steffen Wagner 
Im Obersteig 31 
D-76879 Hochstadt / Pfalz 

M +49 (0) 1523 3544688 
F +49 (0) 6347 918475 
E [email protected] 
_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
