Hi,

yes, there were problems with corosync and cman. I can remember that... something like "Member left membership" and so on.

The Cisco switch in use is:

Catalyst 2960-S series
Product ID: WS-C2960S-24TS-S
Version ID: V02
Software: 12.2(55)SE3

Here are some of the logs I found (daemon.log):

Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_dispatch failed: 2
Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_dispatch failed: 2
Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch failed: 2
Feb 12 14:50:01 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch failed: 2
Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:06 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_initialize failed: 6
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_initialize failed: 6
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster connection
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize failed: 6
Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 2
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster connection
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize failed: 6
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't initialize service
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 9
Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit: cpg_send_message failed: 9


And then it continues with the last line for thousands of lines... (which means the node lost quorum in the cluster.)

Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data syncronisation
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data syncronisation
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535, 3/398566
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members: 1/1579, 2/1535, 3/398566
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync request (epoch 1/1579/0000000A)
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync request (epoch 1/1579/0000000A)
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all states
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: leader is 1/1579
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: synced members: 1/1579, 2/1535, 3/398566
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up to date
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all states
Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up to date
Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_dispatch failed: 2
Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_dispatch failed: 2
Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch failed: 2
Feb 12 16:06:50 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch failed: 2
Feb 12 16:06:54 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_initialize failed: 6
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_initialize failed: 6
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: start cluster connection
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize failed: 6
Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 2
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize failed: 6
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't initialize service
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 9
Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit: cpg_send_message failed: 9

Again you can see that first all nodes are synced, and then the node loses quorum again.
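By the way, that join-then-drop pattern (members sync fine, then fall out again shortly after) is what people describe when IGMP snooping on the switch ages out the multicast group because no IGMP querier is active on the VLAN. I'm not sure this is our cause, but if it is, enabling the snooping querier on the 2960-S might keep the corosync group alive. A sketch of the IOS config (just an assumption on my side, not verified on this exact switch/software):

```
! Hypothetical fix: run an IGMP snooping querier on the switch so that
! the corosync multicast group membership does not time out on the VLAN.
conf t
 ip igmp snooping querier
 end
```

After changing this, it would probably take a few minutes of watching the cluster to see whether the quorum losses stop recurring.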

It's like the problem described in this post:

http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum

Could the problem also be the same one described in the last post? The nodes are connected to the iSCSI storage (QNAP NAS) through the Cisco switch on VLANX. The other NIC is used to bridge the VMs and to connect them and the hosts to the rest of the network (so "host" traffic also goes through this VLAN)...
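To separate host-side multicast problems from switch-side ones, I could run a quick local sanity check first. This is just a hypothetical helper I sketched (not Proxmox tooling; the group address and port are arbitrary test values, not corosync's real ones): it joins a test multicast group and checks whether a datagram loops back on the local host. If even local loopback fails, the host's multicast stack is broken; if it works, the switch becomes the main suspect. For the real node-to-node test, running omping between the cluster nodes is, as far as I know, the usual approach.

```python
# Hypothetical local multicast sanity check: join a test group, send one
# datagram, and verify it loops back on this host. Group/port are made up.
import socket
import struct

GROUP = "239.192.0.99"   # arbitrary test group, NOT corosync's address
PORT = 50405

def multicast_loopback_ok(timeout=2.0):
    # Receiver: bind to the test port and join the multicast group.
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    recv.bind(("", PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    recv.settimeout(timeout)

    # Sender: make sure locally sent multicast is looped back to us.
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)

    try:
        send.sendto(b"corosync-test", (GROUP, PORT))
        data, _ = recv.recvfrom(1024)
        return data == b"corosync-test"
    except (socket.timeout, OSError):
        # No route for multicast or nothing received in time.
        return False
    finally:
        recv.close()
        send.close()

print(multicast_loopback_ok())
```

If this prints True on every node but corosync still drops out after a few minutes, that would point away from the hosts and toward the switch's IGMP handling.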

>pveperf
CPU BOGOMIPS:      55876.08
REGEX/SECOND:      1476041
HD SIZE:           94.49 GB (/dev/mapper/pve-root)
BUFFERED READS:    144.93 MB/sec
AVERAGE SEEK TIME: 8.15 ms
FSYNCS/SECOND:     30.69
DNS EXT:           58.96 ms

Thanks,
Steffen Wagner

P.S. Sorry Alexandre, I pressed the wrong button :-)

On 12.03.2013 17:49, Alexandre DERUMIER wrote:
Hi Steffen,

It seems that you have multicast errors/hangs which cause the corosync errors.
What physical switches do you use? (I ask this because we have found a
multicast bug involving a feature of the current kernel and Cisco switches.)




2013/3/12 Steffen Wagner <[email protected]>


Hi,

I had a similar problem with 2.2.
I had rgmanager for HA features running on high-end hardware (Dell, QNAP and
Cisco). After about three days one of the nodes (it wasn't always the same!)
left the quorum (the log said something like 'node 2 left, x nodes remaining in
cluster, fencing node 2.'). After that, the node was always successfully
fenced... so I disabled fencing and changed it to 'hand'. Then the node didn't
shut down anymore. It remained online with all VMs, but the cluster said the
node was offline (at reboot the node got stuck at the pve rgmanager service;
only a hard reset was possible).

In the end I disabled HA and now run the nodes only in cluster mode without
fencing... it has been working for 3 months now without any problems... a pity,
because I want to use the HA features, but I don't know what's wrong.

My network setup is similar to Fabio's. I'm using VLANs: one for the storage
interface and one for everything else...

For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on the
mailing list is happy :-)


Kind regards,
Steffen Wagner

--
Steffen Wagner
Im Obersteig 31
D-76879 Hochstadt / Pfalz

M +49 (0) 1523 3544688
F +49 (0) 6347 918475
E [email protected]

_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
