Hi Steffen,

It seems that you have multicast errors/hangs, which cause the corosync errors.
What physical switches do you use? (I ask because we have found a
multicast bug involving a feature of the current kernel and Cisco switches.)
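
If you want to check whether multicast really makes it through the switch
between two nodes, omping is the usual tool; a minimal hand-rolled test could
also look roughly like the sketch below (the group and port are just test
values I picked, not necessarily what your corosync config uses - run it with
"recv" on one node and without arguments on another):

# rough multicast sanity check - group/port below are assumptions, adjust as needed
import socket, struct, sys, time

GROUP = "239.192.0.1"   # test group (assumption)
PORT  = 5405            # test port (assumption)

def receiver():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # join the multicast group on all interfaces
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = sock.recvfrom(1024)
        print("got %r from %s" % (data, addr[0]))

def sender():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    for i in range(20):
        sock.sendto(("ping %d" % i).encode(), (GROUP, PORT))
        time.sleep(0.5)

if __name__ == "__main__":
    receiver() if sys.argv[1:] == ["recv"] else sender()

If the receiver stops seeing packets after a few minutes, IGMP snooping on the
switch (or a missing IGMP querier) is a common suspect.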




2013/3/12 Steffen Wagner < [email protected] > 


Hi, 

I had a similar problem with 2.2.
I had rgmanager running for the HA features on high-end hardware (Dell, QNAP and
Cisco). After about three days one of the nodes (it wasn't always the same one!)
left the quorum; the log said something like 'node 2 left, x nodes remaining in
cluster, fencing node 2.' Back then the node was always fenced successfully...
so I disabled fencing and switched it to manual. After that the node didn't
shut down anymore. It stayed online with all VMs, but the cluster reported the
node as offline (on reboot the node got stuck at the pve rgmanager service, and
only a hard reset was possible).

In the end I disabled HA and now run the nodes in cluster mode only, without
fencing... it has been working for 3 months now without any problems... a pity,
because I want to use the HA features, but I don't know what's wrong.

My network setup is similar to Fábio's. I'm using VLANs, one for the storage
interface and one for everything else.

For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on the
mailing list is happy :-)


Kind regards,
Steffen Wagner 
-- 

Im Obersteig 31 
76879 Hochstadt/Pfalz 

E [email protected] 
M 01523/3544688 
F 06347/918474 

Fábio Rabelo < [email protected] > wrote:

>2013/3/12 Andreu Sànchez i Costa < [email protected] > 
> 
>> Hello Fábio, 
>> 
>> On 12/03/13 01:00, Fábio Rabelo wrote:
>> 
>> 
>> 2.3 does not have the reliability 1.9 has!
>>
>> I have been struggling with it for 3 months, my deadlines are gone, and I cannot
>> make it work for more than 3 days without an issue...
>> 
>> 
>> I cannot give my opinion about 2.3, but 2.2.x works perfectly for us; I
>> only had to change the I/O elevator to deadline because CFQ had performance
>> problems with our P2000 iSCSI disk array.
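>>
>> For reference, the scheduler can be switched at runtime per block device via
>> sysfs, and made permanent with the elevator=deadline kernel boot parameter.
>> A rough Python sketch of the runtime check/switch (the device name 'sda' is
>> just an example; run as root):
>>
>> # read and change the active I/O scheduler of one block device;
>> # the active one is shown in brackets, e.g. "noop [deadline] cfq"
>> SCHED = "/sys/block/{dev}/queue/scheduler"
>>
>> def current_scheduler(dev):
>>     with open(SCHED.format(dev=dev)) as f:
>>         return f.read().strip()
>>
>> def set_scheduler(dev, name):
>>     with open(SCHED.format(dev=dev), "w") as f:
>>         f.write(name)
>>
>> if __name__ == "__main__":
>>     print(current_scheduler("sda"))    # example device, adjust to yours
>>     set_scheduler("sda", "deadline")
>>     print(current_scheduler("sda"))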
>> 
>> As other list members asked, what are your main problems? 
>> 
>> 
>I have already described the problems here several times.
> 
>This is a five-node cluster built on dual-Opteron Supermicro motherboards.
> 
>The storage server uses the same motherboard as the five nodes, but in a chassis
>with 16 3.5" HD slots, 12 of which are occupied by WD enterprise disks.
> 
>The storage runs NAS4Free (I already tried FreeNAS, same result).
> 
>Like I said, since I installed PVE 1.9 everything has been working fine, for 9
>days now and counting.
> 
>Each of the five nodes has 2 embedded network ports, connected to a Linksys
>switch, which I use to serve the VMs.
> 
>In one PCIe slot there is an Intel 10 Gb card that talks to a Supermicro
>10 Gb switch used exclusively for communication between the five nodes and the
>storage.
> 
>This switch has no link to anything else.
> 
>On the storage, I use one of the embedded ports for management, and all images
>are served through the 10 Gb card.
> 
>After some time, between 1 and 3 days of the system working, the nodes stop
>talking to the storage.
> 
>When it happens, the log shows lots of messages like this:
> 
>Mar 6 17:15:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:15:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:15:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:15:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:09 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:19 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
>Mar 6 17:16:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
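>
>A quick way to tell whether the storage is really unreachable, or pvestatd is
>just stuck, is to probe the storage address directly from the node, roughly
>like this (host and port are placeholders - 2049 for NFS, 3260 for iSCSI):
>
># connect to the storage's service port every 10s and report the result
>import socket, time
>
>HOST = "storage10g"   # placeholder for the storage's 10 Gb address
>PORT = 2049           # NFS; use 3260 for iSCSI
>
>def probe(host, port, timeout=3.0):
>    try:
>        start = time.time()
>        with socket.create_connection((host, port), timeout=timeout):
>            return time.time() - start
>    except OSError as exc:
>        return exc
>
>while True:
>    result = probe(HOST, PORT)
>    if isinstance(result, float):
>        print("reachable, connect took %.3fs" % result)
>    else:
>        print("NOT reachable: %s" % result)
>    time.sleep(10)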
> 
> 
> 
>After that, if I try to restart the pve daemon, it refuses to.
> 
>If I try to reboot the server, it hangs at the point where the PVE daemon should
>stop, and stays there forever.
> 
>The only way to reboot any of the nodes is a hard reset!
> 
>At first my suspicion fell on the storage, so I changed from FreeNAS to NAS4Free:
>same thing. Desperation!
> 
>Then, as a test, I installed PVE 1.9 on all five nodes (I have 2 other systems
>running it for 3 years with no issues; this new system is meant to replace both).
> 
>Like I said, 9 days and counting!
> 
>So there is no problem with the hardware, and there is no problem with
>NAS4Free!
> 
>What is left?!
> 
> 
>Fábio Rabelo 
> 
>_______________________________________________ 
>pve-user mailing list 
> [email protected] 
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user 




_______________________________________________ 
pve-user mailing list 
[email protected] 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user 