Hi Szabolcs,

On 10/25/2016 12:24 PM, Szabolcs F. wrote:
> Hi Alwin,
> 
> bond0 is on two Cisco 4948 switches and bond1 is on two Cisco N3K-3064PQ
> switches. They worked fine for about two months in this setup. But last
> week (after I started to have these issues) I powered down one Cisco 4948
> and one N3K-3064PQ switch (in both cases the designated backup switches
> were powered down). This was to make sure all servers use the same switch as
> their active link. After that I stopped the Proxmox cluster (all nodes) and
> started them again, but the issue occurred again.

Ok, so that's one thing less to check. How are the remaining switches configured,
especially the one the PVE cluster traffic runs over? Do they use jumbo frames?
Or some kind of network/port security?
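
If jumbo frames are involved, you could verify the MTU end-to-end with a
plain ping test; the node names below are just placeholders:

    # show the configured MTU on the bond/bridge
    ip link show bond0
    ip link show vmbr0

    # largest payload that fits a 1500 MTU path without fragmentation
    # (1472 = 1500 - 20 byte IP header - 8 byte ICMP header)
    ping -M do -s 1472 -c 3 <other-node>

    # for a 9000 MTU (jumbo) path the equivalent payload would be 8972
    ping -M do -s 8972 -c 3 <other-node>

If the large ping fails while the small one works, the NICs and the switch
ports don't agree on the MTU.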

> 
> I've just added the 'bond_primary ethX' option to the interfaces file. I'll
> reboot everything once again and see if it helps.

That option only takes effect when all links are connected and you want to
prefer one link as the primary, e.g. 10GbE as primary and 1GbE as backup.
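
Just as a rough sketch (interface names and addresses are only examples,
adjust them to your setup), an active-backup bond with a preferred primary
in /etc/network/interfaces would look something like this:

    auto bond0
    iface bond0 inet manual
        slaves eth0 eth1
        bond_mode active-backup
        bond_miimon 100
        bond_primary eth0

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        gateway 192.0.2.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

You can check which slave is currently active (and which one is set as
primary) in /proc/net/bonding/bond0.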

> 
> syslog: http://pastebin.com/MsuCcNx8
> dmesg: http://pastebin.com/xUPMKDJR
> pveproxy (I can only see access.log for pveproxy, so this is the service
> status): http://pastebin.com/gPPb4F3x

I couldn't find anything unusual in those, but that doesn't mean there isn't anything.
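
A few more places that are usually worth a look when the cluster acts up
(the time range is just an example):

    journalctl -u corosync --since "1 hour ago"
    journalctl -u pve-cluster --since "1 hour ago"
    journalctl -u pveproxy --since "1 hour ago"
    journalctl -u pvedaemon --since "1 hour ago"

pve-cluster is the unit behind pmxcfs, so that one often shows why /etc/pve
went away.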

> 
> What other logs should I be reading?
> 
> Thanks
> 
> On Tue, Oct 25, 2016 at 11:23 AM, Alwin Antreich <sysadmin-...@cognitec.com>
> wrote:
> 
>> Hi Szabolcs,
>>
>> On 10/25/2016 10:01 AM, Szabolcs F. wrote:
>>> Hi Alwin,
>>>
>>> thanks for your hints.
>>>
>>>> On which interface is Proxmox running? Are these interfaces clogged
>>>> because there is some heavy network I/O going on?
>>> I've got my two Intel Gbps network interfaces bonded together (bond0) as
>>> active-backup and vmbr0 is bridged on this bond, so Proxmox is running on
>>> this interface, i.e. http://pastebin.com/WZKQ02Qu
>>> All nodes are configured like this. There is no heavy I/O on these
>>> interfaces, because the storage network uses the separate 10Gbps fiber
>>> Intel NICs (bond1).
>>
>> Is your bond working properly? Is the bond on a single switch or spread
>> across two different ones?
>>
>> Side note: I usually add the "bond_primary ethX" option to set the
>> interface that should be used as the primary in an active-backup
>> configuration. :-)
>>
>> What are the logs on the server showing? You know, syslog, dmesg,
>> pveproxy, etc. ;-)
>>
>>>
>>>> Another guess: are all servers synchronizing with an NTP server, and do
>>>> they have the correct time?
>>> Yes, NTP is working properly; the firewall lets all NTP requests go
>>> through.
>>>
>>>
>>> On Mon, Oct 24, 2016 at 5:19 PM, Alwin Antreich <sysadmin-...@cognitec.com>
>>> wrote:
>>>
>>>> Hello Szabolcs,
>>>>
>>>> On 10/24/2016 03:16 PM, Szabolcs F. wrote:
>>>>> Hello,
>>>>>
>>>>> I've got a Proxmox VE 4.3 cluster of 12 nodes. All of them are Dell C6220
>>>>> sleds. Each has 2x Intel Xeon E5-2670 CPUs and 64GB RAM. I've got two
>>>>> separate networks: 1Gbps LAN (Cisco 4948 switch) and 10Gbps storage (Cisco
>>>>> N3K-3064PQ fiber switch). The Dell nodes use the integrated Intel Gbit
>>>>> adapters for LAN and Intel PCI-E 10Gbps cards for the fiber network (ixgbe
>>>>> driver). The storage servers are separate; they run FreeNAS and export the
>>>>> shares with NFS. My virtual machines (I've made about 40 of them so far)
>>>>> are KVM/QCOW2 and they are stored on the FreeNAS storage. So far so good.
>>>>> I've been using this environment as a test and was almost ready to push it
>>>>> into production.
>>>> On which interface is Proxmox running? Are these interfaces clogged
>>>> because there is some heavy network I/O going on?
>>>>>
>>>>> But I have a problem with the cluster. From time to time the pveproxy
>>>>> service dies on the nodes, or the web UI lists all nodes (except the one
>>>>> I'm actually logged into) as unreachable (red cross). Sometimes all nodes
>>>>> are listed as working (green status), but if I try to connect to a virtual
>>>>> machine I get a 'connection refused' error. When the cluster acts up I
>>>>> can't do any VM migration or any other VM management (e.g. console,
>>>>> start/stop/reset, new VM, etc.). When it happens the only way to recover
>>>>> is powering down all 12 nodes and starting them one after another. Then
>>>>> everything works properly for a random amount of time: sometimes for
>>>>> weeks, sometimes for only a few days.
>>>> Another guess: are all servers synchronizing with an NTP server, and do
>>>> they have the correct time?
>>>>>
>>>>> I followed the network troubleshooting guide with omping, multicast, etc.
>>>>> and confirmed multicast is enabled; the troubleshooting didn't
>>>>> return any error. The /etc/hosts file is configured on all nodes with the
>>>>> proper hostname/IP list of all nodes.
>>>>> When trying to do 'service pve-cluster restart' I get these errors:
>>>>> http://pastebin.com/NXnEf4rd (running pmxcfs manually mounts /etc/pve
>>>>> properly, but doesn't fix the cluster/proxy issue)
>>>>> pvecm status: http://pastebin.com/jsDFkqu3 (I powered down one node,
>>>>> that's why it's missing)
>>>>> pvecm nodes: http://pastebin.com/1WR8Yij8
>>>>> Corosync has a lot of these in /var/log/daemon.log:
>>>>> http://pastebin.com/ajhE8Rb9
>>>>>
>>>>> Someone please help!
>>>>>
>>>>> Thanks,
>>>>> Szabolcs
>>>>
>>>> --
>>>> Cheers,
>>>> Alwin
>>
>> --
>> Cheers,
>> Alwin

When that happens, is the network working correctly between hosts?
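
For example (node names are placeholders), the next time the GUI goes red
you could quickly check from one of the affected nodes:

    # basic reachability on the cluster network
    ping -c 3 <other-node>

    # corosync ring / membership state
    corosync-cfgtool -s
    pvecm status

    # multicast between the nodes (run on all nodes at the same time)
    omping -c 60 -i 1 -q <node1> <node2> <node3>

If corosync has lost quorum while plain ping still works, that points more
towards multicast/switch issues than towards pveproxy itself.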

-- 
Cheers,
Alwin
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
