Hi,

Let me start off by saying that I am not pointing fingers at anyone, merely looking for how to prevent sh*t from happening again!
Last month I emailed about issues with pve-firewall. I was told that there were fixes in the newest packages, so this maintenance window I started with upgrading pve-firewall before anything else. That went well for just about all the clusters I upgraded. Then I ended up at the last (biggest, 9 nodes) cluster, and things got pretty ugly. Here's what happened:

1. I enabled IPv6 on the cluster interfaces last month. I've done this before on other clusters, nothing special there. I added the IPv6 addresses on the interfaces and added all nodes to all the /etc/hosts files. I've had issues in the past with clusters not being able to start because hostnames could not be resolved, so every node in all my clusters has the hostnames and addresses of its respective peers in /etc/hosts.

2. I upgraded pve-firewall on all the nodes, no issues there.

3. I started dist-upgrading on proxmox01 and proxmox02, restarting pve-firewall with `pve-firewall restart` because of [1], and noticed that `pvecm status` did not list any of the other nodes as peers. So we had:

   proxmox01: proxmox01
   proxmox02: proxmox02
   proxmox03-proxmox09: proxmox03-proxmox09

   Obviously, /etc/pve was read-only on proxmox01 and proxmox02, since they had no quorum.

4. HA is heavily used on this cluster; just about all VMs have it enabled. Since 'I changed nothing', I restarted pve-cluster a few times on the broken nodes. Nothing helped.

5. I then restarted pve-cluster on proxmox03, and all of a sudden proxmox01 looked happy again.

6. In the meantime, ha-manager had kicked in and started VMs on other nodes, but did not actually let proxmox01 fence itself. I did not notice this at the time.

7. I tried restarting pve-cluster on yet another node, and then all nodes except proxmox01 and proxmox02 fenced themselves, all rebooting at the same time.

After rebooting, the cluster was still not completely happy, because the firewall was still confused. So why was this firewall confused? Nothing changed, remember? Well, nothing except step 1.
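To make step 1 concrete: the /etc/hosts files on each node ended up carrying both an IPv4 and an IPv6 entry per peer, roughly like the sketch below. The addresses and the domain suffix here are hypothetical placeholders; only the 192.168.1.0/24 network is from our actual setup.

```
# /etc/hosts on each cluster node (excerpt, example addresses only):
# every peer listed twice, once for IPv4 and once for the new IPv6.
192.168.1.1    proxmox01.example.local proxmox01
192.168.1.2    proxmox02.example.local proxmox02
192.168.1.3    proxmox03.example.local proxmox03

2001:db8::1    proxmox01.example.local proxmox01
2001:db8::2    proxmox02.example.local proxmox02
2001:db8::3    proxmox03.example.local proxmox03
```

The significant part is that hostnames which previously resolved only to IPv4 now also resolve to IPv6, which is presumably what threw off the firewall's detection described below.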
It seems that pve-firewall tries to detect localnet, but failed to do so correctly. localnet should be 192.168.1.0/24, but instead it detected the IPv6 addresses. Which isn't entirely incorrect, but IPv6 is not used for clustering, so the firewall should open IPv4, not IPv6. So it seems that name resolution is used to determine localnet, rather than what corosync is actually using.

I fixed the current situation by adding the correct [ALIASES] in cluster.fw, and now all is well (except for the broken VMs that were running on two nodes at once and now have broken images).

So I think there are two issues here:

1. pve-firewall should better detect the IPs used for essential services.
2. ha-manager should not be able to start VMs when they are already running elsewhere.

Obviously, this is a faulty situation which causes unexpected results. Again, I'm not pointing fingers; I would like to discuss how we can improve these kinds of faulty situations. In the attachment, you can find a log with dpkg, pmxcfs and pve-ha-(lc)rm output from all nodes, so maybe someone can better assess what went wrong.

[1]: https://bugzilla.proxmox.com/show_bug.cgi?id=1823

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | [email protected]

_______________________________________________
pve-user mailing list
[email protected]
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
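For anyone hitting the same problem, the [ALIASES] workaround I applied looks like this. It pins the firewall's local_network to our actual IPv4 corosync network (192.168.1.0/24 in our case; substitute your own cluster network):

```
# /etc/pve/firewall/cluster.fw (excerpt)
# Override pve-firewall's auto-detected local network, which
# had picked up the IPv6 addresses instead of the IPv4
# network that corosync actually uses.
[ALIASES]

local_network 192.168.1.0/24
```

As far as I know, you can check what pve-firewall detected (and whether the override took effect) by running `pve-firewall localnet` on a node.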
