Re: [PVE-User] Cluster disaster

Thomas Lamprecht Thu, 10 Nov 2016 02:40:58 -0800

On 11/09/2016 11:46 PM, Dhaussy Alexandre wrote:

I had again another outage...
BUT now everything is back online ! yay !


So i think i had (at least) two problems :

1 - When installing/upgrading a node.

If the node sees all SAN storage LUNs before install, the Debian
partitioner tries to scan all LUNs..
This causes almost all nodes to reboot (not sure why, maybe it causes
latency in lvm cluster, or a problem with a lock somewhere..)

Same thing happens when f*$king os_prober spawns on kernel upgrade.
It scans all LVs and causes node reboots. So now i make sure of this in
/etc/default/grub => GRUB_DISABLE_OS_PROBER=true

Yes, OS_PROBER is _bad_ and may even corrupt some FS under some conditions, AFAIK.

The Proxmox VE iso does not have it for this reason.
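
For nodes installed some other way, the GRUB setting mentioned above can be applied in a repeatable way. This is only a minimal sketch for a POSIX shell with GNU sed; the function name disable_os_prober is made up for illustration, and the file path is passed in rather than hard-coded:

```shell
# Hypothetical helper: make sure GRUB_DISABLE_OS_PROBER=true is set in a
# GRUB defaults file (normally /etc/default/grub).
disable_os_prober() (
    file=$1
    if grep -q '^GRUB_DISABLE_OS_PROBER=' "$file"; then
        # An assignment already exists: rewrite it in place.
        sed -i 's/^GRUB_DISABLE_OS_PROBER=.*/GRUB_DISABLE_OS_PROBER=true/' "$file"
    else
        # No assignment yet: append one.
        printf '%s\n' 'GRUB_DISABLE_OS_PROBER=true' >> "$file"
    fi
)
```

On a real node this would presumably be called as "disable_os_prober /etc/default/grub" and followed by update-grub so the change takes effect.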


2 - There seems to be a bug in lrm.

Tonight i have seen timeouts in qmstarts in /var/log/pve/tasks/active.
Just after the timeouts, lrm was kind of stuck doing nothing.


If it's doing nothing it would be interesting to see in which state it is.

Because if it's already online and active, the watchdog must trigger if it is stuck for ~60 seconds or more.

Services began to start again after i restarted the service, anyway a
few seconds after, the nodes got fenced.


Hmm, this means the watchdog was already running out.

I think the timeouts are due to a bottleneck in our storage switches, i
have a few messages like this :

Nov  9 22:34:40 proxmoxt25 kernel: [ 5389.318716] qla2xxx
[0000:08:00.1]-801c:2: Abort command issued nexus=2:2:28 --  1 2002.
Nov  9 22:34:41 proxmoxt25 kernel: [ 5390.482259] qla2xxx
[0000:08:00.1]-801c:2: Abort command issued nexus=2:1:28 --  1 2002.

So when all nodes rebooted, i may have hit the bottleneck, then the lrm
bug, and all ha services were frozen... (happened several times.)

Yeah, I looked a bit through the logs of two of your nodes; it looks like the system hit quite a few bottlenecks. The CRM/LRM often run into 'loop took too long' errors, and the filesystem is also sometimes not writable.

Some of your logs contain huge retransmit lists from corosync.

Where does your cluster communication happen, not on the storage network?
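
To get a rough feel for how bad the retransmits are, one can count them in a saved log, for example one produced by the journalctl command quoted further down in this thread. A small sketch, assuming corosync logs these events with the text "Retransmit List"; the function name count_retransmits is hypothetical:

```shell
# Hypothetical helper: count corosync "Retransmit List" messages in a
# saved log file given as $1.
count_retransmits() (
    # grep -c prints the number of matching lines; "|| true" hides the
    # non-zero exit status grep returns when nothing matches.
    grep -c 'Retransmit List' "$1" || true
)
```

Comparing the count between nodes, or between a quiet period and the outage window, may hint at whether the cluster network was saturated.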


A few general hints:

The HA stack does not like it when somebody moves the config of a VM around while the VM is in the started/migrate state. If it's in the stopped state that's OK, as there it can fix up the VM location. Otherwise it cannot simply fix up the location, as it does not know whether the resource still runs on the (old) node.

Modifying the manager status does not work if a manager is currently elected. The manager reads it only on its transition from slave to manager, to get the last state into memory. After that it only writes it out, so that on a master re-election the new master has the most current state.


So if something as bad as this happens again I'd do the following:

If no master election happens, but there is a quorate partition of nodes and you are sure that their pve-ha-crm service is up and running (else restart it first), you can try to trigger an instant master re-election by deleting the old master's lock (which may not yet be invalidated through timeout):

rmdir /etc/pve/priv/lock/ha_manager_lock/

If a master election then happens you should be fine, and the HA stack will do its work and recover.

If you have to move the VMs you should disable them first; ha-manager disable SID also does that quite well in a lot of problematic situations, as it just edits resources.cfg. If this does not work, you have no quorum or pve-cluster has a problem, both of which mean HA recovery cannot take place on this node one way or the other.
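
The lock removal step above can be wrapped so that it reports what it did. This is just an illustrative sketch; the function name release_master_lock is invented here, and the lock path is passed as an argument instead of being built in:

```shell
# Hypothetical helper: remove the HA manager lock directory ($1) to
# trigger a new master election, and say what happened.
release_master_lock() (
    lock=$1
    if [ -d "$lock" ]; then
        # rmdir only removes an empty directory, same as the manual command.
        rmdir "$lock" && echo "lock removed, wait for a new master election"
    else
        echo "no lock directory at $lock, nothing to do"
    fi
)
```

It would be called as "release_master_lock /etc/pve/priv/lock/ha_manager_lock", and only on a node in the quorate partition whose pve-ha-crm service is running, as described above.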


Thanks again for the help.
Alexandre.

On 09/11/2016 at 20:54, Thomas Lamprecht wrote:


On 09.11.2016 18:05, Dhaussy Alexandre wrote:

I have done a cleanup of resources with echo "" >
/etc/pve/ha/resources.cfg

It seems to have resolved all problems with inconsistent status of
lrm/crm in the GUI.

Good. Logs would be interesting to see what went wrong but I do not
know if I can skim through them as your setup is not too small and there
may be much noise from the outage in there.

If you have time you may send me the log file(s) generated by:

journalctl --since "-2 days" -u corosync -u pve-ha-lrm -u pve-ha-crm -u pve-cluster > pve-log-$(hostname).log

(adapt the "-2 days" accordingly; it also understands something like
"-1 day 3 hours")

Send them directly to my address (the list does not accept bigger
attachments;
the limit is something like 20-20 kb AFAIK).
I cannot promise any deep examination, but I can skim through them and
look what happened in the HA stack, maybe I see something obvious.
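
When the same logs have to be gathered on many nodes, the journalctl call above can be put in a small function. A sketch only; the name collect_pve_logs and its two optional arguments are assumptions made for this example:

```shell
# Hypothetical wrapper around the journalctl call quoted above.
# $1: time window for --since (default "-2 days")
# $2: output file (default pve-log-<hostname>.log)
collect_pve_logs() (
    since=${1:--2 days}
    out=${2:-pve-log-$(hostname).log}
    journalctl --since "$since" \
        -u corosync -u pve-ha-lrm -u pve-ha-crm -u pve-cluster > "$out"
)
```

For example, collect_pve_logs "-1 day 3 hours" would write the shorter window to the default file name in the current directory.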

A new master has been elected. The manager_status file has been
cleaned up.
All nodes are idle or active.

I am re-starting all vms in ha with "ha-manager add".
Seems to work now... :-/

On 09/11/2016 at 17:40, Dhaussy Alexandre wrote:

Sorry my old message was too big...

Thanks for the input !...

I have attached manager_status files.
.old is the original file, and .new is the file i have modified and put
in /etc/pve/ha.

I know this is bad but here's what i've done :

- delnode on known NON-working nodes.
- rm -Rf /etc/pve/nodes/x for all NON-working nodes.
- replace all NON-working nodes with working nodes in
/etc/pve/ha/manager_status
- mv VM.conf files in the proper node directory
(/etc/pve/nodes/x/qemu-server/) in reference to
/etc/pve/ha/manager_status
- restart pve-ha-crm and pve-ha-lrm on all nodes

Now on several nodes i have those messages :

nov. 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup =>
wait_for_quorum
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
Noeud final de transport n'est pas connecté [Transport endpoint is not connected]
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
Connexion refusée [Connection refused]
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
Connexion refusée [Connection refused]


This means that something with the cluster filesystem (pve-cluster)
was not OK.
Those messages weren't there previously?

nov. 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup =>
wait_for_agent_lock
nov. 09 17:12:07 proxmoxt34 pve-ha-lrm[26282]: ipcc_send_rec failed:
Noeud final de transport n'est pas connecté [Transport endpoint is not connected]

We are also investigating a possible network problem..

Multicast properly working?

On 09/11/2016 at 17:00, Thomas Lamprecht wrote:

Hi,

On 09.11.2016 16:29, Dhaussy Alexandre wrote:

I tried to remove them from ha in the gui, but nothing happens.
There are some services in "error" or "fence" state.

Now i tried to remove the non-working nodes from the cluster... but i
still see those nodes in /etc/pve/ha/manager_status.

Can you post the manager status please?

Also, are pve-ha-lrm and pve-ha-crm up and running without any error
on all nodes, at least on those in the quorate partition?

check with:
systemctl status pve-ha-lrm
systemctl status pve-ha-crm

If not, restart them, and if it's still problematic then, please post the
output
of the systemctl status call (if it's the same on all nodes, one output
should be enough).

On 09/11/2016 at 16:13, Dietmar Maurer wrote:

I wanted to remove vms from HA and start the vms locally, but I
can’t even do
that (nothing happens.)

You can remove them from HA by emptying the HA resource file (this
deletes also
comments and group settings, but if you need to start them _now_ that
shouldn't be a problem)

echo "" > /etc/pve/ha/resources.cfg

Afterwards you should be able to start them manually.
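
Since emptying the file also drops comments and group settings, it may be worth keeping a copy first. A minimal sketch, with the function name empty_ha_resources and the explicit backup argument being assumptions for this example:

```shell
# Hypothetical helper: keep a copy of the HA resource file, then empty it.
# $1: resource file (normally /etc/pve/ha/resources.cfg)
# $2: where to store the backup copy
empty_ha_resources() (
    cfg=$1
    backup=$2
    # Do not touch the resource file if the backup cannot be written.
    cp -- "$cfg" "$backup" || exit 1
    # Same effect as the manual command: the file ends up with one empty line.
    echo "" > "$cfg"
)
```

A call could look like "empty_ha_resources /etc/pve/ha/resources.cfg /root/resources.cfg.bak", where the backup location is only a suggestion.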

How do you do that exactly (on the GUI)? You should be able to start
them
manually afterwards.

_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
