Re: [PVE-User] Cluster disaster

Dhaussy Alexandre Wed, 09 Nov 2016 14:47:39 -0800

I had yet another outage...
BUT now everything is back online! Yay!

So I think I had (at least) two problems:


1 - When installing/upgrading a node.

If the node can see all the SAN storage LUNs before the install, the
Debian partitioner tries to scan all of them.
This causes almost all nodes to reboot (not sure why, maybe it causes
latency in the LVM cluster, or a problem with a lock somewhere...)

The same thing happens when f*$king os_prober gets spawned on a kernel
upgrade. It scans all LVs and causes node reboots. So now I make sure
this is set in /etc/default/grub => GRUB_DISABLE_OS_PROBER=true
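
For reference, a minimal sketch of how that setting could be applied on a
node. The function name and the optional path argument are only
illustrative (not from the thread), and it assumes GNU sed as shipped with
Debian:

```shell
#!/bin/sh
# Sketch: make sure GRUB_DISABLE_OS_PROBER=true is present in a grub
# defaults file. The function name and path argument are hypothetical.
set_os_prober_disabled() {
    grub_defaults="${1:-/etc/default/grub}"
    if grep -q '^GRUB_DISABLE_OS_PROBER=' "$grub_defaults"; then
        # An uncommented setting exists: force its value to true.
        sed -i 's/^GRUB_DISABLE_OS_PROBER=.*/GRUB_DISABLE_OS_PROBER=true/' "$grub_defaults"
    else
        # No uncommented setting yet: append one.
        printf '%s\n' 'GRUB_DISABLE_OS_PROBER=true' >> "$grub_defaults"
    fi
}

# Usage (as root), then regenerate grub.cfg so the change takes effect:
#   set_os_prober_disabled && update-grub
```

The change only matters the next time grub.cfg is regenerated, so it has to
be in place before the kernel upgrade runs.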

2 - There seems to be a bug in lrm.

Tonight I saw timeouts on qmstart tasks in /var/log/pve/tasks/active.
Just after the timeouts, the lrm was kind of stuck doing nothing.
Services began to start again after I restarted the service, but a
few seconds later the nodes got fenced.

I think the timeouts are due to a bottleneck in our storage switches; I
have a few messages like this:

Nov  9 22:34:40 proxmoxt25 kernel: [ 5389.318716] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:2:28 --  1 2002.
Nov  9 22:34:41 proxmoxt25 kernel: [ 5390.482259] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:1:28 --  1 2002.
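
To get a feel for how often these aborts occur and whether they cluster on
particular paths, messages like the ones above could be tallied per host
and nexus value. This is only a sketch: the function name is hypothetical,
and it assumes the traditional syslog layout with one message per line and
the hostname in the fourth field.

```shell
#!/bin/sh
# Sketch: count qla2xxx "Abort command issued" messages per host and nexus.
# Reads kernel log lines on stdin; the function name is illustrative.
count_qla_aborts() {
    awk '
        /qla2xxx/ && /Abort command issued/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^nexus=/) {
                    # $4 is the hostname in the assumed syslog layout;
                    # substr() strips the leading "nexus=".
                    count[$4 " " substr($i, 7)]++
                }
            }
        }
        END {
            for (key in count) print key, count[key]
        }
    ' | sort
}

# Usage, for example:
#   journalctl -k | count_qla_aborts
```

Each output line has the form "host nexus count".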

So when all nodes rebooted, I may have hit the bottleneck, then the lrm
bug, and all HA services were frozen... (this happened several times.)


Thanks again for the help.
Alexandre.

On 09/11/2016 at 20:54, Thomas Lamprecht wrote:
>
>
> On 09.11.2016 18:05, Dhaussy Alexandre wrote:
>> I have cleaned up the resources with:
>> echo "" > /etc/pve/ha/resources.cfg
>>
>> It seems to have resolved all the problems with the inconsistent status
>> of lrm/crm in the GUI.
>>
>
> Good. Logs would be interesting to see what went wrong but I do not
> know if I can skim through them as your setup is not too small and there
> may be much noise from the outage in there.
>
> If you have time you may send me the log file(s) generated by:
>
> journalctl --since "-2 days" -u corosync -u pve-ha-lrm -u pve-ha-crm 
> -u pve-cluster  > pve-log-$(hostname).log
>
> (adapt the "-2 days" accordingly, it also understands something like
> "-1 day 3 hours")
>
> Send them directly to my address (the list does not accept bigger
> attachments, the limit is something like 20-20 kb AFAIK).
> I cannot promise any deep examination, but I can skim through them and
> look at what happened in the HA stack, maybe I will see something obvious.
>
>> A new master has been elected. The manager_status file has been
>> cleaned up.
>> All nodes are idle or active.
>>
>> I am restarting all VMs in HA with "ha-manager add".
>> Seems to work now... :-/
>>
>> On 09/11/2016 at 17:40, Dhaussy Alexandre wrote:
>>> Sorry my old message was too big...
>>>
>>> Thanks for the input !...
>>>
>>> I have attached the manager_status files.
>>> .old is the original file, and .new is the file I have modified and put
>>> in /etc/pve/ha.
>>>
>>> I know this is bad but here's what I've done:
>>>
>>> - delnode on known NON-working nodes.
>>> - rm -Rf /etc/pve/nodes/x for all NON-working nodes.
>>> - replace all NON-working nodes with working nodes in
>>> /etc/pve/ha/manager_status
>>> - mv VM.conf files into the proper node directory
>>> (/etc/pve/nodes/x/qemu-server/) according to
>>> /etc/pve/ha/manager_status
>>> - restart pve-ha-crm and pve-ha-lrm on all nodes
>>>
>>> Now on several nodes I have these messages:
>>>
>>> nov. 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup =>
>>> wait_for_quorum
>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>> Transport endpoint is not connected
>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>> Connection refused
>>> nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
>>> Connection refused
>>>
>
>
> This means that something with the cluster filesystem (pve-cluster) 
> was not OK.
> Those messages weren't there previously?
>
>
>>> nov. 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup =>
>>> wait_for_agent_lock
>>> nov. 09 17:12:07 proxmoxt34 pve-ha-lrm[26282]: ipcc_send_rec failed:
>>> Transport endpoint is not connected
>>>
>>> We are also investigating a possible network problem...
>>>
>
> Multicast properly working?
>
>
>>> On 09/11/2016 at 17:00, Thomas Lamprecht wrote:
>>>> Hi,
>>>>
>>>> On 09.11.2016 16:29, Dhaussy Alexandre wrote:
>>>>> I tried to remove them from HA in the GUI, but nothing happens.
>>>>> There are some services in "error" or "fence" state.
>>>>>
>>>>> Now I have tried to remove the non-working nodes from the cluster...
>>>>> but I still see those nodes in /etc/pve/ha/manager_status.
>>>> Can you post the manager status please?
>>>>
>>>> Also, are pve-ha-lrm and pve-ha-crm up and running without any error
>>>> on all nodes, at least on those in the quorate partition?
>>>>
>>>> check with:
>>>> systemctl status pve-ha-lrm
>>>> systemctl status pve-ha-crm
>>>>
>>>> If not, restart them, and if it's still problematic after that, please
>>>> post the output of the systemctl status call (if it's the same on all
>>>> nodes, one output should be enough).
>>>>
>>>>
>>>>> On 09/11/2016 at 16:13, Dietmar Maurer wrote:
>>>>>>> I wanted to remove vms from HA and start the vms locally, but I
>>>>>>> can’t even do
>>>>>>> that (nothing happens.)
>>>> You can remove them from HA by emptying the HA resource file (this
>>>> also deletes comments and group settings, but if you need to start
>>>> them _now_ that shouldn't be a problem)
>>>>
>>>> echo "" > /etc/pve/ha/resources.cfg
>>>>
>>>> Afterwards you should be able to start them manually.
>>>>
>>>>
>>>>>> How do you do that exactly (on the GUI)? You should be able to start
>>>>>> them
>>>>>> manually afterwards.
>>>>>>
>>>>> _______________________________________________
>>>>> pve-user mailing list
>>>>> [email protected]
>>>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>>>
