Re: [PVE-User] Cluster disaster

Dhaussy Alexandre Wed, 09 Nov 2016 09:06:31 -0800

I have cleaned up the HA resources with  echo "" > 
/etc/pve/ha/resources.cfg


It seems to have resolved all the problems with the inconsistent 
LRM/CRM status in the GUI.
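
For anyone repeating this, the cleanup could be sketched roughly as below. This is only a sketch, not something taken from the thread: the function name clear_ha_resources and the backup path are hypothetical, and taking a backup first is an added precaution, since emptying the file also drops comments and group settings.

```shell
#!/bin/sh
# Hypothetical helper: back up an HA resources file, then empty it.
#   $1 = resources file (on a PVE node: /etc/pve/ha/resources.cfg)
#   $2 = where to store the backup copy (any path of your choice)
clear_ha_resources() {
    cfg="$1"
    backup="$2"
    # Keep a copy so comments and group settings can be restored later.
    cp "$cfg" "$backup" || return 1
    # Same effect as the command used in this thread.
    echo "" > "$cfg"
}
```

It would be called as, for example, clear_ha_resources /etc/pve/ha/resources.cfg /root/resources.cfg.bak (the backup location is an assumption).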

A new master has been elected. The manager_status file has been 
cleaned up.
All nodes are idle or active.

I am restarting all VMs in HA with "ha-manager add".
Seems to work now... :-/
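
As an illustration of that last step, a minimal sketch that prints one "ha-manager add" command per VM. The helper name print_ha_add_cmds and the VMIDs are made up for the example. It only prints the commands, so they can be reviewed before being run on a node.

```shell
#!/bin/sh
# Hypothetical helper: print one "ha-manager add" command per VMID.
# Nothing is executed here; the output is meant to be reviewed first.
print_ha_add_cmds() {
    for vmid in "$@"; do
        printf 'ha-manager add vm:%s\n' "$vmid"
    done
}

# Example with made-up VMIDs:
print_ha_add_cmds 100 101
```

Once checked, the printed lines could be piped to sh on a cluster node.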

On 09/11/2016 at 17:40, Dhaussy Alexandre wrote:
> Sorry my old message was too big...
>
> Thanks for the input !...
>
> I have attached manager_status files.
> .old is the original file, and .new is the file I have modified and put
> in /etc/pve/ha.
>
> I know this is bad, but here's what I've done:
>
> - delnode on known NON-working nodes.
> - rm -Rf /etc/pve/nodes/x for all NON-working nodes.
> - replace all NON-working nodes with working nodes in
> /etc/pve/ha/manager_status
> - mv VM.conf files to the proper node directory
> (/etc/pve/nodes/x/qemu-server/) according to /etc/pve/ha/manager_status
> - restart pve-ha-crm and pve-ha-lrm on all nodes
>
> Now on several nodes I have these messages:
>
> Nov 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup =>
> wait_for_quorum
> Nov 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
> Transport endpoint is not connected
> Nov 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
> Connection refused
> Nov 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed:
> Connection refused
>
> Nov 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup =>
> wait_for_agent_lock
> Nov 09 17:12:07 proxmoxt34 pve-ha-lrm[26282]: ipcc_send_rec failed:
> Transport endpoint is not connected
>
> We are also investigating a possible network problem.
>
> On 09/11/2016 at 17:00, Thomas Lamprecht wrote:
>> Hi,
>>
>> On 09.11.2016 16:29, Dhaussy Alexandre wrote:
>>> I tried to remove them from HA in the GUI, but nothing happens.
>>> There are some services in "error" or "fence" state.
>>>
>>> I have now tried to remove the non-working nodes from the cluster... but I
>>> still see those nodes in /etc/pve/ha/manager_status.
>> Can you post the manager status please?
>>
>> Also, are pve-ha-lrm and pve-ha-crm up and running without any error
>> on all nodes, or at least on those in the quorate partition?
>>
>> check with:
>> systemctl status pve-ha-lrm
>> systemctl status pve-ha-crm
>>
>> If not, restart them, and if it is still problematic after that, please
>> post the output of the systemctl status call (if it is the same on all
>> nodes, one output should be enough).
>>
>>
>>> On 09/11/2016 at 16:13, Dietmar Maurer wrote:
>>>>> I wanted to remove vms from HA and start the vms locally, but I
>>>>> can’t even do
>>>>> that (nothing happens.)
>> You can remove them from HA by emptying the HA resource file (this
>> also deletes comments and group settings, but if you need to start
>> them _now_ that shouldn't be a problem)
>>
>> echo "" > /etc/pve/ha/resources.cfg
>>
>> Afterwards you should be able to start them manually.
>>
>>
>>>> How do you do that exactly (on the GUI)? You should be able to start
>>>> them
>>>> manually afterwards.
>>>>
>>> _______________________________________________
>>> pve-user mailing list
>>> [email protected]
>>> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
